This repository contains the necessary code to run the experiments in the paper "ExDDV: A New Dataset for Explainable Deepfake Detection in Video". If you use this dataset or code in your research, please cite the corresponding paper:
- Vlad Hondru, Eduard Hogea, Darian Onchis, Radu Tudor Ionescu. ExDDV: A New Dataset for Explainable Deepfake Detection in Video. arXiv preprint arXiv:2503.14421 (2025).
Bibtex:
```bibtex
@article{hondru2025exddv,
  title={ExDDV: A New Dataset for Explainable Deepfake Detection in Video},
  author={Hondru, Vlad and Hogea, Eduard and Onchis, Darian and Ionescu, Radu Tudor},
  journal={arXiv preprint arXiv:2503.14421},
  year={2025}
}
```
The csv file containing the annotations can be found in this repository. Use the `movie_name` and `dataset` columns to determine the input movie name.
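As a minimal illustration (not part of the released code), the snippet below assumes the annotation CSV is `dataset_last.csv` and that videos live under the `data/<dataset>/` layout described in the setup section further down; the notebooks additionally take the manipulation type into account when locating files.

```python
import os
import pandas as pd

# Load the annotation CSV shipped with this repository.
annotations = pd.read_csv("dataset_last.csv")

# Resolve a video path from the `dataset` and `movie_name` columns
# (illustrative only; the notebooks also use the manipulation type).
row = annotations.iloc[0]
video_path = os.path.join("data", row["dataset"], row["movie_name"])
print(video_path)
```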
There are three different folders inside the `src` folder, containing the three models we used: LAVIS (BLIP-2), LLaVA and Phi3-Vision-Finetune. These are forked from the following repositories: LAVIS, LLaVA and Phi3-Vision-Finetune.
Each model is trained as per its corresponding repository. We used the respective training scripts: BLIP-2, LLaVA and Phi3-Vision. The only modification needed is to replace the paths to the datasets in the corresponding files.
**Note for Phi3-Vision:** The file `processing_phi3_v.py` from HuggingFace Transformers must be replaced with the script from here.
The current implementation uses LLaVA 1.5, BLIP-2 and Phi3-Vision.

Note: The code for each model is provided in separate Jupyter notebooks:
- LLaVA: `llava_incontext.ipynb`
- Phi3-Vision: `phi_incontext.ipynb`
- BLIP-2: `blip_incontext.ipynb`
Each notebook implements the same in-context pipeline:
- **Extracting visual embeddings** from training and test video frames using a pre-trained vision model.
- **Constructing in-context prompts** by retrieving the top-k most similar training annotations (see the sketch after this list).
- **Analyzing deepfake artifacts** in test frames using a language-vision model (e.g., LLaVA).
- **Optionally applying spatial masks** to focus the analysis on specific facial regions.
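The sketch below illustrates the retrieval step under simplifying assumptions: it uses only the final CLIP embedding (the notebooks also support intermediate layers), and the function names `embed_frame` and `build_prompt` are ours, not the repository's.

```python
import clip  # https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN101", device=device)  # vision backbone used for retrieval

def embed_frame(frame_path: str) -> torch.Tensor:
    """Return a normalized CLIP embedding for a single video frame."""
    image = preprocess(Image.open(frame_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        emb = model.encode_image(image)
    return emb / emb.norm(dim=-1, keepdim=True)

def build_prompt(test_emb, train_embs, train_texts, k=3):
    """Retrieve the top-k most similar training annotations and build an in-context prompt."""
    sims = (train_embs @ test_emb.T).squeeze(1)  # cosine similarity (embeddings are normalized)
    topk = sims.topk(k).indices.tolist()
    examples = "\n".join(f"Example explanation: {train_texts[i]}" for i in topk)
    return examples + "\nDescribe the deepfake artifacts visible in the given frame."
```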
In more detail, the pipeline:
- Works with dataset names as used in the annotation CSV (e.g., "FaceForensics++").
- Locates movie files based on dataset and manipulation type.
- Extracts training and test frames listed in the CSV file.
- Uses a vision model (e.g., CLIP with RN101) with support for different extraction layers ("first", "middle", "last") to compute image embeddings.
- Generates custom prompts from training annotations using multiple prompt templates (different versions influence the level of detail in the response).
- Loads the model and evaluates test frames using the generated prompts.
- Applies optional hard or soft masks around keypoints (a masking sketch follows after this list).
- Compares test frame embeddings with training embeddings using cosine similarity.
- Retrieves the top-k training examples to form a contextual prompt.
- Saves detailed results to CSV files.
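As an illustration of the soft-mask option, here is a minimal NumPy sketch that attenuates the frame away from the annotated click locations with a Gaussian falloff; the exact masking used in the notebooks (hard vs. soft, radius, parsing of `click_locations`) may differ.

```python
import numpy as np

def apply_soft_mask(frame: np.ndarray, keypoints, sigma: float = 40.0) -> np.ndarray:
    """Keep regions near the annotated keypoints and fade out the rest (Gaussian falloff)."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    mask = np.zeros((h, w), dtype=np.float32)
    for x, y in keypoints:  # (x, y) click locations parsed from the CSV
        mask = np.maximum(mask, np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2)))
    return (frame.astype(np.float32) * mask[..., None]).astype(np.uint8)
```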
Requirements differ depending on the vision model used; we followed the setup instructions provided by the original authors. For LLaVA, for example, use the following steps:
- Clone the LLaVA repository and navigate to the LLaVA folder:

  ```bash
  git clone https://github.com/haotian-liu/LLaVA.git
  cd LLaVA
  ```

- Install the package in a Conda environment:

  ```bash
  conda create -n llava python=3.10 -y
  conda activate llava
  pip install --upgrade pip  # enable PEP 660 support
  pip install -e .
  ```
For other vision models (e.g., BLIP-2, Phi3-Vision), refer to their official installation instructions.
- **Dataset CSV:** Ensure your dataset CSV (e.g., `dataset_last.csv`) includes columns such as `movie_name`, `dataset`, `manipulation`, `movie_path`, `click_locations`, and `text`. The CSV should be split into training, validation (not used here), and test sets.
- **Video Files:** Place your video files in the `data/` folder. The directory structure should follow the dataset names (e.g., `data/Faceforensics++/`). A sanity-check sketch for this layout follows after this list.
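A quick sanity check of the CSV and folder layout might look like the sketch below. It assumes the column names listed above and paths built as `data/<dataset>/<movie_name>`; the actual notebooks may instead resolve paths via `movie_path` or the manipulation type.

```python
import os
import pandas as pd

df = pd.read_csv("dataset_last.csv")

# Verify that the expected annotation columns are present.
required = {"movie_name", "dataset", "manipulation", "movie_path", "click_locations", "text"}
missing = required - set(df.columns)
assert not missing, f"CSV is missing columns: {missing}"

# Warn about any annotated video that is not present under data/<dataset>/.
for _, row in df.iterrows():
    path = os.path.join("data", row["dataset"], row["movie_name"])
    if not os.path.isfile(path):
        print(f"Missing video: {path}")
```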
The pipeline outputs CSV files containing (an illustrative snippet follows after this list):
- Test video information (file path and frame number).
- Ground truth annotations.
- Contextual prompts from top-k training annotations.
- Cosine similarity scores.
- Deepfake analysis results.
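For illustration only (the column names and values below are hypothetical; each notebook defines its own schema), a results row could be written like this:

```python
import pandas as pd

# Hypothetical column names and placeholder values; the notebooks define their own schema.
rows = [{
    "movie_path": "data/FaceForensics++/example.mp4",  # test video file path
    "frame": 42,                                        # frame number
    "ground_truth": "...",                              # annotation text from the CSV
    "context_prompt": "...",                            # prompt built from top-k training annotations
    "cosine_similarity": 0.87,
    "deepfake_analysis": "...",                         # model output
}]
pd.DataFrame(rows).to_csv("results.csv", index=False)
```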