This document provides an overview of deepfake generation and detection. It begins with an introduction to the author and their background and research interests. The rest of the document is outlined as follows: definitions of deepfakes, various deepfake generation techniques including face synthesis, manipulation, reenactment and swapping, and an overview of deepfake detection methods including commonly used datasets, image-based and video-based detection approaches.
2. 2
About me
Education:
• 2009-2013: BS, University of Science, Vietnam National University
– Ho Chi Minh City.
• 2016-Now: Ph.D. candidate (Echizen Lab), The Graduate University
for Advanced Studies (SOKENDAI), in association with the National
Institute of Informatics, Japan.
Research topics: Machine learning, deepfake detection, biometrics.
Research Contributions:
• Reviewer: APSIPA, ICME, IEEE Access, IEEE TIFS, IEEE/CAA JAS.
• APSIPA 2020 Special Session Chair: Deep Generative Models for
Media Clones and Its Detection.
Huy H. Nguyen
nhhuy@nii.ac.jp
5. 5
1. What is Deepfake?
Deepfake / facial generation & manipulation:
• Entire face synthesis
• Attribute manipulation: hair, skin color, expression
• Facial reenactment
• Speaking manipulation
• Face swap
6. 6
1. Examples of Deepfake
Realistic images generated by StyleGAN 2
(Karras et al. 2020)
More examples can be found at
https://thispersondoesnotexist.com/
A French charity published a deepfake of
Trump saying 'AIDS is over'
Source: Euronews
7. 7
1. Deepfake’s threats
Breaking authentication systems
→ Identity theft
Chingovska et al. 2012
Pornography
Fraudulent /
Spying purpose
Image: The Verge
Spreading disinformation
Image: CNN
Breaking border control
Image: MIT Technology Review
Phony, blackmail
Image: Military Times
9. 9
2.1. Entire Face Synthesis
VAEs vs. GANs
StyleGAN / StyleGAN 2¹ (Karras et al. 2019/2020).
Uses a progressive training strategy and a style-based
image generation approach.
VQ-VAE 2 (Razavi et al. 2019).
Uses a multi-stage image generation strategy.
- Razavi, Ali, Aaron van den Oord, and Oriol Vinyals. "Generating diverse high-fidelity images with vq-vae-2." NeurIPS (2019).
- Karras, Tero, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." CVPR, pp. 4401-4410. 2019.
- Karras, Tero, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. "Analyzing and improving the image quality of stylegan." CVPR, pp. 8110-8119. 2020.
- 1 Demo can be found at: https://thispersondoesnotexist.com/
10. 10
2.2. Attribute Manipulation
ELEGANT (Xiao et al. 2018).
Exchanging latent encodings for
transferring multiple face attributes.
StarGAN (Choi et al. 2018).
Image-to-image translation for multiple domains.
- Choi, Yunjey, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. "Stargan: Unified generative adversarial networks for multi-domain image-to-image translation." CVPR, pp.
8789-8797. 2018.
- Xiao, Taihong, Jiapeng Hong, and Jinwen Ma. "ELEGANT: Exchanging latent encodings with GAN for transferring multiple face attributes." ECCV, pp. 168-184. 2018.
11. 11
2.3. Facial Reenactment
Video (attacker) + video (victim) → forged video
Face2Face (Thies et al. 2016).
Transfers the facial movements
of one person to another.
Deep Video Portraits (Kim et al. 2018).
An extension of Face2Face that
also transfers head poses.
- Thies, Justus, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. "Face2Face: Real-time face capture and reenactment of RGB videos." CVPR, pp. 2387-2395. 2016.
- Kim, Hyeongwoo, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Niessner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt. "Deep video
portraits." ACM Transactions on Graphics (TOG) 37, no. 4 (2018): 1-14.
12. 12
2.3. Facial Reenactment
Video (attacker) + video (victim) → forged video
Head2Head++
(Doukas et al. 2021)
- Doukas, Michail Christos, Mohammad Rami Koujan, Viktoriia Sharmanska, Anastasios Roussos, and Stefanos Zafeiriou. "Head2Head++: Deep Facial Attributes Re-Targeting." IEEE Transactions on
Biometrics, Behavior, and Identity Science 3, no. 1 (2021): 31-43.
- Thies, Justus, Michael Zollhöfer, and Matthias Nießner. "Deferred neural rendering: Image synthesis using neural textures." ACM Transactions on Graphics (TOG) 38, no. 4 (2019): 1-12.
NeuralTextures
(Thies et al. 2019)
13. 13
2.3. Facial Reenactment
Video (attacker) + image (victim) → forged video
Bringing Portraits to Life
(Averbuch-Elor et al. 2017)
ICFace
(Tripathy et al. 2020)
- Averbuch-Elor, Hadar, Daniel Cohen-Or, Johannes Kopf, and Michael F. Cohen. "Bringing portraits to life." ACM Transactions on Graphics (TOG) 36, no. 6 (2017): 1-13.
- Zakharov, Egor, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. "Few-shot adversarial learning of realistic neural talking head models." ICCV, pp. 9459-9468. 2019.
- Tripathy, Soumya, Juho Kannala, and Esa Rahtu. "ICFace: Interpretable and controllable face reenactment using GANs." WACV, pp. 3385-3394. 2020.
Neural Talking Head Models
(Zakharov et al. 2019)
14. 14
2.4. Speaking Manipulation
Synthesized speech (attacker) + image/video (victim) → forged video
Speech2Vid
(Jamaludin et al. 2019)
Synthesizing Obama
(Suwajanakorn et al. 2017)
- Jamaludin, Amir, Joon Son Chung, and Andrew Zisserman. "You said that?: Synthesising talking faces from audio." International Journal of Computer Vision 127, no. 11 (2019): 1767-1779.
- Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing obama: learning lip sync from audio." ACM Transactions on Graphics (ToG) 36, no. 4 (2017): 1-13.
15. 15
2.4. Speaking Manipulation
Modified text (attacker) + video (victim) → forged video
Text-based Editing of Talking-head Video
(Fried et al. 2019)
Fried, Ohad, Ayush Tewari, Michael Zollhöfer, Adam Finkelstein, Eli Shechtman, Dan B. Goldman, Kyle Genova, Zeyu Jin, Christian Theobalt, and Maneesh Agrawala. "Text-based editing of talking-head
video." ACM Transactions on Graphics (TOG) 38, no. 4 (2019): 1-14.
16. 16
2.5. Face Swap
Traditional (computer-graphics-based) face swap
- Bitouk, Dmitri, Neeraj Kumar, Samreen Dhillon, Peter Belhumeur, and Shree K. Nayar. "Face swapping: automatically replacing faces in photographs." SIGGRAPH 2008, pp. 1-8. 2008.
- 1 A course project from the Warsaw University of Technology. Access at https://github.com/MarekKowalski/FaceSwap
FaceSwap¹
(Kowalski 2016)
Face Swapping
(Bitouk et al. 2008)
17. 17
2.5. Face Swap
Deep-learning-based face swap
Original Deepfake (Faceswap)¹
Image: Alan Zucconi
Faceswap-GAN²
Image: shaoanlu
1 https://github.com/deepfakes/faceswap
2 https://github.com/shaoanlu/faceswap-GAN
20. 20
3. Overview of Deepfake Detection
Deepfake detection:
• Input: image/video frame, video
• Output: classification, segmentation
• Feature extraction: hand-crafted, automatic (deep learning), semi-automatic
• Architecture: single network, two-stream, multi-task learning, ensemble
21. 21
3.1. Datasets
Dataset | Year | #Real | #Fake | #Person | Manipulation methods
DF-TIMIT¹ | 2018 | 320 | 320 | 1 | Deepfake
UADFV² | 2018 | 49 | 49 | 1 | Deepfake
FaceForensics++ ³ | 2019 | 1,000 | 5,000 | 1 | Deepfake family, Face2Face, FaceSwap, NeuralTextures, FaceShifter
Google DFD⁴ | 2019 | 363 | 3,068 | 1 | Deepfake
Facebook DFDC⁵ | 2020 | 23,654 | 104,500 | ~1 | Various
Celeb-DF⁶ | 2020 | 590 | 5,639 | 1 | Deepfake
DeeperForensics⁷ | 2020 | 1,000 (from FF++) | 1,000 (raw) → 10,000 (aug.) | 1 | DeepFake-VAE
WildDeepfake⁸ | 2020 | - | 707 | 1 | No information
Face Forensics in the Wild (FFIW)⁹ | 2021 | 10,000 | 10,000 | 3.15 | DeepFaceLab, FaceSwap, FaceSwap-GAN
1 Korshunov, P. and Marcel, S., 2018. Deepfakes: a new threat to face recognition? assessment and detection. arXiv preprint arXiv:1812.08685.
2 Li, Yuezun, Ming-Ching Chang, and Siwei Lyu. "In ictu oculi: Exposing ai generated fake face videos by detecting eye blinking." WIFS. 2018.
3 Rossler, Andreas, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. "Faceforensics++: Learning to detect manipulated facial images." ICCV. 2019.
4 Google AI blog. Contributing data to deepfake detection research. Access at https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html. 2019
5 Dolhansky, Brian, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. "The deepfake detection challenge dataset." arXiv preprint arXiv:2006.07397 (2020).
6 Li, Yuezun, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. "Celeb-DF: A large-scale challenging dataset for deepfake forensics." CVPR. 2020.
7 Jiang, Liming, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. "Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection." CVPR. 2020.
8 Zi, Bojia, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. "WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection." ACM Multimedia. 2020.
9 Zhou, Tianfei, Wenguan Wang, Zhiyuan Liang, and Jianbing Shen. "Face Forensics in the Wild." CVPR. 2021.
[Figure panels: FaceForensics++, DFDC, DeeperForensics, Celeb-DF]
22. 22
3.2. Image-based Deepfake Detection
Uses hand-crafted residuals
to extract features, followed by
an ensemble classifier
(Fridrich and Kodovsky 2012).
- Fridrich, Jessica, and Jan Kodovsky. "Rich models for steganalysis of digital images." IEEE Transactions on Information Forensics and Security 7, no. 3 (2012): 868-882.
- Cozzolino, Davide, Giovanni Poggi, and Luisa Verdoliva. "Recasting residual-based local descriptors as convolutional neural networks: an application to image forgery detection." ACM Workshop on
Information Hiding and Multimedia Security, pp. 159-164. 2017.
- Bayar, Belhassen, and Matthew C. Stamm. "A deep learning approach to universal image manipulation detection using a new convolutional layer." ACM Workshop on Information Hiding and
Multimedia Security. 2016.
Reimplements Fridrich
and Kodovsky's work
as a CNN
(Cozzolino et al. 2017).
Proposes a new convolutional layer
in which the coefficients in the
green region sum to 1
(Bayar and Stamm 2016).
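The constraint on Bayar and Stamm's layer can be sketched in a few lines. This is a minimal illustration, assuming that the "green region" denotes all non-center coefficients and that the center coefficient is fixed at -1, so the filter acts as a prediction-error (residual) filter. The function name `constrain_filter` is hypothetical, and the projection is shown on a single 2D kernel rather than inside a training loop.

```python
def constrain_filter(kernel):
    """Project a square, odd-sized kernel onto the assumed constraint:
    center coefficient = -1, all other coefficients sum to 1."""
    size = len(kernel)
    c = size // 2  # index of the center coefficient
    off_center_sum = sum(
        kernel[i][j]
        for i in range(size)
        for j in range(size)
        if (i, j) != (c, c)
    )
    if off_center_sum == 0:
        raise ValueError("off-center coefficients sum to zero; cannot normalize")
    # Rescale every coefficient so the off-center ones sum to 1 ...
    constrained = [[v / off_center_sum for v in row] for row in kernel]
    # ... then overwrite the center with -1.
    constrained[c][c] = -1.0
    return constrained


# Usage: in a training loop, this projection would be re-applied to each
# first-layer filter after every weight update.
example = constrain_filter([[1.0, 2.0, 1.0], [2.0, 9.0, 2.0], [1.0, 2.0, 1.0]])
```

With this constraint the coefficients sum to zero overall, so flat image regions produce a zero response and the layer responds to local pixel dependencies rather than image content.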
23. 23
3.2. Image-based Deepfake Detection
CNN-based single network deepfake detectors
MesoNet (Afchar et al. 2018) is a
compact network using residual
blocks (He et al. 2016).
Applying transfer learning to XceptionNet
(Chollet 2017) for deepfake
detection (Rossler et al. 2019).
EfficientNet (Tan and Le 2019) is another
solid architecture for deepfake detection,
which achieved high scores in the DFDC
(Dolhansky et al. 2020).
- Afchar, Darius, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. "Mesonet: a compact facial video forgery detection network." WIFS. IEEE, 2018.
- He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." CVPR. 2016.
- Chollet, François. "Xception: Deep learning with depthwise separable convolutions." CVPR. 2017.
- Rossler, Andreas, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. "Faceforensics++: Learning to detect manipulated facial images." ICCV. 2019.
- Tan, Mingxing, and Quoc Le. "EfficientNet: Rethinking model scaling for convolutional neural networks." ICML. PMLR, 2019.
- Dolhansky, Brian, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. "The deepfake detection challenge dataset." arXiv preprint arXiv:2006.07397 (2020).
24. 24
3.2. Image-based Deepfake Detection
Two-stream network deepfake detectors
A two-stream network: one branch takes RGB
input, while the other takes steganalysis features
and is trained with a triplet loss (Zhou et al. 2017).
- Zhou, Peng, Xintong Han, Vlad I. Morariu, and Larry S. Davis. "Two-stream neural networks for tampered face detection." CVPRW. IEEE, 2017.
- Qian, Yuyang, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. "Thinking in frequency: Face forgery detection by mining frequency-aware clues." ECCV. Springer, Cham, 2020.
The two-stream F3-Net with the frequency-aware
decomposition (FAD) branch and the local frequency
statistics (LFS) branch (Qian et al. 2020).
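The idea behind a frequency-aware decomposition can be illustrated by splitting a signal into frequency bands in the DCT domain and transforming each band back. The sketch below is a simplified stand-in, not F3-Net itself: it works on a 1D signal with fixed band edges, whereas the cited work operates on images. The names `dct`, `idct`, and `split_bands` and the band edges are assumptions made for illustration.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1D signal."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        ))
    return out


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def split_bands(signal, edges):
    """Split a signal into band-limited components.

    `edges` are coefficient indices separating the bands, e.g. (2, 5)
    gives bands [0, 2), [2, 5), and [5, n).
    """
    coeffs = dct(signal)
    bounds = [0, *edges, len(coeffs)]
    bands = []
    for lo, hi in zip(bounds, bounds[1:]):
        # Keep only the coefficients of this band, zero the rest.
        masked = [c if lo <= k < hi else 0.0 for k, c in enumerate(coeffs)]
        bands.append(idct(masked))
    return bands


# Usage: the low, middle, and high bands add back up to the input signal.
low, mid, high = split_bands([1.0, 2.0, 4.0, 3.0, 0.0, 5.0, 2.0, 1.0], (2, 5))
```

Each band could then be fed to a detector as a separate input channel, which is the general intuition behind mining frequency-aware clues.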
25. 25
3.2. Image-based Deepfake Detection
Complex network deepfake detectors
[Figure: Capsule-Forensics architecture. Feature extractor → primary capsules (each: 2D conv 3×3 stride 1 + batch norm + ReLU, repeated twice; stats pooling; 1D conv 5×1 stride 2 + batch norm; 1D conv 3×1 stride 1 + batch norm) → dynamic routing → output capsules (real capsule and fake capsule, 4×1 vectors) → softmax → mean → final output.]
Capsule-Forensics (Nguyen et al. 2019), a detector based on
capsule networks (Sabour et al. 2017), with
statistical pooling layers (Rahmouni et al. 2017) used
by the primary capsules.
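A statistical pooling layer replaces each feature map with a few summary statistics. The sketch below assumes the statistics are the mean and the variance of each map, and it uses the population variance for simplicity; the exact normalization in the cited papers may differ. The function name `stats_pooling` and the nested-list layout (channels, rows, columns) are assumptions for illustration.

```python
def stats_pooling(feature_maps):
    """Reduce each 2D feature map to a (mean, variance) pair.

    `feature_maps` is a list of channels, each a list of rows of floats.
    The variance here is the population variance (divide by H*W).
    """
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        pooled.append((mean, variance))
    return pooled


# Usage: two 2x2 feature maps become two (mean, variance) pairs,
# regardless of the spatial size of the input.
pooled = stats_pooling([
    [[1.0, 3.0], [5.0, 7.0]],
    [[2.0, 2.0], [2.0, 2.0]],
])
```

Because the output size depends only on the number of channels, the layers that follow can work with inputs of different spatial resolutions.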
- Dang, Hao, Feng Liu, Joel Stehouwer, Xiaoming Liu, and Anil K. Jain. "On the detection of digital face manipulation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 5781-5790. 2020.
- Sabour, Sara, Nicholas Frosst, and Geoffrey E. Hinton. "Dynamic routing between capsules." NIPS. 2017.
- Nguyen, Huy H., Junichi Yamagishi, and Isao Echizen. "Capsule-forensics: Using capsule networks to detect forged images and videos." ICASSP. IEEE, 2019.
- Rahmouni, Nicolas, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. "Distinguishing computer graphics from natural images using convolution neural networks." WIFS. IEEE, 2017.
A two-stream attention-based deepfake detector: one
branch uses a manipulation appearance model (MAM),
the other uses direct regression (Dang et al. 2020).
26. 26
3.3. Image-based Deepfake Segmentation
Multi-task learning & active learning to improve generalization
[Figure: multi-task network. Encoder → latent (label in [0, 1], activation selection) → shared decoder → reconstruction branch (reconstructed image) and segmentation branch.]
Multi-task learning combining
detection, segmentation and
reconstruction (Nguyen et al. 2019).
- Nguyen, Huy H., Fuming Fang, Junichi Yamagishi, and Isao Echizen. "Multi-task learning for detecting and segmenting manipulated facial images and videos." BTAS. IEEE, 2019.
- Du, Mengnan, Shiva Pentyala, Yuening Li, and Xia Hu. "Towards Generalizable Deepfake Detection with Locality-aware AutoEncoder." International Conference on Information & Knowledge
Management. 2020.
Locality-aware autoencoder with attention
loss and active learning (Du et al. 2020).
27. 27
3.3. Image-based Deepfake Segmentation
- Wang, Sheng-Yu, Oliver Wang, Andrew Owens, Richard Zhang, and Alexei A. Efros. "Detecting photoshopped faces by scripting photoshop." ICCV. 2019.
- Li, Lingzhi, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. "Face x-ray for more general face forgery detection." CVPR. 2020.
- Chai, Lucy, David Bau, Ser-Nam Lim, and Phillip Isola. "What makes fake images detectable? Understanding properties that generalize." ECCV. Springer, Cham, 2020.
Face X-ray focusing on blending
area instead of manipulated area
(Li et al. 2020).
Using patch classifier to generate
heatmap (Chai et al. 2020).
Using dilated residual network
(DRN) to detect photoshopped
region (Wang et al. 2019).
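The blending-area target used by Face X-ray can be written down compactly. As described in Li et al. 2020, given a soft blending mask M with values in [0, 1], the face X-ray is B = 4 · M · (1 − M), which is zero inside and outside the swapped region and peaks where the two images are mixed. The sketch below applies this per pixel to a nested-list mask; the function name `face_xray` is an assumption for illustration.

```python
def face_xray(mask):
    """Compute B = 4 * M * (1 - M) per pixel for a soft blending mask M.

    `mask` is a list of rows with values in [0, 1]: 0 means the pixel
    comes from the background image, 1 means it comes from the swapped
    face, and values in between mark the blending boundary.
    """
    for row in mask:
        for m in row:
            if not 0.0 <= m <= 1.0:
                raise ValueError("mask values must lie in [0, 1]")
    return [[4.0 * m * (1.0 - m) for m in row] for row in mask]


# Usage: the response is highest where the mask is 0.5, i.e. on the
# blending boundary, and vanishes for a real image (an all-zero mask).
boundary = face_xray([[0.0, 0.25, 0.5, 0.75, 1.0]])
```

A detector trained to predict this map has to look for the blending boundary itself, which does not depend on which face-swap method produced the image.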
28. 28
3.4. Video-based Deepfake Detection
- Li, Yuezun, Ming-Ching Chang, and Siwei Lyu. "In Ictu Oculi: Exposing AI generated fake face videos by detecting eye blinking." WIFS. 2018.
- Agarwal, Shruti, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. "Protecting World Leaders Against Deep Fakes." CVPRW. 2019.
- Ciftci, Umur Aybars, Ilke Demir, and Lijun Yin. "Fakecatcher: Detection of synthetic portrait videos using biological signals." IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
Biologically inspired approaches
Detecting eye blinking
(Li et al. 2018).
Modeling facial expression movements
(Agarwal et al. 2019).
Using photoplethysmography
(PPG) (Ciftci et al. 2020).
29. 29
3.4. Video-based Deepfake Detection
- Sabir, Ekraam, Jiaxin Cheng, Ayush Jaiswal, Wael AbdAlmageed, Iacopo Masi, and Prem Natarajan. "Recurrent convolutional strategies for face manipulation detection in videos." CVPRW. 2019.
- Zhou, Tianfei, Wenguan Wang, Zhiyuan Liang, and Jianbing Shen. "Face Forensics in the Wild." CVPR. 2021.
Automatic feature extraction approaches
Multi-person deepfake detection using multi-temporal-scale
instance feature aggregation
and bag feature aggregation (Zhou et al. 2021).
Automatic feature extraction in both the spatial
and temporal domains (Sabir et al. 2019).
30. 30
3.5. Generalizability
Cross-domain deepfake detection is still challenging!
Performance of several detectors (trained on the
FaceForensics++ dataset) on the Google DFD
dataset. Although they achieve high accuracy
(over 90%) on the FaceForensics++ dataset, they
still struggle with the domain mismatch issue.
[Figure: accuracy (%) versus inference time (s) for Capsule-Forensics (VGG-19), Capsule-Forensics (ResNet-50), Capsule-Forensics (XceptionNet FT), feature aggregation (VGG-19), feature aggregation (ResNet-50), multi-task learning, XceptionNet, and EfficientNet-B4. Accuracy axis: 35 to 80; inference-time axis: 0.03 to 0.12.]
Correlation between the scores of several detectors on
the public and private datasets of the DFDC¹. Many
detectors struggle with the domain mismatch issue.
1 Image obtained from https://www.facebook.com/mediaforensics2020/videos/1640779116079742/
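The domain mismatch discussed on this slide can be quantified as the drop in accuracy between a test set from the training domain and one from an unseen domain. The sketch below is a minimal illustration with made-up labels. The function names `accuracy` and `generalization_gap` are hypothetical, and the numbers are not results from any dataset or detector.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def generalization_gap(in_domain, cross_domain):
    """Accuracy on the training domain minus accuracy on an unseen domain.

    Each argument is a (predictions, labels) pair; 1 = fake, 0 = real.
    """
    return accuracy(*in_domain) - accuracy(*cross_domain)


# Usage with toy values (illustrative only): the detector is perfect on
# its own domain but labels everything as fake on the unseen one.
gap = generalization_gap(
    in_domain=([1, 0, 1, 1], [1, 0, 1, 1]),
    cross_domain=([1, 1, 1, 1], [1, 0, 0, 1]),
)
```

A gap close to zero would indicate that a detector generalizes across domains; a large positive gap reflects the mismatch described above.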
32. 32
4. Discussion
Does “deepfake” have good applications?
→ Fast and easy content creation and editing
• Synthetic Media: https://www.syntheticmedialandscape.com/
• Synthesia STUDIO: https://www.synthesia.io/
33. 33
4. Discussion
Potential/challenging topics in deepfake detection:
• Low-quality input deepfake detection
• Cross-domain deepfake detection
• Online learning
• Explainable AI: Result explanation, finding/reconstructing original
images/videos
→ Deepfake detection in the wild (real-world applications)
34. 34
5. References
Some nice survey papers:
• Verdoliva, Luisa. "Media forensics and deepfakes: an overview." IEEE Journal of Selected
Topics in Signal Processing 14, no. 5 (2020): 910-932.
• Tolosana, Ruben, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and Javier
Ortega-Garcia. "Deepfakes and beyond: A Survey of face manipulation and fake
detection." Information Fusion 64 (2020): 131-148.