By Taiba Majid Wani, Sapienza University of Rome, Italy, majid@diag.uniroma1.it | Syed Asif Ahmad Qadri, National Tsing Hua University, Taiwan, syedasif@m110.nthu.edu.tw | Farooq Ahmad Wani, Sapienza University of Rome, Italy, wani@diag.uniroma1.it | Irene Amerini, Sapienza University of Rome, Italy, amerini@diag.uniroma1.it
The rise of audio deepfakes presents a significant security threat that undermines trust in digital communications and media. These synthetic audio technologies can convincingly mimic a person’s voice, enabling malicious activities like impersonation, fraud, and misinformation. Addressing this growing threat requires robust detection systems to ensure the authenticity of digital content.
In this survey, we provide a comprehensive analysis of state-of-the-art techniques in audio deepfake generation and detection. We examine the principal methods used to generate audio deepfakes, including Text-to-Speech (TTS) and Voice Conversion (VC) technologies, and discuss their capabilities in producing highly realistic synthetic audio. On the detection front, we explore a wide range of approaches, encompassing traditional machine learning and deep learning models for feature extraction and classification. We also emphasize the importance of publicly available datasets for training and evaluating these models, highlighting their role in advancing detection capabilities.
Additionally, we discuss the integration of audio and video deepfake detection systems, which provides a more comprehensive defense against sophisticated attacks. This survey critically assesses existing methods and datasets, highlighting challenges such as the high realism of deepfakes, limited data diversity, and the need for models that generalize well. It aims to guide future research in enhancing detection and safeguarding digital media integrity.