Deepfakes pose significant challenges to forensic phonetics, undermining citizen security and trust in digital media. Understanding the human ability to distinguish synthetic audio from authentic audio is therefore crucial to addressing this growing threat.
Using PsychoPy, we conducted a perceptual experiment in which participants classified real and fake audio samples. The test featured Spanish and Japanese stimuli distributed to native speakers of each language to examine the impact of language knowledge on performance; Müller et al. (2022) and Mai et al. (2023) have explored this variable, and we aim to compare their results with our findings. Additionally, this study evaluates how speaking style (spontaneous speech vs. text reading) and familiarity with the speaker's voice affect performance.
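The experiment script itself is not reproduced here; the following is only a minimal sketch of how one such classification trial could be implemented in PsychoPy, assuming keyboard responses ('r' for real, 'f' for fake) and an illustrative stimulus file name:

```python
# Minimal sketch of one real/fake classification trial in PsychoPy.
# The actual experiment script is not published; keys, prompt text,
# and file names below are illustrative assumptions.
from psychopy import core, event, sound, visual

win = visual.Window(fullscr=False, color="black")
prompt = visual.TextStim(win, text="Real (r) or Fake (f)?", color="white")

def run_trial(wav_path):
    """Play one 8-10 s stimulus, then collect a real/fake keypress."""
    stim = sound.Sound(wav_path)
    stim.play()
    core.wait(stim.getDuration())  # let the clip play to the end
    prompt.draw()
    win.flip()
    clock = core.Clock()
    keys = event.waitKeys(keyList=["r", "f"], timeStamped=clock)
    key, rt = keys[0]
    return {"file": wav_path, "response": key, "rt": rt}

result = run_trial("stimuli/es_spontaneous_01.wav")  # hypothetical file
print(result)
win.close()
core.quit()
```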
The experiment comprises 80 voice samples (8–10 seconds per stimulus), 50% real and 50% fake. For the real spontaneous speech samples, we selected 10 Spanish stimuli from VoxCeleb-ESP (Labrador et al., 2023) and 10 Japanese stimuli from EACELEB (Caulley, Yang, & Anderson, 2022). The 20 real text-reading samples, in Spanish and Japanese, were sourced from LibriVox and YouTube audiobooks. These 40 real stimuli (spontaneous and text reading) were then cloned with ElevenLabs to generate their synthetic counterparts.
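As a rough illustration of this design, the sketch below reconstructs the 80-item stimulus pool as a balanced 2 languages × 2 speaking styles × real/fake grid with 10 items per cell; the per-language split of the text-reading samples, the file names, and the directory layout are assumptions, not details reported in the study:

```python
import random

# Illustrative reconstruction of the 80-item stimulus pool described above
# (2 languages x 2 speaking styles x real/fake x 10 items per cell).
LANGS = ["es", "ja"]                 # Spanish, Japanese
STYLES = ["spontaneous", "reading"]  # spontaneous speech vs. text reading

manifest = [
    {"file": f"stimuli/{lang}_{style}_{cond}_{i:02d}.wav",  # hypothetical paths
     "language": lang, "style": style, "condition": cond}
    for lang in LANGS
    for style in STYLES
    for cond in ["real", "fake"]     # fakes are clones of the real stimuli
    for i in range(1, 11)
]

assert len(manifest) == 80           # 50% real, 50% fake by construction
random.shuffle(manifest)             # randomize presentation order
```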