Can I Hear Your Face? Pervasive Attack on Voice Authentication Systems with a Single Face Image
Published in 33rd USENIX Security Symposium (USENIX Security 24), 2024
Jiang, Nan, Bangjie Sun, Terence Sim, and Jun Han. “Can I Hear Your Face? Pervasive Attack on Voice Authentication Systems with a Single Face Image.” In 33rd USENIX Security Symposium (USENIX Security 24), pp. 1045-1062. 2024. Acceptance Rate: 18.3% (417 of 2276). [paper]
Abstract
We present Foice, a novel deepfake attack against voice authentication systems. Foice generates a synthetic voice of the victim from just a single image of the victim's face, without requiring any voice sample. This synthetic voice is realistic enough to fool commercial authentication systems. Since face images are generally easier to obtain than voice samples, Foice substantially lowers the effort required to mount large-scale attacks. The key idea lies in learning the partial correlation between face and voice features and adding to it a face-independent voice feature sampled from a Gaussian distribution. We demonstrate the effectiveness of Foice with a comprehensive set of real-world experiments involving ten offline participants and an online dataset of 1029 unique individuals. By evaluating eight state-of-the-art systems, including WeChat's Voiceprint and Microsoft Azure, we show that all of these systems are vulnerable to the Foice attack.
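To make the key idea concrete, here is a minimal PyTorch-style sketch of the decomposition described above: a learned mapping predicts the face-correlated part of a voice embedding, and a face-independent residual is drawn from a standard Gaussian. All module names, layer sizes, and dimensions (`FaceToVoiceEmbedding`, `face_dim`, `voice_dim`) are illustrative assumptions for this sketch, not the paper's actual architecture.

```python
# Hypothetical sketch, not the paper's code: combine a learned
# face-dependent voice component with a Gaussian-sampled
# face-independent component to form a voice embedding.
import torch
import torch.nn as nn


class FaceToVoiceEmbedding(nn.Module):
    def __init__(self, face_dim: int = 512, voice_dim: int = 256):
        super().__init__()
        # Learns the "partial correlation" between face and voice
        # features mentioned in the abstract.
        self.face_to_voice = nn.Sequential(
            nn.Linear(face_dim, 512),
            nn.ReLU(),
            nn.Linear(512, voice_dim),
        )
        self.voice_dim = voice_dim

    def forward(self, face_feat: torch.Tensor) -> torch.Tensor:
        # Face-dependent component, predicted from the face features.
        dependent = self.face_to_voice(face_feat)
        # Face-independent component: sampled from a standard Gaussian,
        # standing in for voice traits the face does not determine.
        independent = torch.randn(
            face_feat.size(0), self.voice_dim, device=face_feat.device
        )
        # The combined embedding would then condition a downstream
        # speech-synthesis pipeline to produce the synthetic voice.
        return dependent + independent


# Usage: one face feature vector in, one voice embedding out.
model = FaceToVoiceEmbedding()
face_feat = torch.randn(1, 512)  # stand-in for a face encoder's output
voice_embedding = model(face_feat)
print(voice_embedding.shape)  # torch.Size([1, 256])
```

Sampling the residual at inference time means each attack attempt can yield a slightly different but plausible voice, which is consistent with the abstract's claim that no voice sample of the victim is needed.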