Researchers at the University of Chicago, led by Emily Wenger, have demonstrated that neural networks for synthesizing the human voice can fool speaker recognition systems and even human listeners.

Today, many systems identify users by voice: Yandex smart speakers, for example, recognize their owner's voice, and WeChat lets you log into your account by saying a set phrase. The developers of these and other services assume that a person's voice is unique and is therefore a reliable way to verify identity.

But voice synthesis systems are improving quickly. The authors of the new work set out to test how well such an algorithm can mimic a specific person's timbre and intonation. They assumed an attacker who has access to samples of the victim's voice in the form of publicly available audio or video recordings, or who can talk to the victim in person and record their speech.

The total length of the voice samples needed is no more than five minutes. From this data, the attacker then trains the algorithm to reproduce the victim's voice. Only publicly available algorithms were considered; the authors chose two: SV2TTS and AutoVC. To train the models, the authors used speech recordings of 90 people from three public datasets: VCTK, LibriSpeech, and SpeechAccent.
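A pipeline like SV2TTS works in three stages: a speaker encoder distills the victim's recordings into a fixed-size embedding, a synthesizer generates a mel spectrogram for attacker-chosen text conditioned on that embedding, and a vocoder turns the spectrogram into a waveform. Below is a minimal sketch of this flow based on the open-source SV2TTS implementation (CorentinJ's Real-Time-Voice-Cloning); the checkpoint paths and input file name are assumptions and vary between versions of that repository.

```python
from pathlib import Path

import numpy as np
import soundfile as sf

# Modules from https://github.com/CorentinJ/Real-Time-Voice-Cloning (SV2TTS);
# the checkpoint locations below are assumptions that depend on the repo version.
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

encoder.load_model(Path("saved_models/default/encoder.pt"))
synthesizer = Synthesizer(Path("saved_models/default/synthesizer.pt"))
vocoder.load_model(Path("saved_models/default/vocoder.pt"))

# Stage 1: embed the victim's voice from a short recording
# (the paper assumes no more than five minutes of audio in total).
wav = encoder.preprocess_wav(Path("victim_sample.wav"))  # hypothetical file
speaker_embedding = encoder.embed_utterance(wav)

# Stage 2: synthesize a mel spectrogram for attacker-chosen text,
# conditioned on the victim's embedding.
text = "My voice is my password."
specs = synthesizer.synthesize_spectrograms([text], [speaker_embedding])

# Stage 3: vocode the spectrogram into an audible waveform and save it.
generated = vocoder.infer_waveform(specs[0])
sf.write("cloned.wav", generated.astype(np.float32), synthesizer.sample_rate)
```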

In testing, the combination of the SV2TTS model and the VCTK dataset proved the most effective: against the open-source Resemblyzer speaker-verification tool, 50.5 ± 13.4% of the attacks succeeded, and against Microsoft Azure's speaker verification service, 29.5 ± 32%.
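Resemblyzer, one of the attacked systems, is an open-source speaker encoder: it maps an utterance to a fixed-size, L2-normalized embedding, and verification reduces to comparing the similarity between an enrollment embedding and a test embedding against a threshold. Below is a minimal sketch of such a check; the file names and the 0.8 threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# Enroll the legitimate user from a genuine recording,
# then score a (possibly synthesized) test utterance against it.
enrolled = encoder.embed_utterance(preprocess_wav("real_voice.wav"))
probe = encoder.embed_utterance(preprocess_wav("cloned.wav"))

# Resemblyzer embeddings are L2-normalized, so the dot product
# is the cosine similarity between the two voices.
similarity = float(np.dot(enrolled, probe))

THRESHOLD = 0.8  # illustrative; real deployments tune this value
print(f"similarity = {similarity:.3f}")
print("accepted" if similarity > THRESHOLD else "rejected")
```

An attack counts as successful when the synthesized recording's similarity to the enrolled voice clears the acceptance threshold.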

To test WeChat and Amazon's Alexa voice assistant, the researchers recruited 14 volunteers: they first trained the model on each volunteer's voice and then fed the synthesized recordings to the systems. The cloned voices of 9 of the 14 participants were able to log into WeChat, and Alexa was eventually fooled by all of them.

In addition, when listening to the algorithm's output, people could distinguish a real voice from a fake one only about 50% of the time, no better than random guessing.