“We train our models on a huge dataset with thousands of speakers,” Jose Sotelo, a team member at Lyrebird and a speech synthesis expert, told Gizmodo. “Then, for a new speaker we compress their information in a small key that contains their voice DNA. We use this key to say new sentences.”
The end result is far from perfect—the samples still exhibit digital artifacts, clarity problems, and other weirdness—but there’s little doubt who is being imitated by the speech generator. Changes in intonation are also discernible. Unlike other systems, Lyrebird’s solution requires less data per speaker to produce a new voice, and it works in real time. The company plans to offer its tool to companies in need of speech synthesis solutions.
“We take seriously the potential malicious applications of our technology,” Sotelo told Gizmodo. “We want this technology to be used for good purposes: giving back the voice to people who lost it to sickness, being able to record yourself at different stages in your life and hearing your voice later on, etc. Since this technology could be developed by other groups with malicious purposes, we believe that the right thing to do is to make it public and well-known so we stop relying on audio recordings [as evidence].”