Synthetic Speakers Trained From Just Three Seconds of Speech
While text-to-speech has been around since before the Internet, it has taken some time to actually reproduce a real speaker's voice. Now, three seconds of speech is all it takes.
Computer-based text-to-speech (TTS) synthesis dates back to the late 1950s, and by 1961 researchers at Bell Labs had a computer sing "Daisy Bell." But TTS is not just a substitute for reading text aloud. Speech is so important to us as humans that a monotonous, electronic voice is not enough. Synthetic speech technology has therefore been on the rise, learning to capture different prosodies, emotions, and even the vocal qualities of a specific human speaker.
Since the 2010s, advances in deep learning have led to significant improvements in TTS synthesis, improving not only the realism of the synthetic speaker but also the expressiveness and emotional range of the synthesized speech. While synthetic voices in media are nothing new, only recently have they been used as a full substitute for a voice actor: James Earl Jones's Darth Vader lines in the recent Star Wars series Obi-Wan Kenobi were synthesized from archival recordings of the actor's earlier performances.
Only last year, state-of-the-art synthetic speech generators required at least several spoken lines, if not hours' worth of speech, before they could produce anything close to a realistic replica. That three seconds is now enough for a high-quality result is shocking. Someone passing you on the street with a microphone could record you saying a single sentence and have enough data to create a synthetic speaker that may even outlive you.
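To give a sense of how low the barrier already is, here is a rough sketch of voice cloning with the open-source Coqui TTS library and its XTTS model. This is not the specific system described above, the file names and text are placeholders, and in practice such models generally want a few seconds of clean reference audio rather than a noisy street recording:

```python
# Illustrative sketch only: clone a voice from a short reference clip
# using the open-source Coqui TTS library (pip install TTS).
# "my_short_clip.wav" is a placeholder for a brief recording of the target speaker.
from TTS.api import TTS

# Download and load a publicly available zero-shot voice-cloning model.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Speak arbitrary new text in the voice of the reference clip.
tts.tts_to_file(
    text="I never actually said this sentence out loud.",
    speaker_wav="my_short_clip.wav",  # short recording of the target speaker
    language="en",
    file_path="cloned_voice.wav",
)
```

A few lines of code and a few seconds of audio: that is the whole pipeline from the end user's point of view.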
This raises an important question: what rights does someone have to a synthetic voice trained on their speech? A person may have legal rights in their voice under laws related to privacy, publicity, and intellectual property, and consent is a key consideration in such matters. In the USA, the right of publicity gives individuals the right to control the commercial use of their name, image, and likeness, which could potentially extend to the use of their voice in a synthetic voice product. In addition, a recording of a person's voice may be protected by copyright once it is fixed in a tangible form; the person who created the recording is generally the copyright owner, with the exclusive right to reproduce, distribute, and sell it. All of this brings to mind the recent controversies around AI art and IP infringement.
In terms of commercialisation, it's interesting to consider how this might affect media production. Instead of all the time and money spent hiring actors and speakers, recording their lines, and doing redubs or retakes later in production, one can simply type in the lines and have a synthetic voice actor deliver them, provided the real actor's voice is used to train the synthetic one only with their permission. Of course, we wouldn't expect such a synthetic speaker to win any Oscars, Emmys or Annie Awards, but the technology is still extremely disruptive, and a major concern for actors in general, since much of their work can now be outsourced.

On the other hand, actors can now distribute their voices faster than uploading an Instagram selfie. Smaller-scale productions with lower budgets could leverage talent that was previously reserved for Hollywood blockbusters, and lesser-known actors could benefit too: instead of auditioning, travelling, or spending time recording lines, they could simply share a voice snippet and receive a commission whenever it gets used. For an individual project this may not be much, but across several projects it could add up to a reasonable income, as well as a portfolio they can share for branding and self-promotion. In theory.
Of course, as this technology gets better, so do the risks. Researchers and artists have already produced deepfake videos of former President Barack Obama, in which synthesized audio and video create the appearance that he is saying things he never actually said. Even if there are methods for distinguishing real from synthetic speech, a video or recording spreading virally may be enough to disrupt or swing an election or campaign, throwing a spanner in the works of informed democracy.
While there is great fear around synthetic voice technology, due to things like deepfakes, theft, misinformation, propaganda campaigns and job displacement, there is also a world of opportunity accompanying it. Synthetic speech will soon be virtually indistinguishable from real speech, and since not even a masterful voice actor can capture every possible prosody or emotion a production might call for, there is room for synthetic voices to fill the gaps. The presence of synthetic voice actors is a step towards our base reality and the Metaverse becoming more entangled. Perhaps within the next two years all of us will be indirectly communicating online with synthetic versions of our own voices.