AI trained on YouTube and podcasts speaks with ums and ahs

Technology

An artificial intelligence that has been trained on YouTube and podcast recordings generates speech from text prompts that sounds remarkably natural

By Alex Wilkins

9 March 2023

Image of digital waveforms — An AI can generate more natural-sounding synthetic speech by including pauses
Shutterstock/PrinceOfLove

Generating speech with different rhythms and pauses makes it sound more human-like, according to an assessment of an artificial intelligence trained on speech taken from YouTube and podcasts.

Most artificial intelligence text-to-speech systems are trained on data sets of acted speech, which can lead to the output sounding stilted and one-dimensional. More natural speech often displays a wide range of rhythms and patterns to convey different meanings and emotions.

Now, Rudnicky at Carnegie Mellon University in Pittsburgh, Pennsylvania, and his colleagues have used almost 900 hours of talking from YouTube and podcasts to train a text-to-speech AI.

“This allows you to synthesise speech in a way that better reflects how humans speak,” says Rudnicky.

A user selects what voice the AI will use by supplying it with a sample of someone’s speech to mimic, such as the recording below.

Sample voice:

The model chops up the new speech data into discrete chunks, then uses a neural network to produce new vocalisations by predicting which chunk of speech (or umming or silence) is most likely to come next in a sequence. This is similar to how AI text generators like ChatGPT work.
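The idea of predicting the next chunk can be illustrated with a toy sketch. This is not the researchers' neural network: it swaps in simple counts of which chunk follows which, and the chunk labels, training sequences and function names are all hypothetical, chosen only to show the one-chunk-at-a-time principle.

```python
import random
from collections import Counter, defaultdict

# Hypothetical chunk labels: "sp1".."sp3" stand for speech sounds, "um" for a
# filled pause and "sil" for silence. A real system would learn its chunks
# from audio; these sequences are made up for illustration.
TRAINING_SEQUENCES = [
    ["sp1", "sp2", "um", "sil", "sp3"],
    ["sp1", "sp2", "sp3", "sil"],
    ["sp2", "um", "sil", "sp1"],
]


def fit_bigram(sequences):
    """Count how often each chunk follows each other chunk."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def most_likely_next(counts, prev):
    """Return the chunk seen most often after `prev` in training."""
    return counts[prev].most_common(1)[0][0]


def generate(counts, start, length, rng):
    """Sample a sequence one chunk at a time, weighted by the counts."""
    out = [start]
    for _ in range(length - 1):
        options = counts.get(out[-1])
        if not options:  # no known continuation, so stop early
            break
        chunks, weights = zip(*options.items())
        out.append(rng.choices(chunks, weights=weights, k=1)[0])
    return out


counts = fit_bigram(TRAINING_SEQUENCES)
print(most_likely_next(counts, "sp2"))  # um
print(generate(counts, "sp1", 6, random.Random(0)))
```

Because pauses and "um" are treated as chunks like any other, they turn up in the output wherever the training data makes them likely.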

This allows the model to take written prompts it is given, such as “yeah so ah all of the a i conferences are open to anyone who is capable of ah you know make you know paying for the trip and the the ticket”, and generate speech using the characteristic patterns of the chosen voice, such as in the example below.

AI-generated speech:

People recruited from the Amazon Mechanical Turk crowdsourcing platform judged the naturalness of the artificial speech on a five-point scale running from 1 (bad) to 5 (excellent), giving it an average score of 3.89. This is better than other AI-created voices, the closest of which managed 3.84. Actual human speech received a score of 4.01.
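An average score of this kind is simply the mean of the listeners' ratings. The sketch below shows the arithmetic under stated assumptions: the individual ratings are invented for illustration, and only the three averages (3.89, 3.84 and 4.01) come from the study as reported here.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a list of ratings on the 1 (bad) to 5 (excellent) scale."""
    if not ratings or any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be whole numbers from 1 to 5")
    return mean(ratings)


# Hypothetical ratings from eight listeners, not data from the study.
example_ratings = [4, 4, 3, 5, 4, 3, 4, 4]
print(mean_opinion_score(example_ratings))  # 3.875

# Averages as reported in the article.
reported = {"new model": 3.89, "closest other AI": 3.84, "human speech": 4.01}
gap_to_human = round(reported["human speech"] - reported["new model"], 2)
lead_over_rival = round(reported["new model"] - reported["closest other AI"], 2)
print(gap_to_human, lead_over_rival)  # 0.12 0.05
```

On these figures the new model sits 0.05 points ahead of its closest rival and 0.12 points short of real human speech.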

Producing the vocalisation bit by bit makes the model faster than others that generate entire sequences in one go, which could make it more suitable for applications such as audio chatbots or streaming services.

While the model can produce fairly natural-sounding speech, it is still just a proof-of-concept, says Rudnicky, and could be greatly improved by training it on more hours of data.

“They clearly haven’t quite got to the point where it is totally human sounding, but they’re absolutely going in the right direction,” says Beavan at the Alan Turing Institute in London.

The ability to mirror the patterns of human speech and how they change in different circumstances could be useful, says Beavan. Some situations call for certain ways of speaking, such as when you have just woken up in the morning and you would probably appreciate a more sensitive AI voice, or when it is an emergency and you might want a voice that conveys a sense of urgency, he says.

Reference

arXiv
