AI speech generator 'reaches human parity' — but it's too dangerous to release, scientists say

Microsoft has developed a new artificial intelligence (AI) speech generator that is apparently so convincing it cannot be released to the public.

VALL-E 2 is a text-to-speech (TTS) generator that can reproduce the voice of a human speaker using just a few seconds of audio.

In a paper posted June 17 to the preprint server arXiv, Microsoft researchers said VALL-E 2 was capable of generating "accurate, natural speech in the exact voice of the original speaker, comparable to human performance." In other words, the new AI voice generator is convincing enough to be mistaken for a real person, at least according to its creators.

"VALL-E 2 is the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time," the researchers wrote in the paper. "Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases."

Human parity in this context means that speech generated by VALL-E 2 matched or exceeded the quality of human speech in benchmarks used by Microsoft.

The model achieves this through two key features: "Repetition Aware Sampling" and "Grouped Code Modeling."
