The AI-powered text-to-speech converter accurately mimics human
speech, including its emotional tone, and even accounts for the
ambient sounds present at the time of the conversation.
The first week of the new year brought technology enthusiasts
a new AI model from Microsoft. After
"listening" to a 3-second sample of a person's speech, it accurately
imitates that voice and can read written text aloud in it. After
preliminary training, the AI model can serve as a full-fledged voice
double, because it not only copies speech but also captures its nuances: tempo, timbre, volume, expressiveness, and so
on. The developers named their new product VALL-E.
Microsoft's experts characterize VALL-E as a neural codec language model,
since it works as a language model over audio tokens produced by EnCodec,
a neural audio codec developed by Meta. The "parents" of VALL-E believe that in
the future their offspring can be used to develop next-generation
text-to-speech and speech-editing services, as well as to create
high-quality audio content. Microsoft's new product is expected
to become a valuable addition to the family of generative AI models.
At the same time, the developers understand that such technology can be very dangerous. Because VALL-E accurately copies a person's voice and manner of speaking, it could be used for criminal purposes: for example, to pass voice identification in someone else's place or to impersonate another person. To combat these risks, it will be necessary to build a dedicated model that distinguishes synthesized speech from real speech.