Microsoft AI model can speak in your voice after 3-second training

The AI-powered text-to-speech converter accurately mimics a person's voice, including its emotional tone, and even takes into account the background sounds present when the sample was recorded.

In the first week of the new year, Microsoft gave fans of technological innovation something to talk about by presenting a new AI model. After "listening" to a 3-second sample of a person's speech, it accurately imitates that voice and can read written text aloud in it. After this preliminary training, the AI model can serve as a full-fledged voice double, because it not only copies the speech but also reproduces its nuances: tempo, timbre, volume, expressiveness, and so on. The developers have named their new product VALL-E.

Microsoft's experts characterize VALL-E as a neural codec language model, because it builds on EnCodec, a recent neural audio codec developed by Meta. VALL-E's creators believe that it could eventually be used to develop next-generation services that convert text to speech and edit speech, as well as to create high-quality audio content. Microsoft's new product is expected to become a valuable addition to the growing collection of generative AI models.

At the same time, the developers understand that such technology can be very dangerous. Since VALL-E accurately copies a person's voice and manner of speaking, it could well be used for criminal purposes: for example, to pass voice identification in someone else's place or to impersonate another person. To combat these risks, a dedicated model will be needed to distinguish synthesized speech from real speech.
