Welcome back to our blog! Today we are going to talk about our framework for multilingual movie dubbing using AI-generated voices: iSynchro! This post covers the idea, the implementation, and the advantages of using such a methodology in the movie industry. Stay tuned to discover the benefits of iReason's iSynchro.
Introduction
Online streaming services offer a vast range of movies, TV shows, documentaries, and other video entertainment. However, most produced movies are in English, so a need arises to translate them and provide suitable subtitles in different languages, depending on where the streaming services are expanding and focusing their market. Even when the target audience speaks another language, the translation process is demanding and challenging. A popular trend is film dubbing: the process of replacing the audio of a foreign-language movie with audio in the audience's language, opening endless opportunities for filmmakers and endless choices for the audience. Its popularity has grown because it is a terrific choice for people who would rather listen than read text at the bottom of the screen, and as an assistive approach it helps reach the elderly as well as blind people and people with vision impairments.

The dubbing process consists of hiring native voice actors of the chosen language who substitute the original actors' speech. To complete the soundtrack and preserve the background sounds, the new audio is combined with the original sound. This process costs much more than translating the transcript into subtitles, but it reaches different target groups. Hence, a necessity emerges for making the process easier, more effective, and less expensive. So, iSynchro comes to the rescue! 🚀
The idea 💡
Let's first start with the idea. What is iSynchro? iSynchro is a framework that converts the original voices of the actors into other languages while preserving their unique speaking styles. We propose multilingual movie dubbing that we call AI-aided movie synchronization. The pipeline first extracts the audio samples that contain voices; then, for each segment, a voice is generated in the desired language using AI text-to-speech models and further modified with a voice style transfer model that applies the unique speaking style of the original actor. Since movies also contain background sound, another AI model is used to separate the background track (music, ambient noise, etc.) from the voices. Once all samples are obtained, they are combined with the background track and applied to the original video. Currently, the framework works with more than a dozen languages, with the intention to cover even more: English, German, Bulgarian, French, Japanese, Spanish, Macedonian, Romanian, Russian, Slovenian, Serbian, Turkish, Zulu, and Croatian. A minimal sketch of the overall flow is shown below.
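To make the flow concrete, here is an outline of how the stages could be chained together. Every helper name below (`extract_voice_segments`, `transfer_voice_style`, and so on) is a hypothetical placeholder for the purposes of this sketch, not iSynchro's actual API; the individual steps are illustrated with concrete tools in the sections that follow.

```python
# Hypothetical outline of an iSynchro-style pipeline; each helper is a
# placeholder that the following sections illustrate with concrete tools.

def dub_movie(video_path: str, target_language: str) -> str:
    audio = extract_audio(video_path)                 # full soundtrack
    segments = extract_voice_segments(audio)          # (actor, start, end, clip)
    background = remove_vocals(audio)                 # music and noise only

    dubbed = []
    for seg in segments:
        text = transcribe(seg.clip)                   # original-language text
        translated = translate(text, target_language)
        tts_clip = synthesize_speech(translated, target_language, seg.actor)
        styled = transfer_voice_style(tts_clip, reference=seg.clip)
        dubbed.append((seg.start, styled))

    soundtrack = mix(background, dubbed)              # overlay at timestamps
    return replace_audio(video_path, soundtrack)      # mux into the video
```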
The audio extraction 🎙️
The extraction of audio segments is one of the trickiest parts, as it is a very demanding process. Every time a voice appears or is recognized, the corresponding sample is cut out, saved separately, and later transcribed. Each segment is labelled with an index and the name of the actor whose voice it contains, so we preserve the knowledge of which style to apply later on. The segments are transcribed in the original language, most commonly English.
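A minimal sketch of this step using the open-source openai-whisper and pydub packages; these tool choices are our assumption for illustration, as the blog does not name the exact components iSynchro uses:

```python
# Transcribe the soundtrack with timestamps, then cut out each voice
# segment and save it under an index. Actor labels would be assigned
# separately (e.g. by speaker diarization or manual annotation).
import whisper
from pydub import AudioSegment

model = whisper.load_model("base")
result = model.transcribe("movie_audio.wav")  # segments with start/end times

audio = AudioSegment.from_file("movie_audio.wav")
for i, seg in enumerate(result["segments"]):
    start_ms = int(seg["start"] * 1000)
    end_ms = int(seg["end"] * 1000)
    clip = audio[start_ms:end_ms]             # pydub slices in milliseconds
    clip.export(f"segment_{i:03d}.wav", format="wav")
    print(i, seg["text"].strip())             # transcript for later translation
```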
The AI-voice generator 🤖
Each of the audio segments serves as input to text-to-speech models trained on the languages mentioned above. For each recording we know who the speaker is, the speaker's gender (male/female), the original language, and the desired target language, so a more human-like voice can be generated that corresponds to the actor speaking. The English transcriptions are translated into the desired language, and those sentences are used to generate the TTS voices. That way, we obtain AI-generated voices in the chosen language that match the original transcription.
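A sketch of the translate-then-synthesize step. We assume a MarianMT model (via Hugging Face transformers) for translation and the Coqui TTS toolkit for synthesis; the blog does not specify which translation or TTS models iSynchro uses, and the file names are illustrative:

```python
# Translate an English transcript to German, then synthesize it with a
# German single-speaker TTS model. In a full pipeline, the model (and
# speaker gender) would be chosen per actor and per target language.
from transformers import pipeline
from TTS.api import TTS

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
translated = translator("I'll be back.")[0]["translation_text"]

tts = TTS("tts_models/de/thorsten/tacotron2-DDC")
tts.tts_to_file(text=translated, file_path="segment_007_de.wav")
```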
The voice style transfer 🔀
Following the newly generated AI voices, a styling technique is used to apply each actor's individual style to the new audio segments. This process is called voice style transfer, also known as voice conversion. The idea is to imprint the original actor's speaking style onto the AI-generated audio. For that purpose we utilize the state-of-the-art model YourTTS ("YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone"). The model takes the original audio segment and the original language as references and uses the learnt features to modify the AI-generated sample from the previous step. This way we preserve the uniqueness of the actor and all of their speaking characteristics. The process is applied to each voice segment.
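YourTTS is available through the Coqui TTS toolkit. The blog describes style transfer as a separate pass over already-generated audio; the interface sketched below instead conditions synthesis directly on a reference clip of the original actor (zero-shot cloning), folding the two steps into one call. Note also that the released YourTTS checkpoint covers only a few languages (e.g. English, French, Portuguese), so this is an illustration rather than iSynchro's exact setup:

```python
# Zero-shot voice cloning with YourTTS: a short reference clip of the
# original actor conditions the voice of the synthesized speech.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
    text="Je reviendrai.",
    speaker_wav="actor_reference.wav",   # original actor's segment
    language="fr-fr",
    file_path="segment_007_fr_styled.wav",
)
```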
The vocal remover 🧑‍🔧
To make the movie synchronization even more realistic, we add back the background sounds that are an inevitable part of every video recording. In this case, we use Ultimate Vocal Remover, a model that processes an audio recording and separates it into two audio files: one containing only the voices, and the other containing only the background sounds. The second file, with the background music left untouched, is combined with the generated audio samples from the previous step.
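A sketch of that combining step with pydub: each styled segment is overlaid onto the vocal-removed background track at its original timestamp. The file names and the segment list are illustrative:

```python
# Overlay dubbed segments onto the background-only track at the
# positions (in milliseconds) recorded during segment extraction.
from pydub import AudioSegment

background = AudioSegment.from_file("background_only.wav")
dubbed = [(12_500, "segment_000_de.wav"), (31_200, "segment_001_de.wav")]

mix = background
for position_ms, path in dubbed:
    clip = AudioSegment.from_file(path)
    mix = mix.overlay(clip, position=position_ms)

mix.export("dubbed_soundtrack.wav", format="wav")
```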
The generation of a new video 📽️
Finally, the movie synchronization is complete when the new audio track, background sound included, is applied to the original video, with each styled AI-generated sample placed at the exact position where the respective actor speaks. The procedure is repeated as many times as needed, depending on the number of languages chosen for the synchronization of that movie.
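One straightforward way to attach the new soundtrack is with ffmpeg (assumed to be installed and on the PATH), copying the video stream untouched and swapping in the mixed audio; the file names are illustrative:

```python
# Replace the video's audio track with the dubbed soundtrack,
# keeping the video stream as-is (-c:v copy avoids re-encoding).
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-i", "movie.mp4",               # original video
    "-i", "dubbed_soundtrack.wav",   # mixed track from the previous step
    "-map", "0:v", "-map", "1:a",    # keep video, take the new audio
    "-c:v", "copy", "-shortest",
    "movie_fr.mp4",
], check=True)
```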
Conclusion
iSynchro's potential is of great importance, as it serves as an assistive technology for visually impaired people and for people who prefer listening over reading subtitles in their language. It is still in an early phase of development, but progress is continuous as we strive to automate as many processes as possible and make it even easier and more accessible.
This blog is just an introduction to this revolutionary framework, which can easily be incorporated into many existing streaming services focused on expanding globally. As we improve our methodology, more blogs will be published with updates.
Check out our website: https://isync.ireason.mk/ and stay tuned for more information!