7,200 Languages
It’s not about the content but the delivery.
In a world with more than 7,200 languages, the promise that Artificial Intelligence might break down linguistic barriers has researchers worldwide fixated. But here’s the challenge: AI is largely data-driven, and its focus has narrowed to the 10 to 12 dominant languages, sidelining approximately 50% of the world’s population.
So, could we be on the cusp of designing an AI model that obliterates language barriers altogether?
While we aren’t entirely there, groundbreaking progress is on the horizon. Enter Meta’s SeamlessM4T, a model that supports a range of tasks, from speech-to-speech translation to automatic speech recognition, across up to 100 languages. It is a comprehensive solution that one might call the “Holy Grail” of the AI realm.
For context, even achieving an English-Italian text-only translation poses challenges. Imagine then the hurdles for translating minority languages! The predicament? A paucity of speech data.
But here’s the silver lining: newer AI models that rely on unsupervised or self-supervised learning are yielding commendable results. The upside? These methods don’t require human-labeled data, so they can draw on colossal datasets that would be impractical to label manually.
This progress has ushered in an era of integrated models, in which components engineered for dedicated tasks combine to deliver the zenith of AI speech modeling: end-to-end speech translation.
Meta’s recent breakthrough is laudable: a single model that handles both speech and text translation across up to 100 languages. An all-inclusive model, if you will.
But how is this achieved? The model is optimized for multiple modalities and stays versatile across tasks, all housed under the UnitY architecture. In essence, the X2T model takes in speech or text and outputs text. This is the foundation for speech-to-text translation, text-to-text translation, and automatic speech recognition.
For audio output from text or speech input, the solution is a two-step approach. The first step translates the input into text via the X2T model; the second converts that text into speech using a combination of models. The crux lies in transforming the text into discrete acoustic units, which are then converted into speech.
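To make the two-step flow concrete, here is a minimal Python sketch of how such a pipeline could be wired together. It is an illustration only, not Meta’s code or API: the function `translate`, the type `Speech`, and the stage names `x2t`, `t2u` and `vocoder` are hypothetical, and the toy stages below are stand-ins that merely tag and re-encode their input so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Speech:
    """Hypothetical container for an audio waveform."""
    samples: List[float]


Source = Union[str, Speech]


def translate(
    source: Source,
    tgt_lang: str,
    want_speech: bool,
    x2t: Callable[[Source, str], str],
    t2u: Callable[[str, str], List[int]],
    vocoder: Callable[[List[int]], Speech],
) -> Union[str, Speech]:
    # Step 1: speech or text in, text out (the X2T stage).
    text = x2t(source, tgt_lang)
    if not want_speech:
        # Text output covers speech-to-text, text-to-text and ASR.
        return text
    # Step 2: text -> discrete acoustic units -> waveform.
    units = t2u(text, tgt_lang)
    return vocoder(units)


# Toy stand-ins (assumptions, not real models) so the sketch runs.
def toy_x2t(source: Source, tgt_lang: str) -> str:
    text = source if isinstance(source, str) else f"<{len(source.samples)} samples>"
    return f"[{tgt_lang}] {text}"


def toy_t2u(text: str, tgt_lang: str) -> List[int]:
    # One fake "unit" per character.
    return [ord(ch) % 100 for ch in text]


def toy_vocoder(units: List[int]) -> Speech:
    # One fake audio sample per unit.
    return Speech(samples=[u / 100.0 for u in units])


text_out = translate("hello", "ita", False, toy_x2t, toy_t2u, toy_vocoder)
speech_out = translate("hello", "ita", True, toy_x2t, toy_t2u, toy_vocoder)
print(text_out)                 # [ita] hello
print(len(speech_out.samples))  # 11
```

The point of the sketch is the shape of the flow: every task passes through the text-producing stage first, and speech output is just that text pushed through two further stages.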
In sum, Meta’s SeamlessM4T is an all-encompassing machine translation model that supports up to 100 languages across speech and text. Beyond its functionality, it exemplifies Meta’s strategy in the AI domain. The company’s commitment to open-sourcing its models, albeit with commercial limitations, signals its mission to democratize AI and prevent monopolization.
It’s a reminder that as AI integrates into the business landscape, the competitive edge will come not from AI alone, but from how strategically it is woven into a company’s vision.
Author
Steve King
Managing Director, CyberEd
King, an experienced cybersecurity professional, has served in senior leadership roles in technology development for the past 20 years. He has founded nine startups, including Endymion Systems and seeCommerce. He has held leadership roles in marketing and product development, operating as CEO, CTO and CISO for several startups, including Netswitch Technology Management. He also served as CIO for Memorex and was the co-founder of the Cambridge Systems Group.