PlayHT has just released PlayHT 2.0, a Generative Text-to-Voice AI Model designed specifically for rendering conversational speech.
Notably, this model is the first to introduce the concept of adding “Emotions” to Generative Voice AI.
This approach gives users an unprecedented degree of control, letting them direct the generation of speech infused with specific emotions.
Currently, PlayHT 2.0 is in its closed beta phase but will soon be made accessible through PlayHT’s API and Studio.
Evolution: From PlayHT 1.0 to 2.0
Just eight months earlier, PlayHT had introduced PlayHT 1.0, its first Large Language Model (LLM) for Speech Synthesis.
This model showcased impressive results in terms of speech synthesis quality and voice cloning.
A demonstration featuring an AI-mediated podcast between Joe Rogan and Steve Jobs highlighted the AI’s capacity to produce speech rivalling human expressiveness and quality.
However, PlayHT 1.0 had its set of challenges:
- Limited zero-shot capabilities.
- Generation of only brief speech segments.
- No control over specific speech styles or emotions.
- Support restricted to English only.
Such limitations largely stemmed from its foundational architecture, a constrained dataset, and a narrow spectrum of speaker diversity.
But armed with insightful feedback from its community, PlayHT set out on a mission.
The goal was clear: overcome the preceding model’s limitations while simultaneously pushing the boundaries of what voice AI could achieve in producing more human-like conversational nuances.
The outcome? PlayHT 2.0. This iteration increased the model size tenfold and was trained on a dataset of over 1 million hours of speech, covering a diverse range of languages, accents, and speaking patterns.
Underlying this advanced Speech Synthesis model is a sophisticated neural network, echoing the transformer-based strategies seen in models such as OpenAI’s DALLE-2, but distinctly tailored for the auditory domain.
Central to this system is a Large Language Model (LLM). One can envisage this LLM as an ardent reader, having dedicated over 500 years to absorbing vast swathes of audio transcriptions.
This immense knowledge equips the LLM with a predictive ability. When presented with a transcript and nuanced hints about a speaker, the model adeptly predicts the nature of the corresponding audio.
This is achieved by translating the text into MEL tokens. Yet, these tokens represent merely the foundational structure of the sounds.
Enter the decoder model, which refines these foundational markers, akin to an artist elaborating on a basic sketch.
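To make this two-stage design concrete, here is a minimal, self-contained sketch of the pipeline described above: one model maps text to coarse MEL tokens, and a decoder expands those tokens into an audio signal. The module names, vocabulary sizes, and dimensions are illustrative assumptions, not PlayHT’s actual architecture, and the first stage is written non-autoregressively for brevity.

```python
# Illustrative two-stage text-to-speech sketch (assumed shapes and sizes).
import torch
import torch.nn as nn

class TextToMelTokenModel(nn.Module):
    """Toy transformer mapping text tokens to coarse MEL tokens.

    The real model is described as an autoregressive LLM; this version is
    non-autoregressive purely to keep the example short.
    """
    def __init__(self, text_vocab=256, mel_vocab=1024, dim=256):
        super().__init__()
        self.embed = nn.Embedding(text_vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_mel_tokens = nn.Linear(dim, mel_vocab)

    def forward(self, text_ids):
        hidden = self.encoder(self.embed(text_ids))
        return self.to_mel_tokens(hidden).argmax(dim=-1)  # coarse acoustic tokens

class MelTokenDecoder(nn.Module):
    """Decoder that refines coarse MEL tokens into an audio-like signal."""
    def __init__(self, mel_vocab=1024, dim=256, samples_per_token=240):
        super().__init__()
        self.embed = nn.Embedding(mel_vocab, dim)
        self.to_audio = nn.Linear(dim, samples_per_token)

    def forward(self, mel_tokens):
        frames = self.to_audio(self.embed(mel_tokens))  # (batch, tokens, samples)
        return frames.flatten(start_dim=1)              # (batch, audio samples)

text_ids = torch.randint(0, 256, (1, 32))     # a toy "transcript"
mel_tokens = TextToMelTokenModel()(text_ids)  # stage 1: text -> MEL tokens
audio = MelTokenDecoder()(mel_tokens)         # stage 2: tokens -> audio signal
print(audio.shape)
```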
Enhancements in Conversational Fluidity
A significant focus was placed on training PlayHT 2.0 to emulate authentic human conversations. This makes it adept across diverse conversational contexts, spanning phone calls, podcasts, and audio messaging.
Crafting genuine-sounding speech requires the model to simulate the cognitive processes of speaking, and the strategic inclusion of filler words heightens this realism.
Accelerated Performance Metrics
One of the major stumbling blocks of PlayHT 1.0 was its computationally intensive nature, which often resulted in lag.
With PlayHT 2.0, substantial architectural refinements have been made to bolster the model’s efficiency, slashing latency to real-time conversational standards.
As of now, PlayHT 2.0 can generate speech in a brisk 800 ms, and further latency improvements are anticipated.
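For anyone integrating a text-to-speech API, a simple way to sanity-check latency on their own setup is to time the request round trip. The sketch below uses a placeholder endpoint, headers, and payload rather than PlayHT’s documented API schema, and it measures full-response time rather than time-to-first-audio.

```python
# Hedged latency-measurement sketch against a placeholder TTS endpoint.
import time
import requests

def timed_synthesis(text: str) -> float:
    """Send one synthesis request and return the round-trip time in milliseconds."""
    start = time.perf_counter()
    response = requests.post(
        "https://api.example.com/v2/tts",               # placeholder URL, not PlayHT's
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={"text": text, "voice": "some-voice-id"},  # assumed payload fields
        timeout=30,
    )
    response.raise_for_status()
    return (time.perf_counter() - start) * 1000

if __name__ == "__main__":
    latency_ms = timed_synthesis("Hey, how's it going?")
    print(f"Round-trip synthesis time: {latency_ms:.0f} ms")
```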
Pioneering Voice Cloning and Emotion Direction
Among the standout features of PlayHT 2.0 is its ability to accurately replicate a voice from just a 3-second snippet of speech.
This cloning occurs in real time, with no need for extended fine-tuning, and thanks to its broad training foundation it can mimic voices across numerous languages and accents.
Impressively, it can carry a voice from one language into another while preserving the essence of the original accent.
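As a rough illustration of what instant cloning from a short sample might look like in client code, here is a hedged sketch; the function, payload, and returned voice ID are hypothetical and do not reflect PlayHT’s actual SDK or endpoints.

```python
# Hypothetical instant-cloning helper: not PlayHT's real API.
from dataclasses import dataclass

@dataclass
class ClonedVoice:
    voice_id: str
    language: str

def clone_voice(sample_audio: bytes, language: str = "en") -> ClonedVoice:
    """Stand-in for uploading ~3 seconds of reference audio and receiving a voice ID.

    A real integration would POST `sample_audio` to the provider's cloning
    endpoint; here we derive a fake ID locally purely for illustration.
    """
    if len(sample_audio) == 0:
        raise ValueError("A short reference sample (about 3 seconds) is required.")
    return ClonedVoice(voice_id=f"voice-{len(sample_audio)}", language=language)

# Example: pretend these bytes came from a 3-second recording of the speaker.
voice = clone_voice(b"\x00" * 48000, language="en")
print(voice)
```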
Although still in its infancy, the emotion direction capability in PlayHT 2.0 is noteworthy. This aspect of the model’s training allows it to discern and integrate emotions in real-time voice outputs.
It signifies the initial steps towards the goal of real-time emotional direction based solely on prompts.
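If emotion direction is eventually exposed through the API, a request might carry an extra field alongside the text and voice. The sketch below is speculative: the "emotion" parameter and its labels are assumptions for illustration only, not a documented interface.

```python
# Speculative payload builder showing where an emotion hint could fit.
from typing import Optional

def build_tts_request(text: str, voice_id: str, emotion: Optional[str] = None) -> dict:
    """Assemble a hypothetical synthesis payload with an optional emotion hint."""
    payload = {"text": text, "voice": voice_id}
    if emotion is not None:
        payload["emotion"] = emotion  # assumed labels, e.g. "excited", "sad", "angry"
    return payload

print(build_tts_request("We actually pulled it off!", "voice-123", emotion="excited"))
```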
With PlayHT 2.0 now available in its alpha stage through the Studio and API, the forthcoming weeks promise a series of upgrades to enhance its quality, responsiveness, and overall capabilities.
It’s a testament to the continuing strides in voice AI, pushing closer to the ultimate goal: synthetic voices that are indistinguishable from our own.