Introducing Lightning V3: State-of-the-Art Conversational Text-to-Speech Model

https://x.com/smallest_ai/status/2036483989310226505?s=12
Product announcement with technical deep-dive and industry critique; combines press release, technical blog post, and academic-style evaluation methodology critique · Researched March 27, 2026

Summary

Smallest.ai announced Lightning V3, a new text-to-speech model designed specifically for real-time conversational voice agents rather than read-aloud applications. The model achieves a Mean Opinion Score (MOS) of 3.89, which the company reports as the highest for conversational TTS, and demonstrates a 76% win rate against OpenAI's GPT-4o-mini-tts in naturalness evaluations. The company claims Lightning V3 outperforms competing models from ElevenLabs, Cartesia, and OpenAI across key metrics including intonation, prosody, and naturalness.

Beyond benchmarks, the announcement makes a more provocative argument: traditional TTS evaluation metrics are fundamentally insufficient for assessing conversational AI. The blog post argues that the TTS industry has spent years optimizing for how well a model reads continuous text—a task now largely "solved"—when the real challenge is creating voices that sound natural during real-time, streaming speech generation where the model must synthesize audio before receiving complete semantic context. In conversation, a voice must handle incomplete information, adapt prosody reactively to conversational context, produce natural disfluencies and irregularities that signal human cognition, and maintain user engagement without sounding scripted.

Lightning V3 supports 15 languages with automatic mid-sentence language switching, operates at sub-100ms latency for interactive applications, and includes a voice cloning feature that requires only 5-15 seconds of reference audio. A newer variant (V3.2) adds instruction-following capability for emotional register and prosodic control. The company positions conversational generation as the hardest acoustic context TTS can face and argues that a model optimized for conversation will perform well across all other TTS applications—voiceovers, audiobooks, podcasts, dubbing—since those use cases are less demanding.
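For interactive voice agents, the latency figure that matters is time-to-first-audio: how long after the request the first playable chunk arrives, not how long the full utterance takes. A minimal sketch of measuring that metric for any streaming TTS client; `synthesize_stream` is a stand-in stub (not smallest.ai's actual API), simulating chunked delivery so the timing code has something to measure:

```python
import time

def synthesize_stream(text):
    """Stand-in for a streaming TTS client: yields audio chunks as bytes.
    A real client would stream from a synthesis endpoint; here we simulate
    a short per-chunk delay so the measurement code has something to time."""
    for _ in range(5):
        time.sleep(0.02)           # simulated network/synthesis delay per chunk
        yield b"\x00" * 3200       # 100 ms of 16 kHz, 16-bit mono silence

def time_to_first_audio(stream):
    """Seconds until the first audio chunk arrives -- the conversational
    latency metric, as opposed to total synthesis time."""
    start = time.perf_counter()
    first_chunk = next(stream)
    return time.perf_counter() - start, first_chunk

latency, chunk = time_to_first_audio(synthesize_stream("Hello there."))
print(f"time to first audio: {latency * 1000:.1f} ms ({len(chunk)} bytes)")
```

A sub-100ms target means this measurement, taken against the live service under production network conditions, stays under 0.1 seconds.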

About

Author: Sudarshan Kamath (Founder/CEO of smallest.ai)

Publication: smallest.ai (X/Twitter announcement + blog post)

Published: 2026-03-24

Sentiment / Tone

Confident and provocative with a didactic tone. The author positions smallest.ai as advancing not just model quality but industry understanding. There's a deliberate intellectual challenge embedded in the announcement—the blog post doesn't just showcase superior benchmarks, it critiques the entire TTS evaluation paradigm, suggesting the industry has been measuring the wrong things. The tone is collaborative rather than combative: the authors acknowledge that traditional metrics were appropriate for earlier challenges, but argue the field needs to evolve. The writing is evidence-driven and reflective, including audio examples and admitting nuance (e.g., that the 76% win rate is "worth interrogating" because preference rankings shift with context). This positions smallest.ai as thoughtful researchers rather than mere vendors claiming superiority.

Research Notes

**Author background**: Sudarshan Kamath is the founder and CEO of smallest.ai, co-founded with Akshat Mandloi in November 2023. Kamath is Indian-origin with roughly a decade of experience in AI and a background in product management and data science; he was educated at UC San Diego. The company is based in San Francisco with an office in Bangalore, India. Kamath gained attention in 2025 for offering lucrative job packages ($600K+) to laid-off Meta AI engineers, signaling the company's focus on attracting top research talent. Smallest.ai's stated mission is "AGI under 10B parameters"—building ultra-efficient speech models and voice agents.

**Industry reactions**: The announcement received notably positive reception from AI practitioners and influencers on X. AI researcher Rohan Paul highlighted the claim that the entire TTS industry has optimized for the wrong objective—text reading rather than real-time conversation. Developer commenters called out the evaluation critique, with several noting that "TTS Evals Are Dead" represents what "nobody in the industry wanted to say." The framing resonated because it articulates a widespread frustration: vendor benchmarks often compare narrow metrics without accounting for real-world deployment constraints.

**Broader context**: The TTS market has fragmented significantly by 2026. ElevenLabs, Cartesia, and OpenAI dominate, but smallest.ai's emphasis on sub-100ms latency and conversational naturalness targets a specific pain point: voice agents that sound robotic or create friction. The field has indeed largely solved the "reading text well" problem—standard TTS is intelligible. The new frontier is exactly what smallest.ai claims: conversational naturalness under the constraints of real-time synthesis and incomplete context.

**Evaluation methodology significance**: The blog post's critique of MOS (Mean Opinion Score) is academically grounded. It cites the 2025 Blizzard Challenge and INTERSPEECH 2025 papers demonstrating that evaluation outcomes shift dramatically based on prompt framing and on whether listeners are asked "how natural?" versus "how appropriate for this context?" This undermines vendor benchmarks that claim definitive superiority on MOS without acknowledging these dependencies. The LLM-as-judge approach smallest.ai uses is newer but has known limitations (prompt sensitivity, potential bias toward synthetic speech that mimics LLM-preferred patterns).

**Credibility caveats**: While the company's benchmarks appear rigorous, they were conducted by the vendor (smallest.ai) rather than by independent third parties. The evaluation corpus and methodology are published, supporting replication, but the company controls the interpretation. The 3.89 MOS figure, while impressive, must be contextualized: MOS typically ranges from 1 to 5, with 4+ indicating near-human quality, but the measurement is context-dependent. The 76% comparative win rate is particularly susceptible to frame-dependence and should be weighted less heavily than the underlying quality metrics.

**Market implications**: This announcement signals shifting competitive dynamics in TTS. Speed (latency) and conversational appropriateness are becoming differentiators, not just audio fidelity. The emphasis on voice cloning and persona-specific evaluation suggests smallest.ai is positioning itself for enterprise deployments where customization and rapid voice creation matter—call centers, customer service, IVR replacements.

**Related academic context**: The blog post references several peer-reviewed papers (INTERSPEECH 2025, Blizzard Challenge 2025) that support its evaluation critique. These suggest the academic TTS community is grappling with the same problem: how to measure what matters in real-world voice agent deployment rather than under clean lab conditions.
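The two headline numbers (3.89 MOS, 76% win rate) come from different procedures and, per the caveats above, deserve different weight. A minimal sketch of how each is computed from raw listener data; the ratings and pairwise preferences below are invented placeholders, not the published evaluation corpus:

```python
from statistics import mean

def mos(ratings):
    """Mean Opinion Score: average of absolute 1-5 quality ratings.
    Sensitive to prompt framing ("how natural?" vs. "how appropriate?")."""
    assert all(1 <= r <= 5 for r in ratings)
    return mean(ratings)

def win_rate(preferences):
    """Pairwise preference rate: fraction of A/B trials in which listeners
    picked model A over model B. Frame-dependent, as the critique notes."""
    wins = sum(1 for p in preferences if p == "A")
    return wins / len(preferences)

# Invented toy data for illustration only.
ratings = [4, 4, 3, 5, 4, 3, 4, 4]
prefs = ["A", "A", "B", "A", "A", "B", "A", "A"]  # model A vs. a baseline
print(f"MOS: {mos(ratings):.2f}")          # 3.88 on this toy data
print(f"win rate: {win_rate(prefs):.0%}")  # 75% on this toy data
```

The contrast makes the critique concrete: MOS aggregates absolute judgments whose meaning shifts with the question asked, while a win rate collapses context-dependent preferences into a single head-to-head figure.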

Topics

Text-to-Speech (TTS) models and evaluation
Conversational AI and voice agents
Real-time speech synthesis and latency optimization
AI evaluation metrics and benchmarking limitations
Voice cloning and speaker similarity
Multilingual speech synthesis