Introducing Lightning V3: State-of-the-Art Conversational Text-to-Speech Model
https://x.com/smallest_ai/status/2036483989310226505?s=12
Product announcement with a technical deep-dive and industry critique; combines press release, technical blog post, and an academic-style critique of TTS evaluation methodology · Researched March 27, 2026
Summary
Smallest.ai announced Lightning V3, a new text-to-speech model designed specifically for real-time conversational voice agents rather than read-aloud applications. The model achieves a Mean Opinion Score (MOS) of 3.89, which the company reports as the highest for conversational TTS, and demonstrates a 76% win rate against OpenAI's GPT-4o-mini-tts on naturalness evaluations. The company claims Lightning V3 outperforms competing models from ElevenLabs, Cartesia, and OpenAI across key metrics including intonation, prosody, and naturalness.
Beyond benchmarks, the announcement makes a more provocative argument: traditional TTS evaluation metrics are fundamentally insufficient for assessing conversational AI. The blog post argues that the TTS industry has spent years optimizing for how well a model reads continuous text—a task now largely "solved"—when the real challenge is creating voices that sound natural during real-time, streaming speech generation where the model must synthesize audio before receiving complete semantic context. In conversation, a voice must handle incomplete information, adapt prosody reactively to conversational context, produce natural disfluencies and irregularities that signal human cognition, and maintain user engagement without sounding scripted.
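To make the streaming constraint concrete, here is a minimal sketch of the chunked LLM-to-TTS pipeline the post describes. Every name here (`llm_token_stream`, `tts_synthesize_chunk`, the chunking rule) is a hypothetical stand-in for illustration, not smallest.ai's actual API:

```python
# Illustrative only: why streaming synthesis must commit prosody before
# the full sentence exists. The TTS call is a placeholder stub.
from typing import Iterator

def llm_token_stream() -> Iterator[str]:
    # Stand-in for an LLM emitting a reply token by token.
    yield from ["Sure,", " I", " can", " move", " your", " appointment",
                " to", " Tuesday", " at", " 3pm."]

def tts_synthesize_chunk(text: str) -> bytes:
    # Placeholder: a real client would call a streaming TTS endpoint here.
    return text.encode("utf-8")

def stream_tts(tokens: Iterator[str], max_chars: int = 24) -> Iterator[bytes]:
    """Buffer tokens into small chunks and synthesize each immediately.

    Audio for a chunk starts playing before later tokens arrive, so
    intonation is chosen with incomplete semantic context at every
    chunk boundary -- the condition read-aloud evals never measure.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        if len(buffer) >= max_chars or buffer.endswith((".", "?", "!")):
            yield tts_synthesize_chunk(buffer)
            buffer = ""
    if buffer:  # flush whatever remains when the LLM stops
        yield tts_synthesize_chunk(buffer)

for audio_chunk in stream_tts(llm_token_stream()):
    pass  # play audio_chunk on the output device as it arrives
```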
Lightning V3 supports 15 languages with automatic mid-sentence language switching, operates at sub-100ms latency for interactive applications, and includes a voice cloning feature that requires only 5-15 seconds of reference audio. A newer variant (V3.2) adds instruction-following capability for emotional register and prosodic control. The company positions conversational generation as the hardest acoustic context TTS can face and argues that a model optimized for conversation will perform well across all other TTS applications—voiceovers, audiobooks, podcasts, dubbing—since those use cases are less demanding.
Key Takeaways
Lightning V3 achieves 3.89 MOS (Mean Opinion Score) for conversational TTS, the highest published score for this specific evaluation context, outperforming OpenAI, ElevenLabs, and Cartesia models on this metric.
The model achieves a 76% win rate versus GPT-4o-mini-tts on overall naturalness evaluations, with particular strength in intonation (3.33) and prosody (3.07), the critical dimensions that signal natural conversation.
Conversational TTS differs fundamentally from read-aloud TTS because synthesis happens in real-time, streaming chunks as an LLM generates responses. Traditional end-to-end evaluation overstates performance in actual deployment because the model must sound natural with incomplete semantic context at every chunk boundary.
The blog post argues that "TTS Evals Are Dead": the field's standard metrics (MOS, WER, CER) are structurally incomplete for conversational applications because results shift dramatically with question framing, rater composition, and prompting, and because the metrics cannot distinguish generically expressive audio from naturalness appropriate to a specific persona and task.
Lightning V3 supports 15 languages with mid-sentence code-switching, enabling natural speech for multilingual conversations (e.g., Hinglish, Spanglish) rather than only supporting language switches at paragraph boundaries, addressing a real-world constraint for global voice agents.
Voice cloning from just 5-15 seconds of reference audio creates a production-grade replica that captures not just timbre but the natural irregularities and disfluencies that make a human voice feel like a person rather than a performance.
The model operates with sub-100ms latency and generates audio natively at 44.1 kHz, whose Nyquist limit of 22.05 kHz covers the full audible range of speech, while cleanly downsampling to telephony standards (8, 16, 24 kHz) without model changes (see the resampling sketch after this list).
The company argues that voices should be evaluated in context against the persona they inhabit—a calm healthcare agent, high-energy sales rep, and measured financial advisor should not be judged against the same prosodic ideal because they have different communicative goals and acceptable parameter ranges.
Human conversation contains natural disfluencies at approximately 5-6 per 100 words (filled pauses, silent pauses, prolongations, repetitions), and human listeners remain sensitive to these irregularities; over-smoothed synthesis that eliminates them sounds less natural than irregular synthesis that preserves human spontaneity.
Smallest.ai's evaluation was conducted specifically in a conversational generation setting using Seed-TTS corpus samples and an LLM-as-judge framework, differing from industry-standard end-to-end utterance synthesis evaluation that systematically overstates real-world performance for streaming applications.
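To unpack the sample-rate takeaway above: audio sampled at 44.1 kHz has a Nyquist limit of 22.05 kHz, just above the ~20 kHz ceiling of human hearing, so downsampling to telephony rates discards only bandwidth the narrower channel could not carry anyway. A minimal sketch using scipy's polyphase resampler as an assumed tool (this is not smallest.ai's pipeline):

```python
# Downsampling a 44.1 kHz waveform to common telephony/streaming rates.
# resample_poly applies an anti-aliasing filter internally.
import numpy as np
from scipy.signal import resample_poly

SR_NATIVE = 44_100  # Nyquist limit = 22,050 Hz, covers the audible range

def downsample(audio: np.ndarray, target_sr: int) -> np.ndarray:
    """Resample from 44.1 kHz to target_sr using a rational up/down ratio."""
    g = np.gcd(SR_NATIVE, target_sr)
    return resample_poly(audio, up=target_sr // g, down=SR_NATIVE // g)

t = np.linspace(0, 1.0, SR_NATIVE, endpoint=False)
audio = np.sin(2 * np.pi * 440.0 * t)  # 1 s of A440 as a stand-in waveform

for sr in (8_000, 16_000, 24_000):
    out = downsample(audio, sr)
    print(sr, len(out))  # 8000, 16000, 24000 samples for 1 s of audio
```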
About
Author: Sudarshan Kamath (Founder/CEO of smallest.ai)
Publication: smallest.ai (X/Twitter announcement + blog post)
Published: 2026-03-24
Sentiment / Tone
Confident and provocative with a didactic tone. The author positions smallest.ai as advancing not just model quality but industry understanding. There's a deliberate intellectual challenge embedded in the announcement—the blog post doesn't just showcase superior benchmarks, it critiques the entire TTS evaluation paradigm, suggesting the industry has been measuring the wrong things. The tone is collaborative rather than combative: the authors acknowledge that traditional metrics were appropriate for earlier challenges, but argue the field needs to evolve. The writing is evidence-driven and reflective, including audio examples and admitting nuance (e.g., that the 76% win rate is "worth interrogating" because preference rankings shift with context). This positions smallest.ai as thoughtful researchers rather than mere vendors claiming superiority.
Related Sources
Blizzard Challenge 2025 Report · Annual benchmarking challenge report that separately measured naturalness and appropriateness, finding significant differences for some TTS systems; evidence that context-dependent evaluation matters and invalidates one-size-fits-all scoring.
Sudarshan Kamath's Original Thread on Lightning V3 · The founder's original long-form thread with detailed metrics and links to the blog post; provides the raw announcement with technical specifications.
TTS Benchmark 2025: Smallest.ai vs ElevenLabs Comparative Report · Detailed head-to-head benchmark comparing Lightning V3 with ElevenLabs across latency, quality, and cost dimensions, providing additional context on competitive positioning and performance tradeoffs.
Research Notes
**Author background**: Sudarshan Kamath is the founder and CEO of smallest.ai, co-founded with Akshat Mandloi in November 2023. Kamath is of Indian origin, has roughly a decade of experience in AI with a background in product management and data science, and was educated at UC San Diego. The company is based in San Francisco with an office in Bangalore, India. Kamath gained attention in 2025 for offering lucrative job packages ($600K+) to laid-off Meta AI engineers, signaling the company's focus on attracting top research talent. Smallest.ai's stated mission is "AGI under 10B parameters": building ultra-efficient speech models and voice agents.
**Industry reactions**: The announcement received notably positive reception from AI practitioners and influencers on X. AI researcher Rohan Paul highlighted the argument that the TTS industry has optimized for the wrong objective: text reading rather than real-time conversation. Developer commenters praised the evaluation critique, with several noting that "TTS Evals Are Dead" says what "nobody in the industry wanted to say." The framing resonated because it articulates a widespread frustration: vendor benchmarks often compare narrow metrics without accounting for real-world deployment constraints.
**Broader context**: The TTS market has fragmented significantly by 2026. ElevenLabs, Cartesia, and OpenAI dominate, but smallest.ai's emphasis on sub-100ms latency and conversational naturalness targets a specific pain point: voice agents that sound robotic or create friction. The field has indeed largely solved the "text reading well" problem—standard TTS is intelligible. The new frontier is exactly what smallest.ai claims: conversational naturalness under constraints of real-time synthesis and incomplete context.
**Evaluation methodology significance**: The blog post's critique of MOS (Mean Opinion Score) is academically grounded. It cites the 2025 Blizzard Challenge and INTERSPEECH 2025 papers demonstrating that evaluation outcomes shift dramatically based on prompt framing and whether listeners are asked "how natural?" versus "how appropriate for this context?" This undermines vendor benchmarks that claim definitive superiority on MOS without acknowledging these dependencies. The LLM-as-judge approach smallest.ai uses is newer but has known limitations (prompt sensitivity, potential bias toward synthetic speech that mimics LLM-preferred patterns).
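The announcement does not publish its judge prompts, so the following is only a generic sketch of the LLM-as-judge pattern under stated assumptions: pairwise comparisons with randomized presentation order to control for the position bias this kind of critique warns about. The prompt wording and the `judge` stub are hypothetical:

```python
# Generic LLM-as-judge pairwise evaluation loop. The judge call is a
# stub; a real setup would send two systems' outputs to an LLM and
# parse its preference. Prompt wording, judge model, and presentation
# order all shift results -- the sensitivity the blog post flags.
import random

JUDGE_PROMPT = (
    "You are rating conversational speech. Given the same dialogue turn "
    "rendered by System A and System B, answer 'A' or 'B': which sounds "
    "more natural for this persona and context?"
)

def judge(prompt: str, sample_a: str, sample_b: str) -> str:
    # Placeholder for an actual LLM call; returns 'A' or 'B'.
    return random.choice("AB")

def win_rate(ours: list[str], theirs: list[str]) -> float:
    """Fraction of pairwise comparisons our system wins, with randomized
    A/B order so the judge's position bias cancels out on average."""
    wins = total = 0
    for a, b in zip(ours, theirs):
        flipped = random.random() < 0.5
        first, second = (b, a) if flipped else (a, b)
        verdict = judge(JUDGE_PROMPT, first, second)
        wins += (verdict == "B") if flipped else (verdict == "A")
        total += 1
    return wins / total
```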
**Credibility caveats**: While the company's benchmarks appear rigorous, they are conducted by the vendor (smallest.ai) rather than independent third parties. The evaluation corpus and methodology are published, supporting replication, but the company controls interpretation. The 3.89 MOS figure, while impressive, must be contextualized: MOS typically ranges 1-5, with 4+ indicating near-human quality, but this measurement is context-dependent. The 76% comparative win rate is particularly susceptible to frame-dependence and should be weighted less heavily than the underlying quality metrics.
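For reference, MOS is simply the mean of 1-to-5 listener ratings, so a point estimate like 3.89 carries rater variance that a confidence interval makes visible. A small sketch with placeholder ratings (illustrative values, not smallest.ai's data):

```python
# MOS from raw 1-5 opinion scores, with a normal-approximation 95% CI.
# The interval width is one reason a lone point estimate can overstate
# how settled a comparison between systems really is.
import math

def mos_with_ci(ratings: list[int]) -> tuple[float, float]:
    """Return (MOS, half-width of a 95% confidence interval)."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    return mean, 1.96 * math.sqrt(var / n)

ratings = [4, 4, 3, 5, 4, 3, 4, 4, 3, 5]  # placeholder opinion scores
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```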
**Market implications**: This announcement signals shifting competitive dynamics in TTS. Speed (latency) and conversational appropriateness are becoming differentiators, not just audio fidelity. The emphasis on voice cloning and persona-specific evaluation suggests smallest.ai is positioning itself for enterprise deployments where customization and rapid voice creation matter—call centers, customer service, IVR replacements.
**Related academic context**: The blog post references several peer-reviewed papers (INTERSPEECH 2025, Blizzard Challenge 2025) that support its evaluation critique. These suggest the academic TTS community is grappling with the same problem: how to measure what matters in real-world voice agent deployment rather than clean lab conditions.
Topics
Text-to-Speech (TTS) models and evaluation · Conversational AI and voice agents · Real-time speech synthesis and latency optimization · AI evaluation metrics and benchmarking limitations · Voice cloning and speaker similarity · Multilingual speech synthesis