- Inworld AI launched a new voice model called Realtime TTS-2.
- The new model analyzes vocal cues for tone and emotion, improving natural interaction.
- Inworld focuses on providing models to developers, avoiding app competition with its customers.
Inworld AI rolled out a new AI voice model designed to make conversations with machines feel more human by understanding not just what users say but how they say it.
The Mountain View-based startup's latest system, Realtime TTS-2, analyzes vocal cues such as tone, pacing, and pitch to infer a speaker's emotional state in real time. It then dynamically adjusts its own voice and delivery to create more natural, emotionally aware interactions (TTS stands for text-to-speech, a type of voice-based AI model).
As AI voice models become more realistic, usage and engagement could climb. Text-based models, AI coding, and image generation have been the big hits so far, but speaking with models and chatbots is potentially a more natural way to use this technology. Inworld CEO Kylan Gibbs believes solving the emotional layer is essential for this to happen at scale.
"Real-time conversation, as we're having now, is the natural mode that people interact with," he told me in a recent interview. "The closer you get to that, the more engagement you see."
The release marks a shift in focus for the company, which has raised more than $100 million from investors including Founders Fund, Intel, and Microsoft. Inworld's previous model already ranked at the top of industry benchmarks for voice quality, outperforming rivals like Google and ElevenLabs. But Gibbs said that wasn't enough.
So far, most top AI voice models have been designed for audiobooks, voiceovers, and similar media content, according to Gibbs, a former DeepMind product manager.
"If you hear AI voice today, it sounds like a human, but it sounds like a human reading from a script, and there's something off," he said. "It might sound good, but it feels bad. Imagine just talking to an audiobook."
That disconnect, between realism and natural interaction, became Inworld's next target.
To tackle this, TTS-2 combines several capabilities that typically don't exist together in AI voice systems. For instance, it understands the full history and context of a conversation, so a line delivered after a joke lands differently than the same line delivered after bad news.
The new voice model can also detect emotional signals from human speech in real time, and it continuously updates what Inworld calls a "user state" and "agent state" to guide how the AI responds.
A live demo
In an exclusive live demo at Inworld headquarters in Silicon Valley, Gibbs showed me how TTS-2 performed. Within a few seconds, the AI voice model switched between several different states as Gibbs spoke and introduced different topics and tones.
One moment, the AI voice model was "empathetic, apologetic, and direct" when responding to a customer-service delay. It quickly evolved to "patient, warm, and clarifying," then "empathetic, helpful, fast-paced," depending on the context, topic, and how Gibbs was talking.
Mild amusement
Later in the live demo, an AI character named "Jason" illustrated how subtle those responses can be. After Gibbs made an intentionally inappropriate joke, the AI didn't ignore it or respond bluntly.
Instead, it delivered a carefully balanced reaction: "Well, I mean, it was definitely effective. You definitely got my attention. I don't know if I'd call it funny, but it was impressive in a way."
The tone conveyed mild amusement alongside polite disapproval, an example of the kind of nuance Inworld is aiming for.
Gibbs said this kind of emotional awareness has been largely missing from voice AI because existing systems treat speech as isolated text inputs. By contrast, TTS-2 is designed to interpret a broader range of signals, including delivery style and prosody — how something is said, rather than the words themselves.
The technology could have wide-ranging applications, from customer service and healthcare to education and AI companions, Gibbs said.
Just models and APIs
Inworld is positioning the model as infrastructure for developers rather than a consumer product, offering it through an API that plugs into existing AI systems. APIs, or application programming interfaces, are a common way apps share data and communicate.
While rival AI voice startup ElevenLabs works with customers at the application level, Inworld gives developers access to the underlying models, leaving them more freedom to build their own applications on top.
This is partly because Gibbs wants to avoid competing with Inworld's customers. The rise of AI coding tools such as Anthropic's Claude Code and OpenAI's Codex is also making app development much easier, so there's less value at that layer of the tech stack now, Gibbs said.
"We really now only produce models and APIs," he added.