March 26, 2025

How Cartesia Edges Out The Big Labs With Audio AI Models, with Karan Goel, Founder and CEO at Cartesia

It’s been said that the big labs will discover everything there is to find in AI research. But startups like Cartesia keep proving that wrong.

In this episode of Barrchives, I went deep with Karan Goel, Cartesia’s cofounder and CEO, about their state-of-the-art audio models built on the state space architecture. We talked about how state space models work, how Cartesia is trying to bring latency down to zero, why startups can still outcompete big research labs, and why audio is going to be the dominant way we interact with computers.

On state space models and how they work:

“The best way to understand state space models is to think about how a model’s capabilities scale with the amount of information that it sees. The classic transformer architecture has quadratic scaling with context, which means that drastically increasing context windows makes inference compute very expensive. State space models, by contrast, are designed to scale better with more context, which is increasingly important as you get into these regimes with massive multimodal data, like audio.”
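The scaling contrast Karan describes can be sketched in a few lines. This is a toy scalar state space recurrence, not Cartesia's actual architecture: the point is only that an SSM carries history in a fixed-size state, so each new token costs O(1) work (O(n) for the sequence), while attention scores every (query, key) pair, which is O(n²).

```python
# Toy illustration (not Cartesia's model): a scalar linear state space
# recurrence y_t = c*h_t, with h_t = a*h_{t-1} + b*x_t and h_0 = 0.
# The state h is a single number no matter how long the sequence is.

def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    h, ys = 0.0, []
    for x in xs:                # one O(1) update per token
        h = a * h + b * x       # fixed-size state summarizes all history
        ys.append(c * h)
    return ys

def attention_pair_count(n):
    # a transformer attention layer compares every token to every other
    return n * n

print(ssm_scan([1.0, 0.0, 0.0]))    # impulse decays: [0.5, 0.45, 0.405]
print(attention_pair_count(4096))   # 16777216 pairwise scores at 4k context
```

Doubling the context doubles the SSM's work but quadruples the attention pair count, which is the "quadratic scaling with context" being contrasted in the quote.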

On bringing audio model latency down to zero:

“We used to have 90ms latency, and our new models drop it down to 45ms. We’re trying to bring latency down to more or less 0. At some level there’s a principle here: even if the demand out there is only for something as low as 30ms, I would still push and say, let’s try to make it 15ms or 10ms. Because there are so many things that you can make room for when latencies go down, and you can learn so much from that.”

On why startups can still out-research giant labs:

“Being in academia, there was almost this despair that big labs are going to do everything and nobody else has a place in the world. We don’t believe that – there has never been a better time to do something new than now. In 20 years, there will still be new model companies spinning up, doing things the big labs can’t. It takes a huge amount of work to make something new that works, and you have to have deep conviction to get it over the line.”

On why voice has a chance to be the default modality for AI:

“I also think there’s a huge opportunity for voice to be a computing interface in some interesting ways. For example, I think voice is going to be the default mode in which people interact with robotics and with computers. Just the bandwidth of information that you can provide through voice is so high – you can say one word, and it will carry a huge amount of meaning based on your tone, the way you speak… you can communicate a lot of information very quickly.”

Become a better AI founder every Wednesday with articles and episodes sent directly to your inbox.