From his early work in physics and mathematical simulation to leading Vision Pro development at Apple, Amit Jain has consistently pushed the boundaries of how computers process and understand our world. Now as the founder and CEO of Luma AI, he's tackling an even bigger challenge: building what he calls a "universal imagination engine."
In this episode of Barrchives, Amit shares his perspective on why unified AI models outperform specialized ones, how visual data transforms AI reasoning, and what it takes to build infrastructure capable of training on millions of videos. But underlying these technical insights is a broader vision about technology's role in human progress.
Watch the full episode to learn how Luma is working to transform the future of human-computer interaction, and read on for some of the most interesting moments from our conversation.
If we want to climb the ladder of the Kardashev scale, humanity has to become a Kardashev Type I civilization and then a Type II civilization. We need to wield more and more matter, energy, and information together. That's not going to happen with chisels and saws - it's going to happen when we use higher-level technologies to manipulate bigger and bigger entities.
Unless you're able to jointly train and backprop through the whole system, it's very difficult to imagine how the sum of the parts is better than the parts themselves. We saw that in past eras of machine learning: with computer vision, people tried to combine all these systems together, and the boundaries between them were very fragile.
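A toy way to see what that "backprop through" buys you: in the minimal PyTorch sketch below (the module names and sizes are made up, standing in for the components of a real perception system, not Luma's architecture), the jointly trained pipeline receives gradient at every parameter, while the stitched pipeline gets no learning signal across the hand-defined boundary.

```python
import torch
import torch.nn as nn

def make_pipeline():
    # Tiny stand-ins for two stages of a perception system; sizes are illustrative.
    return nn.Linear(64, 32), nn.Linear(32, 10)

x = torch.randn(8, 64)
target = torch.randint(0, 10, (8,))
loss_fn = nn.CrossEntropyLoss()

# Joint training: one loss, backprop reaches every parameter,
# so the interface between the stages is learned end to end.
enc, head = make_pipeline()
loss = loss_fn(head(enc(x)), target)
loss.backward()
print(enc.weight.grad is not None)  # True: gradient crosses the boundary

# Stitched systems: the first stage is frozen, so the downstream loss
# can never repair errors at the hand-defined boundary.
enc, head = make_pipeline()
with torch.no_grad():
    features = enc(x)
loss = loss_fn(head(features), target)
loss.backward()
print(enc.weight.grad is None)      # True: no learning signal crosses the seam
```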
This world is about a set of models, or a universal model, that has been pre-trained on audio, video, language, and basically humanity's entire digital footprint. We believe that actually leads to way better reasoning capabilities than whatever we have in LLMs. What we have in LLMs is substantially better than what we had in the previous generation of models, but it's still extremely gated... if you're able to observe that much more data, and that much more correlated data, you have way better reasoning capabilities.
Visual data has a lot of redundancy, but the flip side of that is there's just a lot of it, a lot, a lot, a lot of it... We trained Dream Machine, and Dream Machine V2 especially, which is out now, on 1.2 quadrillion tokens, which is 1,200 trillion tokens... And this data is in the tens and tens of petabytes.
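To put those figures in perspective, here is a rough back-of-envelope comparison. Only the 1.2-quadrillion-token count and the "tens of petabytes" figure come from the conversation; the text-corpus size and the 40 PB stand-in below are illustrative assumptions.

```python
# Back-of-envelope only: the token count and "tens of petabytes" come from
# the conversation; everything else here is an illustrative assumption.
VIDEO_TOKENS = 1.2e15    # 1,200 trillion tokens quoted for Dream Machine V2
TEXT_TOKENS = 1.5e13     # ~15 trillion tokens, a typical large text corpus (assumption)
print(f"~{VIDEO_TOKENS / TEXT_TOKENS:.0f}x the size of a large text-only corpus")

DATASET_BYTES = 40e15    # "tens of petabytes": 40 PB taken as a stand-in
print(f"~{DATASET_BYTES / VIDEO_TOKENS:.0f} bytes of stored video per training token")
```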
Our path there is, instead of scaling the models by 100x and 1,000x, and then making them extremely untenable for deployment and so expensive that nobody can have access to them, you'd rather train more intelligent models and train them on about 1,000 times more data. If you can do that, you can not only make them really fast and really cheap, but you can also deploy them in places and for use cases that people don't think of today.
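A hedged way to see that trade-off is with the common compute rules of thumb (training FLOPs of roughly 6 x parameters x tokens, and roughly 2 x parameters of inference FLOPs per generated token). The parameter and token counts in the sketch below are illustrative, not Luma's:

```python
# Rough sketch using standard FLOPs rules of thumb; all counts are illustrative.
def train_flops(params, tokens):
    return 6 * params * tokens          # ~6 * N * D training compute

def infer_flops_per_token(params):
    return 2 * params                   # ~2 * N per generated token

big_model = dict(params=1e12, tokens=1e13)    # scale the model up 100x
small_model = dict(params=1e10, tokens=1e15)  # keep it small, scale the data ~1,000x

for name, m in [("100x params", big_model), ("1000x data", small_model)]:
    print(name,
          f"train: {train_flops(m['params'], m['tokens']):.1e} FLOPs,",
          f"serve: {infer_flops_per_token(m['params']):.1e} FLOPs/token")

# The training bills come out comparable, but the data-heavy model is ~100x
# cheaper to serve on every query, which is what makes broad deployment
# and new use cases affordable.
```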