Most of what you hear about AI right now is about text, but Mikey Shulman (co-founder and CEO of Suno) would tell you that audio is a much more interesting medium to work with. How do you use AI to generate music? What makes audio data uniquely difficult to parse? And how do you build audio models that cater to unique, subjective human preferences in music?
Suno is building a future where anyone can make great music. In this episode of Barrchives, I sat down with Mikey to talk about how they do what they do, from why they chose a transformer-based architecture to how they test new models when outputs are so subjective.
“There’s something special about autoregression – and the theory on this isn’t super well developed – that it figures it out bit by bit, which tends to make for more interesting music. The cartoon version of this is that autoregression might make for really beautiful music that sounds like it was poorly recorded, while diffusion models make great-sounding elevator music that’s a little boring.”
“One of the biggest challenges is that it’s big. It’s unwieldy, it’s poorly organized, there’s no Common Crawl out there. You can’t search over it, you can’t catalogue it very easily, and all of these things push people away from working with audio. For us, by virtue of just loving music, it was really a labor of love to push through this and organize audio. We spent a lot of time figuring out the right way to tokenize audio that would make interesting music.”
“Like the text world, you need quality and you need quantity, and they don’t necessarily need to come from the same place. But something we feel strongly about is not passing judgement on music, beyond obvious issues like low bit depth, wrong sample rate, etc. We want to pass that on to inference and let the user decide what they want to hear. And this becomes an incredible data collection mechanism – if our model thought two songs were of equal quality, and yet a human says one is better than the other, it’s a great way to eventually steer these models towards human preferences.”