
My understanding is that they use a side effect of the Bark model. The comment https://news.ycombinator.com/item?id=35647569 from JonathanFly probably explains it well: if you train the model on a massive amount of audio that mixes lyrics with music, then prompting with lyrics alone pulls the music along with it, much as that comment suggests that prompting with context-correlated text pulls in the background noise typical of that context. While writing this I imagined training on a huge set of publicly performed poetry, which would let you generate novel performances by artificial poets from novel prompts. This is different from the riffusion.com approach, whose genius idea is more or less feeding spectrograms as images to Stable Diffusion.
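
For anyone who wants to poke at that side effect directly, here's a minimal sketch against the suno-ai/bark Python package (assuming its generate_audio/preload_models API; the ♪ markers follow the convention mentioned in Bark's README for nudging output toward song, and the lyric text itself is just a made-up example):

    # Sketch: prompt Bark with bare lyrics and let the learned lyrics+music
    # correlation pull instrumental backing in "for free".
    from scipy.io.wavfile import write as write_wav
    from bark import SAMPLE_RATE, generate_audio, preload_models

    preload_models()  # downloads/caches Bark's text, coarse and fine models

    # ♪ markers hint that the text should be sung rather than spoken
    lyrics = "♪ I wrote these lines for no one, and the model hums along ♪"

    audio = generate_audio(lyrics)  # float32 numpy array at SAMPLE_RATE
    write_wav("bark_lyrics.wav", SAMPLE_RATE, audio)

Note that nothing in that prompt asks for accompaniment; if music shows up anyway, that's the lyrics+music correlation from the training mix doing the work, which is exactly the side effect described above.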


