One very important quote from the official announcement:

> For the first time, developers can “instruct” the model not just on what to say but how to say it—enabling more customized experiences for use cases ranging from customer service to creative storytelling.

The instructions are the "vibes" in this UI. But the announcement is wrong with the "for the first time" part: it was possible to steer the base GPT-4o model to create voices in a certain style using system prompt engineering (blogged about here: https://minimaxir.com/2024/10/speech-prompt-engineering/ ) out of concern that it could be used as a replacement for voice acting, however it was too expensive and adherence isn't great.

The schema of the vibes here implies that this new model is more receptive to nuance, which changes the calculus. The test cases from my post behave as expected, and the cost of gpt-4o-mini-tts audio output is $0.015 / minute (https://platform.openai.com/docs/pricing ), which is about 1/20th of the cost of my initial experments and is now feasible to use to potentially replace common voice applications. This has implications, and I'll be testing more around more nuanced prompt engineering.