> This is a _very_ low-hanging fruit that anyone with a couple of DGX H100 servers can solve in a month, and it is a real-world problem that needs solving.
I am not convinced it is low-hanging fruit: it is super easy for humans but far from trivial for machines. You are right, though, that many neglect it. I work for speechmatics.com, and we have spent a significant amount of effort on it over the years. We now believe we have the world's best real-time speaker diarization system; you should give it a try.
After throwing an average meeting at your system as an MP3: yes, your diarization is by far better than everything else I've tried. I'd say you're 95% of the way to being good enough to become the backbone of monolingual corporate meeting transcription, and I'll buy API tokens next time I need this instead of training a custom model. Your transcription itself isn't that great, but it's good enough for an LLM to produce minutes of the meeting.
That said, the trick to extracting voices is to work in frequency space. I'm not sure what your model does, but my home-made version first ran all the audio through an FFT, at which point it essentially became a vision problem: finding speech patterns that matched in pitch. It then output extremely fine-grained timestamps for where they were found, and some Python glue fed those segments into the open-source Whisper speech-to-text model.