Hacker News new | past | comments | ask | show | jobs | submit login

The feature I want is speaker differentiation - I want to feed in an audio file and get back a transcript with "Speaker 1: ..., Speaker 2: ..." indications.

That plus timestamps would be incredible.

The Google Gemini 2.0 models are showing some promise with this, I can't speak to their reliability just yet though.




I had good results with pyannote and the following model for that use case in the past https://huggingface.co/pyannote/speaker-diarization-3.1


I thought Deepgram already did speaker diarization (which is differentiation) pretty well. That and it can include timestamps plus other metadata.


WhisperX does all of this, I use it all the time to transcribe meeting notes. Both speaker differentiation and individual word timestamps.




Consider applying for YC's Summer 2025 batch! Applications are open till May 13

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: