I was looking into something like this for linux recently. Didn't find anything ...

utrack · on Feb 12, 2025

https://github.com/m-bain/whisperX looks promising - I'm hacking away on an always-on transcriber for my notes for later search&recall. It has support for diarization (the speaker detection you're looking for).

I'm currently hacking away on a mix of https://github.com/speaches-ai/speaches + https://github.com/ufal/whisper_streaming though - mostly because my laptop doesn't have a decent GPU, I stream the audio to a home server instead.

But overall it's pretty simple to do after you wrangle the Python dependencies - all you need is a sink for the text files (for example, create a new file for every Teams meeting, but that's another story...)

mijoharas · on Feb 12, 2025

Any good solutions for capturing the audio streams and piping them where they're needed? (I.e both microphone and speakers. I was wondering if I needed to mess with pulseaudio and/or jack (I mean pipewire under the hood, but I think those APIs sit on top and might be clearer))

mijoharas · on Feb 12, 2025

Never mind, played around a little, and pulseaudio's cli API makes it easy enough to sling some loopback/virtual devices around that you can then read from easily enough.

ewuhic · on Feb 12, 2025

So which are you "hacking away on" in the end?