I was looking into something like this for Linux recently. Didn't find anything obviously simple.
(I considered hooking up whisper.cpp and a bit of audio magic to make it at least transcribe, but firstly it seemed like a fair bit of a pain, and secondly I couldn't think of a nice way to do speaker detection.)
https://github.com/m-bain/whisperX looks promising - I'm hacking away on an always-on transcriber for my notes, for later search and recall. It has support for diarization (the speaker detection you're looking for).
But overall it's pretty simple to do after you wrangle the Python dependencies - all you need is a sink for the text files (for example, create a new file for every Teams meeting, but that's another story...)
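To make the "sink for the text files" part concrete, here's a rough sketch of one way to do it: one file per meeting, with diarized segments appended as they arrive. The names `transcript_path` and `append_segment`, the file naming scheme, and the `[speaker] text` line format are all made up for illustration; nothing here comes from whisperX itself.

```python
import re
from datetime import datetime
from pathlib import Path


def transcript_path(root, title, started):
    """Build a per-meeting file path like notes/2024-01-02_0930_weekly-sync.txt."""
    # Reduce the title to a filesystem-friendly slug (hypothetical naming scheme).
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "untitled"
    return Path(root) / f"{started:%Y-%m-%d_%H%M}_{slug}.txt"


def append_segment(path, speaker, text):
    """Append one transcribed segment, tagged with its speaker label."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(f"[{speaker}] {text}\n")


if __name__ == "__main__":
    path = transcript_path("notes", "Teams: Weekly Sync", datetime.now())
    append_segment(path, "SPEAKER_00", "Shall we get started?")
    print(path)
```

Plain append-only text files keep it greppable, which is most of what I want for search later.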
Any good solutions for capturing the audio streams and piping them where they're needed? (I.e. both microphone and speakers. I was wondering if I needed to mess with PulseAudio and/or JACK. I mean, it's PipeWire under the hood, but I think those APIs sit on top and might be clearer.)
Never mind, I played around a little, and PulseAudio's CLI (pactl) makes it easy enough to sling some loopback/virtual devices around that you can then read from.
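For anyone else trying this, here's roughly the shape of it as a sketch, not exactly what I ran. It only builds and prints the commands: a null sink as a virtual mixing device, two loopbacks feeding the mic and the speaker monitor into it, and `parec` reading from the null sink's monitor. The sink name `transcribe_mix`, the output file `mix.wav`, and the use of the `@DEFAULT_SOURCE@` / `@DEFAULT_MONITOR@` placeholders are my assumptions; you may need to swap in real device names from `pactl list short sources`.

```python
import shlex


def loopback_commands(sink_name="transcribe_mix",
                      mic="@DEFAULT_SOURCE@",
                      speakers_monitor="@DEFAULT_MONITOR@",
                      out_file="mix.wav"):
    """Return the pactl/parec command lines for mixing mic + speakers."""
    return [
        # Virtual sink that nothing plays to directly; it acts as the mixer.
        ["pactl", "load-module", "module-null-sink", f"sink_name={sink_name}"],
        # Route the microphone into the virtual sink.
        ["pactl", "load-module", "module-loopback",
         f"source={mic}", f"sink={sink_name}"],
        # Route whatever the speakers are playing into the virtual sink.
        ["pactl", "load-module", "module-loopback",
         f"source={speakers_monitor}", f"sink={sink_name}"],
        # Record the mixed stream from the virtual sink's monitor source.
        ["parec", "-d", f"{sink_name}.monitor", "--file-format=wav", out_file],
    ]


if __name__ == "__main__":
    # Print rather than execute, so you can inspect the commands first.
    for cmd in loopback_commands():
        print(shlex.join(cmd))
```

Each `load-module` call prints a module index when run, which you can pass to `pactl unload-module` to tear things down again.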