I understand that. The problem is that in many scenarios users would want to see transcripts of what they said alongside the model output. Like if I have a chat with a model about choosing a place to move to, I would probably want to review it later. And when I review it, I will see: me: /audio record/ AI: 200-300m. There's no easy way to see at a glance what the AI's answer was about.
You can just run Whisper on the conversations as a background job that populates text versions of all the user inputs, so it doesn't interfere with the real-time latency.
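A minimal sketch of that background job: a worker thread drains a queue of recorded user turns and fills in their transcripts, so the chat loop never blocks on transcription. The `transcribe` function here is a stub standing in for the actual Whisper call (with the `openai-whisper` package it would be roughly `whisper.load_model("base").transcribe(path)["text"]`); the file names are made up for illustration.

```python
import queue
import threading

def transcribe(audio_path):
    # Stub: a real system would invoke Whisper here, e.g.
    # whisper.load_model("base").transcribe(audio_path)["text"]
    return f"[transcript of {audio_path}]"

def worker(jobs, transcripts):
    # Runs off the main thread, so real-time chat latency is unaffected.
    while True:
        audio_path = jobs.get()
        if audio_path is None:  # sentinel: shut down the worker
            jobs.task_done()
            break
        transcripts[audio_path] = transcribe(audio_path)
        jobs.task_done()

jobs = queue.Queue()
transcripts = {}
t = threading.Thread(target=worker, args=(jobs, transcripts), daemon=True)
t.start()

# The chat loop just enqueues each recorded user turn and moves on.
for clip in ["turn_001.wav", "turn_002.wav"]:
    jobs.put(clip)

jobs.put(None)  # stop the worker after the queued turns
jobs.join()     # wait only when the transcripts are actually needed
t.join()
```

When the user later reviews the conversation, the UI can show `transcripts[clip]` next to each audio bubble instead of a bare "/audio record/" placeholder.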