YouTube probably has stats on viewer happiness vs. bitrate.
I would guess that a higher bitrate means longer loading times, and that viewers care far more about a few extra seconds of buffering than they do about audio quality, especially when they don't have the original to compare to.
The audio data is minuscule compared to the video data, and its size is tied to the video quality level. Everything is streamed in chunks, too, so it would only amount to milliseconds of extra buffering.
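To put a rough number on "milliseconds", here is a back-of-the-envelope sketch in Python. Every figure in it is an assumption chosen for illustration (128 vs. 256 kbit/s audio, a 5-second start-up buffer, a 10 Mbit/s connection); none of them are YouTube's actual settings.

```python
def extra_buffer_seconds(low_kbps, high_kbps, buffer_s, link_mbps):
    """Extra download time needed to fill the start-up buffer
    when the audio bitrate goes from low_kbps to high_kbps."""
    # Additional bits that must be fetched before playback can start.
    extra_bits = (high_kbps - low_kbps) * 1000 * buffer_s
    # Time to transfer those bits over the (assumed) link speed.
    return extra_bits / (link_mbps * 1_000_000)


# Assumed figures: 128 -> 256 kbit/s audio, 5 s buffer, 10 Mbit/s link.
delay = extra_buffer_seconds(128, 256, 5, 10)
print(f"{delay * 1000:.0f} ms of extra buffering")
```

With those assumed numbers the difference works out to 64 ms. A slower link scales it up proportionally, so the same change on a 1 Mbit/s connection would cost about 0.64 s.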
But now you're comparing median video bitstream with peak audio bitstream.
YouTube uses variable bitrate for audio, which can vary dramatically in size. Your example of podcasts or "talking heads" is actually perfect. Most encoders are extremely efficient at compressing voices, as they will only have to encode 30-300Hz, and voices have less data variation than images.
Image encoding is just very complex. It'll get better and better, but audio encoders of the same generation will also improve.
Couldn’t you do adaptive bitrate: start streaming at a low bitrate for a few seconds, then switch to higher quality once the video is already playing?
Yes, it's very possible to do this.
Without seeking support it's trivial: just instruct the encoder to encode at a low bitrate for the first few seconds and then increase it.
To support seeking, you could encode a low-bitrate stream, a high-quality stream, and a number of ramps between them. When you seek, you start on the low-bitrate stream and then, after a few time units, move onto a ramp up to the high-quality stream.
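A minimal sketch of that selection logic in Python, under assumptions of my own: the function name `pick_stream`, the stream labels, and the 2-second and 3-second durations are all hypothetical, and a real player would also have to align switches to segment boundaries.

```python
def pick_stream(seconds_since_seek, low_duration=2.0, ramp_duration=3.0):
    """Choose which pre-encoded stream to play, based only on how long
    playback has been running since the last seek (or since the start).

    The durations are illustrative assumptions, not measured values.
    """
    if seconds_since_seek < low_duration:
        return "low"   # cheap to fetch, so playback starts quickly
    if seconds_since_seek < low_duration + ramp_duration:
        return "ramp"  # transitional encode between low and high
    return "high"      # steady-state quality


# Walk through the first few seconds after a seek.
for t in (0.0, 1.5, 2.0, 4.9, 5.0):
    print(t, pick_stream(t))
```

The point of the ramp is that the jump in quality is spread over a few seconds instead of happening as one visible or audible step.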
While nothing that's been said here is inherently wrong per se, a sample YT page load is ~5s to DOMContentLoaded, and without counting the video content, transfers ~7 MiB worth of requests & ~95 requests for me, and visually, the entire page feels like it loads twice. (I thought it was redirecting, but the inspector says nope, that's a single page load.)
… while yeah… a lower bitrate upfront might lower the required bandwidth and thus, latency, to get enough of a buffer to start playback … all the bloat on the page would be a better first port of call.