How many producers and how many consumers was that 650 nanoseconds measured with?
I have pinned threads to even-numbered cores with pthread_setaffinity_np, and that seems to have evened out the MPMC ring buffer: with 2 producers and 2 consumers it is under 400 nanoseconds at best, and usually under 1000 nanoseconds. I think hyperthreading causes problems.
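A minimal sketch of that kind of pinning, assuming Linux with glibc. The helper names pin_current_thread_to_core and even_core_for_worker are hypothetical, and the "worker i goes to logical CPU 2*i" mapping is only an assumption about how even-numbered cores might be assigned:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes pthread_setaffinity_np and the CPU_* macros on glibc */
#endif
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: pin the calling thread to one logical CPU.
   Returns 0 on success, or an errno-style error number on failure. */
static int pin_current_thread_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);        /* start with an empty CPU mask */
    CPU_SET(core, &set);   /* allow exactly one logical CPU */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Hypothetical mapping (an assumption): worker 0 -> CPU 0, worker 1 -> CPU 2, ... */
static int even_core_for_worker(int worker_index)
{
    return 2 * worker_index;
}
```

Each producer or consumer thread would call pin_current_thread_to_core(even_core_for_worker(i)) once at startup. Whether even-numbered logical CPUs really land on separate physical cores depends on the machine; /sys/devices/system/cpu/cpu0/topology/thread_siblings_list shows which logical CPUs share a core.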
EDIT: Would you like to chat about this? I would like to! My email is in my profile.
I’ve been down the path you’re on a few times and I love the pursuit. I have built my own about 4 times over the years.
Hardware was much slower in those days, so my lower bound was 650ns. I found that things got appreciably worse as the number of producers grew.
Some of my most sleepless nights. The funnest nights.