Why would it not be possible to use multiple cores? Even though the plugins depend on the output of the previous one, they could sit on different cores, passing their output on from core to core. Even though that would not be parallel, being distributed, it could be faster (in some cases it might not).
> Why would it not be possible to use multiple cores?
Because the software doesn't do it (much; I've been told some applications do time-delayed mixing for stuff like delay) and the software is entrenched.
I wince every time I see someone say something can't be made parallel, but this actually not the way to do it. You would want a chunk of samples to be dealt with on the same CPU as it goes through plugin transformations. This would give data locality.
Then other CPUs would be free to start the next chunk of samples. The amount of parallelism is going to depend on the buffer size and number of samples each plugin needs to operate.
I don't think you can make a blanket statement here; it's going to really depend on the implementation details.
For example, if each plugin includes any kind of LUT, you don't have data locality either way, and you're much better off passing data between the plugins. If the plugins are complex, you'll be flushing your instruction cache, which will have to be refilled via random access as opposed to the linear reading of an audio segment.
Further, 192khz 24bit audio is only 0.5 megabytes per second. Skylake lists sustained L3 bandwidth as 18 bytes/cycle. This is enough to transfer 100k such audio streams simultaneously. It's very unlikely this is a bottleneck.
There are a lot of assumptions and some misunderstanding here. The data locality is about latency first and foremost. DDR3 at it slowest actually has 30GB /s of bandwidth and DDR4 can get past 70. Memory bandwidth is rarely the issue.
Also instructions shouldn't be huge, but more importantly they don't change. If the audio buffer stays on the same CPU, it doesn't change either.
Don't forget that writing takes time too. Writing can be a big bottleneck. Keep the data local to the same CPU and it doesn't have to go out to main memory yet.
Other things you are saying about 'flushing' the instruction cache, L3 bandwidth numbers and theoretical LUT that make a difference in one scenario and not the other without measuring (even though the whole scenario is made up) just seem like stabs in the dark to argue about vague what-ifs.
Skylake-X L3 latency is ~20ns. So if you build an SPSC queue between them, how many plugins are we chaining up linearly that this becomes an issue, or even a factor? 1000 might get us to 1ms?
OK, so we're left with a single core running a thousand plugins, and instruction cache pressure is a 'stab in the dark to argue about vague what-ifs'?
You take an absolutist view on what is so obviously a complicated trade off and talk down to me to boot. Maybe I know about high performance code, maybe I don't, maybe you do, maybe you don't. But I do know enough about talking to people on the internet to know to nip this conversation in the bud.
> Skylake-X L3 latency is ~20ns. So if you build an SPSC queue between them
The latency is mostly about initial cache misses. There is no reason to take the time to write out a buffer of samples to memory, only to have another CPU access them with a cache miss. One of many things things you are missing here is prefetching. Instructions will be heavily prefetched as will samples when accessesed in any sort of linear fashion.
Also you can't explicit use caches or send data between them, that is going to be up to the CPU, and it will use the whole cache heirarchy.
> You take an absolutist view
Everything dealing with performance needs to be measured, but I have a good idea of how things work so I know what to prioritize and try first. Architecture is really the key to these things and in my replies I've illustrated why.
> Maybe I know about high performance code, maybe I don't
It sounds like you have read enough, but haven't necessarily gone through lots of optimizations and recitified what you know with the results of profiling. Understanding modern CPUs is good for understanding why results happen, but less so for estimating exactly what the results will be when going in blind.
If you were as good as you claim, you would have directly answered my argument instead of hitting a strawman for five paragraphs.
Your experience led to overconfidence and you identified a ridiculous bottleneck for the problem domain. This is complicated and FPU heavy code running on few pieces of tiny data. And yes, riddled with LUTs. The latency cost you're worried about is in the noise.
Instead of doing some back of the envelope calculations and realizing your mistake, you double down, handwave and smugly attack me.
Your conclusions are bullshit, as is your evaluation of my experience. For anyone else that happens to be reading, I suggest taking a look through the source of a few plugins and judging for yourself.
There is no need to be upset, there is no real finality here, everything has to be measured.
That being said the LUTs would follow the same pattern as execution - all threads would use them and if they are a part of the executable they don't change. This combined with prefetching and out of order instructions means that their latency is likely to be hidden by the cache.
New data coming through however would be transformed, creating more new data. While the instructions and LUTs aren't changing the new data being created on each transformation can either be kept locally so it doesn't incur the same write back penalties and cache misses by
due to allocating new memory, writing to it and eventually getting it to another CPU.
If the same CPU is working on the same memory buffer there is no need to try to allocate them for every filter or manage lifetimes and ownership of various buffers.
If you took time to read the code linked, you'd notice two things:
1) It's very common for the processing of samples to not be independent, but have iterative state; for example delay effects, amplifiers, noise gates...
2) The work done per sample is substantial with nested loops, trig functions and hard to vectorize patterns
So not only does your technique break the model of the problem domain, the L3 latency you're so worried about when retrieving a block of samples is comparable to a single call to sin, which in some cases we're doing multiple times per sample.
Now you conflate passing data between threads with memory allocation, as though SPSC ring buffers aren't a trivial building block. This is after lecturing me on my many "misunderstandings"... if you're willing to assume I'm advocating malloc in the critical path (!?), no wonder you're finding so many.
I'm not upset, I'm just being blunt. Ditch the cockiness, or at least reserve it for when your arguments are bulletproof.
I'm not sure where this is coming from. If one cpu is generating new data and another CPU is picking it up, it's wasting locality. If lots of new data is generated it might get to other CPUs though shared cache or memory, but either way it isn't necessary.
Data accessed linearly is prefetched and latency is eventually hidden. This, combined with the fact that instructions aren't changing and are usually tiny in comparison, is why instruction locality is not the primary problem to solve.
The difference it makes it up to measurement, but trying to pin one filter per core is a simplistic and naive answer. It implies that concurrency is dependent on how many different transformations exist, when the reality is that the number of cores.that can be utilized will come down to the number of groups of data that can be dealt with without dependencies.
> SPSC ring buffers
That's a form of memory allocation. When you fabricate something to argue against, that's called a straw man fallacy.
Are you claiming the act of writing bytes to a ring buffer as being memory allocation? In that case I misunderstood what you were saying and it was indeed a straw man.
In any case, we're clearly not going to find common ground here.
These cannot run at the same time as the output of one feeds into another one. Data travelling from one core to another could mean additional performance loss. Some plugins use multiple cores if whatever they calculate can be parallelised, but still the quicker it can be done the more plugins you can run in your chain.
This is silly. A bottleneck for audio processing is a particular product's flaw, not an intrinsic challenge of audio. A modern machine capable of doing interactive, high-resolution graphics rendering or high-definition movie rendering can do a stupendous amount of audio processing without even trying.
The data rates for real-time audio are so much smaller than modern memory system capabilities that we can almost ignore them. A 192 kHz, 24-bit, 6-channel audio program is less than 3 MB/s, thousands of times slower than a modern workstation CPU and memory system can muster.
The stack of audio filters you describe are a natural fit for pipelined software architectures, and such architectures are trivially mapped to pipelined parallel processing models. Whatever buffer granularity one might make in a single-threaded, synchronous audio API to relay data through a sequence of filter functions can be distributed into an asynchronous pipeline, with workers on separate cores looping over a stream of input sample buffers. It just takes an SMP-style queue abstraction to handle the buffer relay between the workers, while each can invoke a typical synchronous function. Also, because these sorts of filters usually have a very consistent cost regardless of the input signal, they could be benchmarked on a given machine to plan an efficient allocation of pipeline stages to CPU cores (or to predict that the pipeline is too expensive for the given machine).
Finally, audio was a domain motivating DSPs and SIMD processing long before graphics. An awful lot of audio effects ought to be easily written for a high performance SIMD processing platform, just like custom shaders in a modern video game are mapped to GPUs by the graphics driver.
I don't think you're wrong in a technical sense, but the human factors in a contemporary DAW environment are imposing a huge penalty on what's possible.
The biggest issue is that we're using plugins written by third parties to a few common standards. Even when the plugins themselves are not trying to make use of a multicore environment, you still get compatibility bugs and various taxes on re-encoding input and output streams to the desired bit depth and sample rate. It can really throw a wrench into optimizing at the DAW level because you can't just go in and fix the plugins to do the right thing.
Then add in the widely varying quality of the plugin developers, from "has hand-tuned efficient inner loops for different instruction set capabilities" to "left in denormal number processing, so the CPU dies when the signal gets quiet." Occasionally someone tries to do a GPU-based setup, only to be disappointed by memory latency becoming the bottleneck on overall latency(needless to say, latency is really prioritized over throughput in real-time audio).
Finally, the skillsets of the developers tend to be math-heavy in the first place: the product they're making is often something like a very accurate simulation of an analog oscillator or filter model, which takes tons of iterations per sample. Or something that is flinging around FFTs for an effect like autotune. They are giving the market what it wants, which is something that is slightly higher quality and probably dozens or hundreds of times more resource-hungry to process one channel.
If all you're doing is mixing and simple digital filters, you're in a great place: you can probably do hundreds of those. But we've managed to invent our way into new bottlenecks. And at the base of it, it's really that the tooling is wrong and we do need a DSP-centric environment like you suggest. (SOUL is a good candidate for going in this direction.)
This is a simple fact of life and downvoting isn't going to change it.
Plugin cannot start processing before it gets data from previous plugin (sure it can do some tricks like pre-computing coefficients for filters etc). How are you going to get around it?
What's happening within a plugin of course can be parallelised, but other than that, the processing is inherently serial.
If a computing a filter takes X time and a length of the buffer is Y you can only compute so many filters (Y/X) before it starts stuttering. You can spread that across different cores, but these filters cannot be processed at the same time, because each needs the output of the previous one.
Pipelining means that each stage further down the pipeline is processing an "earlier" time window than the previous stage. They don't run concurrently to speed up one buffer, but they run concurrently to sustain the throughput while having more active filters.
For N stages, instead of having each filter run at 1/N duty cycle, waiting for their turn to run, they can all remain mostly active. As soon as they are done with one buffer, the next one from the previous pipeline stage is likely to be waiting for them. This can actually lower total latency and avoid dropouts because the next buffer can begin processing in the first stage as soon as the previous buffer has been released to the second stage.
I think this is one of the most misunderstood problem these days.
Your idea could work if the process wasn't real-time. In real-time audio production scenario you cannot predict what event is going to happen so you cannot simply just process next buffer, because you won't know in advance what is needed to be processed.
At the moment these pipelines are as advanced as they can be and there is simply no way around being able to process X filters in Y amount of time to work in real-time.
If you think you have an idea that could work, you could solve one of the biggest problems music producers face that is not yet solved.
Something like a filter chain for an audio stream is truly the textbook candidate for pipelined concurrency. Conceptually, there are no events or conditional branching. Just a methodical iteration over input samples, in order, producing output samples also in order.
can instead be written as a set of concurrent worker loops.
Each worker is dedicated to running a specific filter function, so its internal state remains local to that one worker. Only the intermediate sample buffers get relayed between the workers, usually via a low-latency asynchronous queue or similar data structure. If a particular filter function is a little slow, the next stage will simply block on its input receive step until the slow stage can perform the send.
This is how it is typically being done. This is not a problem. Problem is that being concurrent, end to end this process is serial, so you can't process any element of this pipeline in parallel. You can run only so many of those until you run out of time to fill the buffer.
I think it could be helpful for you to watch this video:
https://www.youtube.com/watch?v=cN_DpYBzKso
Sorry for the late reply. We have to consider two kinds of latency separately.
A completely sequential process would have a full end-to-end pipeline delay between each audio frame. The first stage cannot start processing a frame until the last stage has finished processing the previous frame. In a real-time system, this turns into a severe throughput limit, as you start to have input/output overflow/underflow. The pipeline throughput is the reciprocal of the end-to-end frame delay.
But, concurrent execution of the pipeline on multiple CPU cores means that you can have many frames in flight at once. The total end-to-end delay is still the sum of the per-stage delays, but the inter-frame delay can be minimized. As soon as a stage has completed one frame, it can start work on the next in the sequence. In such a pipeline, the throughput is the reciprocal of the inter-frame delay for the slowest stage rather than of the total end-to-end delay. The real-time system can scale the number of pipeline stages with the number of CPU cores without encountering input/output overflow/underflow.
Because frame drops were mentioned early on in this discussion, I (and probably others who responded) assumed we were talking about this pipeline throughput issue. But, if your real-time application requires feedback of the results back into a live process, i.e. mixing the audio stream back into the listening environment for performers or audience, then I understand you also have a concern about end-to-end latency and not just buffer throughput.
One approach is to reduce the frame size, so that each frame processes more quickly at each stage. Practically speaking, each frame will be a little less efficient as there is more control-flow overhead to dispatch it. But, you can exploit the concurrent pipeline execution to absorb this added overhead. The smaller frames will get through the pipeline quickly, and the total pipeline throughput will still be high. Of course, there will be some practical limit to how small a frame gets before you no longer see an improvement.
Things like SIMD optimization are also a good way to increase the speed of an individual stage. Many signal-processing algorithms can use vectorized math for a frame of sequential samples, to increase the number of samples processed per cycle and to optimize the memory access patterns too. These modern cores keep increasing their SIMD widths and effective ops/cycle even when their regular clock rate isn't much higher. This is a lot of power left on the table if you do not write SIMD code.
And, as others have mentioned in the discussion, if your filters do not involve cross-channel effects, you can parallelize the pipelines for different channels. This also reduces the size of each frame and hence its processing cost, so the end-to-end delay drops while the throughput remains high with different channels being processed in truly parallel fashion.
Even a GPU-based solution could help. What is needed here is a software architecture where you run the entire pipeline on the GPU to take advantage of the very high speed RAM and cache zones within the GPU. You only transfer input from host to GPU and final results back from GPU to host. You will use only a very small subset of the GPU's processing units, compared to a graphics workload, but you can benefit from very fast buffers for managing filter state as well as the same kind of SIMD primitives to rip through a frame of samples. I realize that this would be difficult for a multi-vendor product with third-party plugins, etc.
Assuming your samples are of duration T, and you need X CPU time to fully process a sample through all filters. Pipelining allows you to process audio with X > T, nearly X = N * T for N cores, but your latency is still going to be X.
If it is possible to process with small samples (T), with roughly correspondingly small processing time (X), there shouldn't be a problem keeping the latency small with pipelining. If filters depend on future data (lookahead), it is plausible reducing T might not be possible. Otherwise, it should be mostly a problem of weak software design and lots of legacy software and platforms.
You cannot run the pipeline in parallel. Sure you can have a pipeline and work the buffers on separate cores, but the process is serial. If it was as simple as you think it would have been solved years ago. There are really bright heads working in this multi billion industry and they can't figure that out. Probably because that involves predicting the future.
> These cannot run at the same time as the output of one feeds into another one.
This precludes parallel processing of individual packets, but does not prevent concurrent processing of packets.
Plugin A accepts a packet, processes it, outputs it. Plugin B accepts a packet from A, processes it, outputs it. Plugin C accepts a packet from B, processes it, outputs it. [...] Plugin G accepts a packet from F, processes it, outputs it.
Everything is serial so far. Got it. Here's the thing though: Plugin A processes packet n, Plugin B processes packet n-1, Plugin C processes packet n-2, [...] Plugin G processes packet n-6. Now you have 7 independent threads processing 7 independent data packets. As long as the queues between plugins are suitably small you won't introduce latency.
The mental model here should be familiar to anyone in the music industry; each pedal between the instrument and the amp is a plugin, each wire is a queue. Each pedal processes its data concurrently (but not parallel with) with every other pedal.
It's relatively common in game development for AI/physics to generate the data for frame n, while graphics displays frame n-1. (there's a natural, fairly hard sequential barrier separating physics from graphics, and there's a hard sequential barrier when the frame is finally shipped off to the GPU) Especially on consoles that have 8 core CPUs but each core is really slow. PS4/XBoxOne use the AMD Jaguar architecture, which was the mobile variant of Excavator. The single core performance of these CPUs are absolutely atrocious, but the devs make it work for latency sensitive activities like gaming.
> Data travelling from one core to another could mean additional performance loss.
Only if it is evicted from the L3 cache, and the 3950X has 64MB of it. That's over a second(!!) of latency at 16 channel+192kHz+32 bits/sample audio.
Speaking of channels, that seems like a natural opportunity for parallelism.
I get that legacy code is legacy code, and a framework designed to run optimally on Netburst isn't necessarily going to run optimally on Zen 2. (or any other CPU from the past decade) But this is an institutional problem, not a technical one. It sounds to me like somebody needs to bite the bullet and make some breaking changes to the framework.
> Everything is serial so far. Got it. Here's the thing though: Plugin A processes packet n, Plugin B processes packet n-1, Plugin C processes packet n-2, [...] Plugin G processes packet n-6. Now you have 7 independent threads processing 7 independent data packets. As long as the queues between plugins are suitably small you won't introduce latency.
The process is realtime so you cannot receive events ahead of time. It is actually running how you describe, but you can only process so much during the length of a single buffer. Typically solution is to increase the length of the buffer, but that increases latency or reduce the length of the buffer but that introduces overhead.
> Each pedal processes its data concurrently (but not parallel with) with every other pedal.
That's how it works.
> The single core performance of these CPUs are absolutely atrocious, but the devs make it work for latency sensitive activities like gaming.
I am talking about realistic simulations. You can definitely run simple models without latency, that's not a problem.
> Only if it is evicted from the L3 cache, and the 3950X has 64MB of it. That's over a second(!!) of latency at 16 channel+192kHz+32 bits/sample audio.
That's nothing. Typical chain can consists of dozens of plugins times dozens of channels.
There is no problem with such simple case as running 16 channels with simple processing.
> Speaking of channels, that seems like a natural opportunity for parallelism.
That works pretty well. If you are able to run you single chain in realtime you can typically run as many of them as you have available cores.