As for bandwidth: matrix multiplications happen mostly in cache, which has a lot more bandwidth than RAM. Blocks of the matrix are loaded into cache (explicitly, via shared memory, in CUDA) and reused multiple times there.
I'd exploit the better multi-level cache hierarchy on CPUs and make the code NUMA-aware. But I still wouldn't bet against a recent GPU card.
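Roughly what I mean by "explicitly": a minimal sketch of a shared-memory tiled kernel (tile size, names and the square-matrix assumption are just for illustration, not a tuned implementation):

    // Sketch of a tiled matrix multiply C = A * B for square N x N matrices,
    // assuming N is a multiple of TILE. Each tile of A and B is loaded from
    // global memory once and then reused TILE times out of shared memory.
    #define TILE 16

    __global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N / TILE; ++t) {
            // The explicit "load block into cache": one global read per element...
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();

            // ...followed by TILE multiply-adds served from shared memory (the reuse).
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;
    }

Each global load is amortized over TILE flops, which is why the kernel stops being limited by DRAM bandwidth once the tiles fit in shared memory.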
> As for bandwidth: matrix multiplications happen mostly in cache, which has a lot more bandwidth than RAM. Blocks of the matrix are loaded into cache (explicitly, via shared memory, in CUDA) and reused multiple times there.
The post is about dot products, not matrix multiplies. A dot product has no data reuse: every element is read exactly once, so it's limited by memory bandwidth no matter how you block it.
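For contrast, a minimal dot-product kernel (grid-stride loop with an atomic accumulation, purely for illustration) makes the lack of reuse visible:

    // Sketch: each element of x and y is fetched from global memory exactly once,
    // so there is nothing worth staging in shared memory and reusing.
    // *result must be zero-initialized before launch.
    __global__ void dot(const float* x, const float* y, float* result, int n) {
        float acc = 0.0f;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            acc += x[i] * y[i];     // one load of x[i], one of y[i], never touched again
        atomicAdd(result, acc);     // crude reduction, fine for the point being made
    }

One multiply-add per two loads, so the arithmetic units mostly sit idle waiting on DRAM; blocking or caching can't change that ratio.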