So everyone is supposed to do all of their testing on H100s?

alecco · on Dec 11, 2024

The 4090 (2022) with PCIe 4.0 x16 is quite decent. The major limit is memory, not bandwidth. The 3090 (2020) is also PCIe 4.0 x16, and used cards are a bargain. You can hook 3090s up with NVLink.

Nvidia is withholding new releases, but the current hardware has more legs with new matrix implementations. FlashAttention, for example, delivers a significant improvement every six months or so.

Nvidia could make consumer chips with a combined CPU and GPU. I guess they are too busy making money with the big cloud providers. Maybe somebody else will pick it up. Apple is already doing something like that, even on laptops.

_zoltan_ · on Dec 11, 2024

Get a GH100 on Lambda and behold: you have 900 GB/s between CPU memory and GPU, and you can forget PCIe.

imtringued · on Dec 12, 2024

That doesn't add up, because you're now exceeding the bandwidth of the CPU's memory controller. I.e., it would be faster to do everything, including the CPU-only algorithms, in far-away VRAM.

gpderetta · on Dec 12, 2024

Latency might be higher, and latency usually dominates CPU-side algorithms.

saagarjha · on Dec 12, 2024

Where are you seeing 900 GB/s?

KeplerBoy · on Dec 12, 2024

900 GB/s is a figure often cited for Hopper-based SXM boards; it's the aggregate of 18 NVLink connections. So it's more of a many-to-many GPU-to-GPU bandwidth figure.

https://www.datacenterknowledge.com/data-center-hardware/nvi...
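
For anyone who wants to sanity-check the numbers, here is a rough back-of-the-envelope sketch in Python. The 18 links and the 900 GB/s total come from the comment above; the per-link rate (50 GB/s bidirectional, as usually quoted for fourth-generation NVLink) and the PCIe 4.0 signalling parameters (16 GT/s per lane, 128b/130b encoding) are assumptions brought in from the published specs, not from this thread. The function names are made up for illustration, and the results are theoretical peaks, not measured throughput.

```python
# Back-of-the-envelope bandwidth arithmetic, all results in GB/s.
# Assumptions (not stated in the thread): per-link NVLink rate and
# PCIe 4.0 signalling parameters. These are theoretical peak numbers.

NVLINK_LINKS = 18              # link count quoted in the thread
NVLINK_PER_LINK_BIDIR = 50.0   # assumed: 25 GB/s in each direction per link

PCIE_LANES = 16                # x16 slot
PCIE4_GT_PER_LANE = 16.0       # assumed: PCIe 4.0 raw rate in GT/s per lane
PCIE4_ENCODING = 128 / 130     # assumed: 128b/130b line encoding


def nvlink_aggregate_bidir(links=NVLINK_LINKS, per_link=NVLINK_PER_LINK_BIDIR):
    """Sum of both directions across all links."""
    return links * per_link


def pcie_per_direction(lanes=PCIE_LANES, gt_per_lane=PCIE4_GT_PER_LANE,
                       encoding=PCIE4_ENCODING):
    """Usable Gbit/s per lane after encoding overhead, converted to GB/s."""
    return lanes * gt_per_lane * encoding / 8


nvlink_bidir = nvlink_aggregate_bidir()
nvlink_one_way = nvlink_bidir / 2   # if the aggregate counts both directions
pcie_one_way = pcie_per_direction()

print(f"NVLink aggregate, both directions: {nvlink_bidir:.0f} GB/s")
print(f"NVLink aggregate, one direction:   {nvlink_one_way:.0f} GB/s")
print(f"PCIe 4.0 x16, one direction:       {pcie_one_way:.1f} GB/s")
```

Under those assumptions the aggregate works out to 900 GB/s counting both directions, versus roughly 31.5 GB/s per direction for a PCIe 4.0 x16 slot.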

shaklee3 · on Dec 13, 2024

This is bidirectional bandwidth, and it's GH200, not GH100.