
While Vulkan can be a good fallback, for LLM inference at least, the performance difference is not as insignificant as you believe. I just re-ran a test on llama.cpp HEAD to make sure this is still the case: text generation is 44% faster and prompt processing is 202% faster (~3X) with ROCm than with Vulkan.

Note: if you're building llama.cpp, all you have to do is swap GGML_HIPBLAS=1 and GGML_VULKAN=1 so the extra effort is just installing ROCm? (vs the Vulkan devtools)

ROCm:

  CUDA_VISIBLE_DEVICES=1 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf
  ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
  ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
  | model                          |       size |     params | backend    | ngl |          test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
  | llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |         pp512 |      3258.67 ± 29.23 |
  | llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |         tg128 |        103.31 ± 0.03 |

  build: 31ac5834 (3818)

Vulkan:

  GGML_VK_VISIBLE_DEVICES=1 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf
  | model                          |       size |     params | backend    | ngl |          test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
  ggml_vulkan: Found 1 Vulkan devices:
  Vulkan0: Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64
  | llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         pp512 |       1077.49 ± 2.00 |
  | llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         71.83 ± 0.06 |

  build: 31ac5834 (3818)
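For anyone who wants to check the percentages, here is a quick sketch of the arithmetic. The t/s means are copied from the two tables above (the ± spread is ignored), and the `speedup` helper is just a name made up for illustration, not anything from llama.cpp.

```python
# Mean throughput (tokens/s) copied from the two llama-bench tables above.
rocm = {"pp512": 3258.67, "tg128": 103.31}
vulkan = {"pp512": 1077.49, "tg128": 71.83}


def speedup(fast, slow):
    """Return (ratio, percent_faster) of `fast` relative to `slow`."""
    ratio = fast / slow
    return ratio, (ratio - 1.0) * 100.0


for test in ("pp512", "tg128"):
    ratio, pct = speedup(rocm[test], vulkan[test])
    print(f"{test}: ROCm is {ratio:.2f}x Vulkan (+{pct:.0f}%)")
```

This should print roughly 3.02x (+202%) for pp512 and 1.44x (+44%) for tg128, which is where the numbers at the top of the comment come from.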

EDIT: HN should really support markdown...


> ...so the extra effort is just installing ROCm? (vs the Vulkan devtools)

The problem with ROCm is that for non-bleeding-edge AMD cards you have to install an out-of-date, unsupported version of it, because the $current version does not support your card. And that means containerization woes. If you're going to spend $800 on a top-of-the-line, current-generation video card anyway, then you'll have fewer problems (for a few years).

Also, the Vulkan vs. ROCm performance difference is smaller on cards that are not bleeding-edge or top-of-the-line.


The Radeon RX 7900 XTX is RDNA3, but I wonder whether llama.cpp's Vulkan backend actually uses the hardware matrix instructions (WMMA on RDNA3, MFMA on CDNA).

I have not noticed any remarkable differences between Vulkan and ROCm when using IREE, but it's not a turnkey solution yet[1].

[1] <https://github.com/nod-ai/sharktank/blob/main/docs/model_coo...>


Any chance we might see Vulkan extensions to close this performance gap? I was really hoping Intel and AMD would team up to create an open standard that we could all have installed by default, but instead we get these clumsy vendor-specific solutions...


I think that it is very unlikely that the performance difference is caused by anything that could be solved with a Vulkan extension.

Vulkan only exposes the raw compute capabilities of the hardware. Any well-optimized Vulkan application can reach full performance, but someone has to write that optimized code.

On the other hand, ROCm, like CUDA, includes optimized libraries for certain applications, like rocBLAS.

It is likely that the ROCm backend here uses optimized library functions, perhaps from rocBLAS, while the Vulkan backend might use generic linear algebra routines that are not tuned for AMD GPUs.



