
I'm curious about both:

- what's special about the memory allocation, and how might it help me?

- what are you now using instead of ollama?



Ollama does a nice job of looking at how much VRAM the card has and tuning the number of GPU layers offloaded accordingly. Before that, I mainly just had to guess. It's still a heuristic, but I thought it was neat.
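A minimal sketch of that kind of heuristic (all numbers and the headroom figure are hypothetical; real offload sizing also depends on quantization, KV cache size, and context length):

```python
def estimate_gpu_layers(free_vram_bytes, model_size_bytes, n_layers,
                        headroom_bytes=512 * 1024**2):
    """Crude estimate of how many transformer layers fit in VRAM.

    Assumes each layer takes an equal share of the model's file size,
    reserves some headroom for the KV cache and CUDA context, and
    offloads as many layers as fit.
    """
    per_layer = model_size_bytes / n_layers
    usable = max(0, free_vram_bytes - headroom_bytes)
    return min(n_layers, int(usable // per_layer))

# e.g. an 8 GiB model with 32 layers on a card with 6 GiB free:
print(estimate_gpu_layers(6 * 1024**3, 8 * 1024**3, 32))  # 22
```

Whatever Ollama actually does is more sophisticated than this, but the shape of the calculation is the same: free VRAM minus overhead, divided by an estimated per-layer cost.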

I'm now using llama.cpp as a native library, mainly for direct access to more of its data structures, and because I have a somewhat unique sampler setup.
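Direct access to the logits is what makes a custom sampler possible. For illustration, a hand-rolled temperature + top-k sampling step over raw logits might look like this (plain-Python stand-in, not llama.cpp's actual API):

```python
import math
import random

def sample_custom(logits, temperature=0.8, top_k=40, rng=random):
    """Sample a token id from raw logits with top-k filtering
    and temperature-scaled softmax."""
    # Keep only the indices of the top_k highest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i],
                 reverse=True)[:top_k]
    # Temperature-scaled softmax over the survivors
    # (subtract the max for numerical stability)
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw a token id according to those probabilities
    return rng.choices(top, weights=probs, k=1)[0]

# With top_k=1 this degenerates to greedy argmax:
print(sample_custom([0.1, 5.0, 1.0], top_k=1))  # 1
```

Running a sampler in your own code like this, rather than through a server's fixed sampling options, is the kind of thing the native-library route buys you.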


Oh right... I've just been guessing, trying to find the value one below the one that triggers CUDA OOM errors.





