The first 3 steps GP provided are literally just the steps for installation. The "value" you mentioned is just a packaged installer (or, in the case of Linux, apparently a `curl | sh` -- and I'd much prefer the git clone version).
On multiple occasions I've modified llama.cpp code directly and recompiled it for my own purposes. If you're using ollama on the command line, I'd say having the option to easily do that is much more useful than saving a couple of commands at install time.
It should be fairly obvious that one can find alternative models and use them in the above command too.
Look, I’m not arguing that a prebuilt binary that handles model downloading has no value over a source build and manually pulling down gguf files. I just want to dispel some of the mystery.
Local LLM execution doesn’t require some mysterious voodoo that can only be done by installing and running a server runtime. It’s just something you can do by running code that loads a model file into memory and feeds tokens to it.
More programmers should be looking at llama.cpp's language bindings than at Ollama's implementation of the OpenAI API.
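To make that concrete, here's a minimal sketch with the llama-cpp-python bindings (the model path and prompt are just placeholders; point it at whatever gguf file you've downloaded):

```python
# Minimal sketch: load a gguf file and run one completion with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048)
out = llm("Q: Name the planets in the solar system. A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```

That's the whole "runtime": a library call that loads weights into memory and samples tokens.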
There are 5 commands in that README two comments up, and 4 of them can reasonably fail (I'll give `cd` high marks for reliability). `make` especially is a minefield and usually involves a half-hour of searching the internet to figure out which dependencies are the problem today. And that all assumes someone is comfortable with compiled languages. I'd hazard most devs these days are from JS land and don't know how to debug a `make` failure.
Finding the correct model weights is also a challenge in my experience: there are a lot of alternatives, and it's often difficult to figure out what the differences are and whether they matter.
The README makes it clear that I'm probably about to lose an hour debugging if I follow it. It might be one of those rare cases where it works the first time, but that is the exception, not the rule.
Just be aware that there’s a lot of expressive difference between building on top of an HTTP API vs on top of a direct interface to the token sampler and model state.
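To illustrate: an HTTP completions endpoint hands you finished text, whereas the bindings let you walk generation token by token and steer the sampler as you go. A rough sketch with llama-cpp-python (the model path is a placeholder, and method names may shift between releases):

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf")

# Drive generation one token at a time instead of asking for a finished string.
tokens = llm.tokenize(b"Once upon a time")
for tok in llm.generate(tokens, top_k=40, top_p=0.95, temp=0.8):
    if tok == llm.token_eos():
        break
    # At each step you can inspect state, tweak sampling, or stop early.
    print(llm.detokenize([tok]).decode("utf-8", errors="ignore"), end="", flush=True)
```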
Python’s as good a choice as any for the application layer. You’re either going to be using PyTorch or llama-cpp-python to get the CUDA stuff working; both rely on natively compiled C/C++ code to access GPUs and manage memory at the scale LLMs need. I’m not fully up to speed on the current state of the game there, but my understanding is that llama.cpp’s less generic approach has let it focus specifically on optimizing performance for llama-style LLMs.
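For what it's worth, on the llama-cpp-python side the GPU part mostly comes down to building the wheel with CUDA/ROCm enabled and then telling it how many layers to offload. A rough sketch (the model path is a placeholder, and the build flags have changed names across versions):

```python
from llama_cpp import Llama

# n_gpu_layers=-1 asks to offload all layers to the GPU (assuming the wheel was
# built with CUDA or ROCm support); smaller values split layers with the CPU.
llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf", n_gpu_layers=-1)
```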
I've seen more of the model fiddling, like logits restrictions and layer dropping, implemented in Python, which is why I ask.
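By logits restrictions I mean things like this -- a library-agnostic sketch of masking the logits before sampling, not any particular project's API:

```python
import numpy as np

def sample_restricted(logits: np.ndarray, allowed_ids: list[int], temperature: float = 0.8) -> int:
    """Mask every token except the allowed ids, then sample from what's left."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    probs = np.exp((masked - masked.max()) / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```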
Most of AI has centralized around Python, and I see more of my code moving that way; for example, I'm now using LlamaIndex as my primary interface, which supports ollama and many other model loaders / APIs.
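For example, pointing LlamaIndex at a local ollama server looks roughly like this (the exact import path depends on the LlamaIndex version; older releases exposed it as `from llama_index.llms import Ollama`):

```python
# Use a locally running ollama server as the LLM behind LlamaIndex.
from llama_index.llms.ollama import Ollama

llm = Ollama(model="mixtral", request_timeout=120.0)
print(llm.complete("Why is the sky blue?"))
```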
There is no "next"; there is a whole world of people running LLMs locally on their computers, and they are far more likely to switch between models on a whim every few days.
The average user isn't going to compile llama.cpp. They will either download a fully integrated application that bundles llama.cpp and can read gguf files directly, like kobold.cpp, or they will use an arbitrary front end like Silly Tavern, which needs to connect to an inference server via an API; ollama is one of the easier inference servers to install and use.
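The API part is pretty simple from a front end's point of view; a rough sketch of a non-streaming call against ollama's default local port:

```python
import requests

# One-shot (non-streaming) generation request to a locally running ollama server.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mixtral", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```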
I was trying to get AMD GPU support going in llama.cpp a couple of weeks ago and just gave up after a while. `rocminfo` shows that I have a GPU and, presumably, ROCm installed, but there were build problems I didn't feel like sorting out just to play with an LLM for a bit.
`ollama run mixtral`
That's it. You're running a local LLM. I have no clue how to run llama.cpp
I got Stable Diffusion running and I wish there were something like ollama for it. It was painful.