
The answer to your question is:

ollama run mixtral

That's it. You're running a local LLM. I have no clue how to run llama.cpp.

I got Stable Diffusion running and I wish there was something like ollama for it. It was painful.



On a Mac, https://drawthings.ai is the ollama of Stable Diffusion.


For me, ComfyUI made the process of installing and playing with SD about as simple as a Windows installer.


The README is pretty clear, though it covers a lot of optional steps you don't need. It's essentially going to be something like:

   git clone https://github.com/ggerganov/llama.cpp.git
   cd llama.cpp
   make
   wget -O mixtral-8x7b-v0.1.Q4_K_M.gguf 'https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/resolve/main/mixtral-8x7b-v0.1.Q4_K_M.gguf?download=true'
   ./main -m ./mixtral-8x7b-v0.1.Q4_K_M.gguf -n 128


For us this may seem like a walk in the park.

For non-technical people, there's a good chance their OS doesn't have git, wget, or a C++ compiler (especially on Windows).

This is just like the Dropbox case years ago.


This shows the value ollama provides.

I only need to know the model name and then run a single command.


The first 3 steps GP provided are literally just the steps for installation. The "value" you mentioned is just a packaged installer (or, in the case of Linux, apparently a `curl | sh` -- and I'd much prefer the git clone version).

On multiple occasions I've been modifying llama.cpp code directly and recompiling for my own purposes. If you're using ollama on the command line, I'd say having the option to easily do that is much more useful than saving a couple commands upon installation.


When I get to the point of modification, I will go with Python. That's where the AI ecosystem largely is.

I stopped using C++ when Go came out, no interest in ever having to write it again.


It should be fairly obvious that one can find alternative models and use them in the above command too.

Look, I’m not arguing that a prebuilt binary that handles model downloading has no value over a source build and manually pulling down gguf files. I just want to dispel some of the mystery.

Local LLM execution doesn’t require some mysterious voodoo that can only be done by installing and running a server runtime. It’s just something you can do by running code that loads a model file into memory and feeds tokens to it.

More programmers should be looking at llama.cpp language bindings than at Ollama's implementation of the OpenAI API.
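
To make that concrete, here's a minimal sketch with the llama-cpp-python bindings (the model path assumes the gguf from upthread; the prompt and parameters are just illustrative):

   from llama_cpp import Llama

   # Loads the gguf weights into memory; n_ctx is the context window.
   llm = Llama(model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf", n_ctx=2048)

   # Feed a prompt, sample up to 128 tokens.
   out = llm("Q: Name the planets in the solar system. A:", max_tokens=128)
   print(out["choices"][0]["text"])

No server, no HTTP: just a library call that loads a model file and feeds tokens to it.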


There are 5 commands in that README two comments up, and 4 of them can reasonably fail (I'll give cd high marks for reliability). `make` especially is a minefield, and usually involves a half-hour of searching the internet to figure out which dependencies are a problem today. And that all assumes someone is comfortable with compiled languages; I'd hazard most devs these days are from JS land and don't know how to debug make.

Finding the correct model weights is also a challenge in my experience: there are a lot of alternatives, and it's often difficult to figure out what the differences are and whether they matter.

The README makes it clear that I'm probably about to lose an hour to debugging if I follow it. It might be one of those rare cases where everything works the first time, but that's the exception, not the rule.


Your mileage may vary. It runs first time for me on an Apple Silicon Mac.


I'd rather focus on building on top of LLMs than going lower level.

Ollama makes that super easy. I tried llama.cpp first and hit build issues. Ollama worked out of the box.


Sure.

Just be aware that there’s a lot of expressive difference between building on top of an HTTP API vs on top of a direct interface to the token sampler and model state.
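
For example, everything you can do over the HTTP API has to fit in a request payload. A sketch against Ollama's /api/generate endpoint (assuming the default localhost port; available options vary by version):

   import requests

   # The JSON body is the whole interface: no hook into per-token
   # sampling or model state, just request in, completion out.
   resp = requests.post(
       "http://localhost:11434/api/generate",
       json={"model": "mixtral", "prompt": "Hello", "stream": False},
   )
   print(resp.json()["response"])

A bindings-level interface, by contrast, lets you intervene between every token.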


I'm aware; I don't need that amount of sophistication yet.

Python seems to be the way to go deeper, though. Is there a good reason I should be aware of to pick llama.cpp over Python?


Python's as good a choice as any for the application layer. You're either going to be using PyTorch or llama-cpp-python to get the CUDA stuff working; both rely on native compiled C/C++ code to access GPUs and manage memory at the scale needed for LLMs. I'm not actually up to speed on the current state of the game there, but my understanding is that llama.cpp's less generic approach has allowed it to focus on specifically optimizing the performance of llama-style LLMs.
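
If you do go the llama-cpp-python route, GPU offload is just a constructor argument (a sketch; n_gpu_layers=-1 asks it to offload every layer, assuming the wheel was built with GPU support):

   from llama_cpp import Llama

   # Offload all layers to the GPU; runs on CPU if the library
   # wasn't compiled with CUDA/Metal support.
   llm = Llama(
       model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf",
       n_gpu_layers=-1,
   )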


I've seen more of the model fiddling, like logits restrictions and layer dropping, implemented in Python, which is why I ask.

Most of AI has centralized around Python, and I see more of my code moving that way; for example, I'm using LlamaIndex as my primary interface now, which supports ollama and many other model loaders / APIs.
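
The kind of logits restriction I mean is only a few lines in Python. A sketch using the HF transformers LogitsProcessor interface (the class name and banned ids are mine):

   from transformers import LogitsProcessor

   class BanTokens(LogitsProcessor):
       # Mask out chosen token ids before sampling by setting
       # their scores to -inf.
       def __init__(self, banned_ids):
           self.banned_ids = banned_ids

       def __call__(self, input_ids, scores):
           scores[:, self.banned_ids] = float("-inf")
           return scores

You'd pass an instance to model.generate inside a LogitsProcessorList.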


And what will you do after trying it? Sure, you saved a few mins in trying out a model or models. What next?


I focus on building the application rather than figuring out someone else's preferred method for how I should work?

I use Docker Compose locally, Kubernetes in the cloud

I run in hot-reload locally, I build for production

I often nuke my database locally, but I run it HA in production

It is very rare to use the same technology locally (or the same way) as in production


There is no "next"; there's a whole world of people running LLMs locally on their computers, and they're far more likely to switch between models on a whim every few days.


Relax. Not everything in this world was built exactly for you. You almost seem to have a problem with this.


>Hacker News


Last time I tried llama.cpp, I got errors when running make that were way too time-consuming to bother tracking down.

It's probably a simple build if everything is how it wants it, but it wasn't on my machine; running ollama, on the other hand, just worked.


The average user isn't going to compile llama.cpp. They will either download a fully integrated application that bundles llama.cpp and can read gguf files directly, like kobold.cpp, or they will use an arbitrary front end like SillyTavern, which needs to connect to an inference server via an API; ollama is one of the easier inference servers to install and use.


Compared to “ollama pull mixtral”? And then actually using the thing is easier as well.


This will likely build a version without GPU acceleration, I think?


I was trying to get AMD GPU support going in llama.cpp a couple of weeks ago and just gave up after a while. 'rocminfo' shows that I have a GPU and, presumably, rocm installed, but there were build problems I didn't feel like sorting out just to play with an LLM for a bit.

Kudos if Ollama has this sorted out.


It builds with Metal support on my M2 Mac.


Check out EasyDiffusion.



