The first 3 steps GP provided are literally just the steps for installation. The "value" you mentioned is just a packaged installer (or, in the case of Linux, apparently a `curl | sh` -- and I'd much prefer the git clone version).
On multiple occasions I've modified llama.cpp code directly and recompiled it for my own purposes. If you're using ollama on the command line, I'd say having the option to easily do that is much more useful than saving a couple of commands at install time.
It should be fairly obvious that one can find alternative models and use them in the above command too.
Look, I’m not arguing that a prebuilt binary that handles model downloading has no value over a source build and manually pulling down gguf files. I just want to dispel some of the mystery.
Local LLM execution doesn’t require some mysterious voodoo that can only be done by installing and running a server runtime. It’s just something you can do by running code that loads a model file into memory and feeds tokens to it.
More programmers should be looking at llama.cpp's language bindings than at Ollama's implementation of the OpenAI API.
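To make that concrete, here's a minimal sketch with the llama-cpp-python bindings (the model path and prompt are just placeholders; point it at whatever gguf file you've downloaded):

```python
# Minimal sketch: load a gguf file and run one completion with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048)
out = llm("Q: Name the planets in the solar system. A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```

That's the whole "runtime": a library call that loads weights into memory and samples tokens.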
There are 5 commands in that README two comments up, and 4 of them can reasonably fail (I'll give `cd` high marks for reliability). `make` especially is a minefield and usually involves a half-hour of searching the internet to figure out which dependencies are the problem today. And that all assumes someone is comfortable with compiled languages. I'd hazard most devs these days are from JS land and don't know how to debug a `make` failure.
Finding the correct model weights is also a challenge in my experience: there are a lot of alternatives, and it's often difficult to figure out what the differences are and whether they matter.
The README makes it clear that I'm probably about to lose an hour debugging if I follow it. It might be one of those rare cases where it works the first time, but that is the exception, not the rule.
Just be aware that there’s a lot of expressive difference between building on top of an HTTP API vs on top of a direct interface to the token sampler and model state.
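To illustrate: an HTTP completions endpoint hands you finished text, whereas the bindings let you walk generation token by token and steer the sampler as you go. A rough sketch with llama-cpp-python (the model path is a placeholder, and method names may shift between releases):

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf")

# Drive generation one token at a time instead of asking for a finished string.
tokens = llm.tokenize(b"Once upon a time")
for tok in llm.generate(tokens, top_k=40, top_p=0.95, temp=0.8):
    if tok == llm.token_eos():
        break
    # At each step you can inspect state, tweak sampling, or stop early.
    print(llm.detokenize([tok]).decode("utf-8", errors="ignore"), end="", flush=True)
```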
Python’s as good a choice as any for the application layer. You’re either going to be using PyTorch or llama-cpp-python to get the CUDA stuff working; both rely on natively compiled C/C++ code to access GPUs and manage memory at the scale LLMs need. I’m not fully up to speed on the current state of the game there, but my understanding is that llama.cpp’s less generic approach has let it focus specifically on optimizing performance for llama-style LLMs.
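For what it's worth, on the llama-cpp-python side the GPU part mostly comes down to building the wheel with CUDA/ROCm enabled and then telling it how many layers to offload. A rough sketch (the model path is a placeholder, and the build flags have changed names across versions):

```python
from llama_cpp import Llama

# n_gpu_layers=-1 asks to offload all layers to the GPU (assuming the wheel was
# built with CUDA or ROCm support); smaller values split layers with the CPU.
llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf", n_gpu_layers=-1)
```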
I've seen more of the model fiddling, like logits restrictions and layer dropping, implemented in Python, which is why I ask.
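By logits restrictions I mean things like this -- a library-agnostic sketch of masking the logits before sampling, not any particular project's API:

```python
import numpy as np

def sample_restricted(logits: np.ndarray, allowed_ids: list[int], temperature: float = 0.8) -> int:
    """Mask every token except the allowed ids, then sample from what's left."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    probs = np.exp((masked - masked.max()) / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```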
Most of AI has centralized around Python, and I see more of my code moving that way; for example, I'm now using LlamaIndex as my primary interface, which supports ollama and many other model loaders / APIs.
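For example, pointing LlamaIndex at a local ollama server looks roughly like this (the exact import path depends on the LlamaIndex version; older releases exposed it as `from llama_index.llms import Ollama`):

```python
# Use a locally running ollama server as the LLM behind LlamaIndex.
from llama_index.llms.ollama import Ollama

llm = Ollama(model="mixtral", request_timeout=120.0)
print(llm.complete("Why is the sky blue?"))
```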
There is no "next"; there is a whole world of people running LLMs locally on their computers, and they are far more likely to switch between models on a whim every few days.
The average user isn't going to compile llama.cpp. They will either download a fully integrated application that bundles llama.cpp and can read gguf files directly, like kobold.cpp, or they will use an arbitrary front end like Silly Tavern, which needs to connect to an inference server via an API; ollama is one of the easier inference servers to install and use.
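The API part is pretty simple from a front end's point of view; a rough sketch of a non-streaming call against ollama's default local port:

```python
import requests

# One-shot (non-streaming) generation request to a locally running ollama server.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mixtral", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```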
I was trying to get AMD GPU support going in llama.cpp a couple of weeks ago and just gave up after a while. `rocminfo` shows that I have a GPU and, presumably, ROCm installed, but there were build problems I didn't feel like sorting out just to play with an LLM for a bit.
`ollama run mixtral`
That's it. You're running a local LLM. I have no clue how to run llama.cpp
I got Stable Diffusion running and I wish there were something like ollama for it. It was painful.