Absolutely. Difference in Q6 vs Q8 is not as immediately noticeable, but if I test by starting from a blank slate context and giving it the same complicated task with Q4 vs a Q8 GGUF file loaded, the difference is apparent. The Q4 will struggle or do 'stupid' things with even simple bash or python. Q4 might not be as noticeable for conversational purely text one on one interaction with an LLM, but when you dig deeper into something that's more esoteric in a training dataset than a chat conversation, absolutely a big gap there.
I think some of the folks in the local llm social media communities are using them for things like company-hosted customer service chat bots, or purely english text writing stuff where Q4 will probably not cause a problem. For more discrete technical work I stick pretty much exclusively to Q8.
I have not spent a lot of time running FP16 'full precision' versions of some things, but as the other commenter says, it's not much difference. There's a really wide array of benchmarks and tests from a lot of third parties unrelated to the trainer of the AI models that shows at most a two percent difference in score and capability between BF16 and Q8.
Q8 quant is very minimal fall off in terms of KLD against the lab 16 bit. If you have the memory for BF16 KV-cache (which is usually easier to stomach) then the Q8 is very close. But even Q8 quant model with Q8 KV-cache is very close.
Smaller quants for the model start to fall off but more importantly, smaller KV-cache quants fall off much faster so avoid less than Q8 there.
It’s not a general rule, and depends highly on the model and the quantisation used. Don’t guess, Unsloth sometimes publish graphs in their tutorials showing the error rate vs file size… sometimes Q4 is great, other times I go for Q6
Hi. Is there a significance to that date or that commit? The commit doesn't look very special, and Git was apparently being used from early April already: https://en.wikipedia.org/wiki/Git
I have always said please and thank you to LLMs, not to increase accuracy or because I'm stupid. I believe it is more about me than about the LLM, and this is anyway a habit I don't want to lose.
Thomas Aquinas believed cruelty to animals was wrong not because animals have souls (and with that all the standard moral rights), but because it can teach us cruelty to other humans.
Google searches being keyword based, rather than simulated conversations?
The same reason you wouldn't put in an entire actual question/sentence, unless you either don't know how to use Google, are pissed off, or have an actual reason to suspect that it would yield proper hits (e.g. looking up an excerpt).
It is rather hard to lose of habit of using search engine with keywords given the change took place without much fanfare. I have no problem using sentences with the current ai tools through.
I didn't used to but I do now that the searches go straight to an LLM. I almost always find the model output to be much more useful than the list of search results.
I don't. I was recently doing some searching for information I thought AI would be good for: fuzzy natural language search with some conditions. And it was, but ...
Gemini at least is not great at citing and picking sources. Or providing multiple sources for the same thing.
It tends to stop at threes. So if you want more, you have to prompt it uselessly, like: "any more?"
I searched for "Hey Google" and got this in response:
Hey! I'm here and ready to help. What’s on your mind today? Whether you need to look up information, plan a trip, or get things done, just let me know!
Is it worth getting worse results for that reason? From the article:
"Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation. "
I am not polite to LLMs because I do not want to anthropomorphise them.
I guess it's about habit. In the end you are communicating. If I get into the habit of being rude while communicating with a machine, I would be afraid of this habit spilling over to my communication with other humans.
I also remember reading a long time ago someone who wrote that they wanted to be polite to an LLM because after they prompted it to learn about whether politeness was good for improving accuracy of responses, they got a message that led them to conclude that politeness could probably help. It seems a bit odd then because I have heard so much about how people use LLMs' responses about themselves to learn about LLMs themselves, but that seems like it is a suspicious approach.
Say you put the current time down to the second in the system prompt, which is the message that goes in front of the entire conversation, then basically nothing will be cached, every agent turn needs to ingest the entire session over and over. Contrast to not doing that, and the backend can leverage caching all the way up to the latest message, as nothing until then changed.
I don't think any the agents breaks caching on every turn, but they might do things like current list of files, or available tools depending upon plan/build mode... or lots of other things that breaks caching multiple times during a session.
Probably not that exactly, but there is a tradeoff between effectiveness of the prompt and cache hit rate. If putting the user’s datetime in the middle of the prompt scores higher on evals but worsens cache hits, versus at the end of the prompt where it’s cache friendly but may not be as effective, what do you do?
This is still art as much as science and the different harnesses take different approaches.
Obviously not, most agents properly keep previous messages unchanged, at least the major ones I've been digging into the source off. Also, everything would get so much slower, that even developers creating their own agents would notice quickly how much slower theirs is, if they fuck this up.
> Anthropic offers a formal copyright indemnification policy for its enterprise customers using the Claude API. The policy protects businesses from copyright infringement claims arising from authorized use of Claude or its generated outputs
I've used Claude to fix/reconstruct & build leaked Win2k3 on Linux with original toolchain via Wine. This approach included full gdi sources reconstruction. I just don't know what to do with this, it's kinda difficult to "wash" on this scale
reply