> The technique implemented here consists of the scalar case of the HIGGS quantization method (Malinovskii et al., "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem", NAACL 2025; preprint arXiv:2411.17525): rotation + optimized grid + optional re-normalization, applied to KV cache compression. A first application of this approach to KV-cache compression is in "Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models" (Shutova et al., ICML 2025; preprint arXiv:2501.19392). Both these references pre-date the TurboQuant paper (Zandieh et al., ICLR 2026).
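
For anyone who wants to see what "rotation + optimized grid + optional re-normalization" looks like in the scalar case, here is a minimal sketch. It is not taken from the HIGGS, EDEN, or TurboQuant code. The assumptions are mine: the rotation is random sign flips followed by an orthonormal Walsh-Hadamard transform, the grid is the approximate 2-bit Lloyd-Max levels for a standard normal, and "re-normalization" is read as rescaling the reconstruction to the original vector norm. All function names are hypothetical.

```python
import math
import random

# Approximate Lloyd-Max reconstruction levels for a standard normal, 2 bits.
# Assumption: a real implementation would ship its own optimized grid.
GRID = [-1.510, -0.4528, 0.4528, 1.510]


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [v * s for v in x]


def random_signs(n, seed):
    """Fixed random +/-1 diagonal, shared by the quantizer and dequantizer."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def quantize(vec, signs):
    """Rotate, scale to roughly unit variance, snap each coordinate to GRID."""
    n = len(vec)
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return [0] * n, 0.0
    rotated = fwht([v * s for v, s in zip(vec, signs)])
    # After the rotation, coordinates look roughly like N(0, norm^2 / n).
    scale = norm / math.sqrt(n)
    codes = [
        min(range(len(GRID)), key=lambda k: abs(GRID[k] - r / scale))
        for r in rotated
    ]
    return codes, norm


def dequantize(codes, norm, signs, renormalize=True):
    """Look up grid values, optionally restore the norm, undo the rotation."""
    n = len(codes)
    scale = norm / math.sqrt(n)
    rotated = [GRID[c] * scale for c in codes]
    if renormalize:
        qnorm = math.sqrt(sum(r * r for r in rotated))
        if qnorm > 0.0:
            rotated = [r * norm / qnorm for r in rotated]
    # The orthonormal Hadamard matrix is its own inverse; then undo the signs.
    return [v * s for v, s in zip(fwht(rotated), signs)]
```

Usage would be something like `codes, norm = quantize(key_vector, signs)` per cached vector, storing the 2-bit codes plus one norm, and calling `dequantize(codes, norm, signs)` at attention time. The vector length has to be a power of two for this particular rotation.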
EDEN is clearly relevant prior work for HIGGS. But reducing HIGGS to “an extension of EDEN” seems unfair to the authors of HIGGS. Similar primitive, different problem setting, different constraints, different contribution.
Curious: where do you draw the line between “related prior work” and “an extension of EDEN”?
In the vLLM documentation quoted above, TurboQuant (which is a restricted version of EDEN) is referred to as a specific case of HIGGS. Note the symmetry: EDEN acts as a special case of HIGGS; hence, HIGGS functions as a generalization of EDEN.
In any case, the quantizer is indeed an extension, regardless of whether it was explicitly framed that way in the paper. I say this not to diminish their contribution at all, but just to clarify the relationship, as it was also stated in the vLLM doc.
90-98% of the time I want the LLM to only have the knowledge I gave it in the prompt. I'm actually kind of scared that I'll wake up one day and the web interface for ChatGPT/Opus/Gemini will pull information from my prior chats.
I've had Claude reference prior conversations when I'm trying to get technical help on thing A, and it will ask me if this conversation is because of thing B that we talked about in the immediate past.
All of these providers support this feature. I don’t know about ChatGPT, but the rest are opt-in. I imagine with Gemini it’ll be on by default soon enough, since it’s consumer-focused. Claude does constantly nag me to enable it, though.
Had ChatGPT reference 3 prior chats a few days ago. So if you are looking for a total reset of context, you would probably need to do a small bit of work.
Claude told me he can disable it by putting instructions in the MEMORY.md file to not use it. So only a soft disable AFAIK and you'd need to do it on each machine.
I ran into this yesterday and disabled it by changing permissions on the project’s memory directory. Claude was unable to advise me on how to disable it. You could probably write a global hook for this. Gross though.