Hacker News | pixelesque's comments

It wasn't "all their source code", it was the source code to Claude Code: not really any of their internal secret sauce, at least directly.

it wasn't stolen either. an employee accidentally included a source map file with the release.

Only with things like -ffast-math enabled will compilers do the reciprocal. It can make a fair difference in some cases, but it's often better to selectively use it in code locations you know are acceptable by doing it manually in the code.
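
To illustrate why the compiler can't make that substitution by itself: multiplying by a precomputed reciprocal is not bit-identical to dividing. A minimal sketch in Python (whose floats are IEEE doubles, so the same rounding applies as for `double` in C); the divisor 255 is just an example value I picked, not something from the comment above:

```python
# Compare x / c against x * (1 / c) for every 8-bit value.
# Without -ffast-math a C/C++ compiler has to keep the division, because the
# two expressions can round differently in IEEE floating point.

DIVISOR = 255.0            # example divisor, chosen for illustration
RECIP = 1.0 / DIVISOR      # the "manual reciprocal" you would hoist yourself

mismatches = [x for x in range(256) if x / DIVISOR != x * RECIP]
max_error = max(abs(x / DIVISOR - x * RECIP) for x in range(256))

print(f"{len(mismatches)} of 256 values differ, max abs error {max_error:.3e}")
```

The differences are at most one unit in the last place, which is why doing it by hand is fine in code where you know that error is acceptable.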

As someone with a lot of experience in this area doing image processing and rendering for VFX (including writing image readers and writers for my own software and commercial VFX software), I think you might be forgetting that colourspace conversion (to sRGB 'linear' rec709 for old-school SDR, or other, wider gamuts for newer formats) would happen after this, so the 'squish' of the dynamic range would happen after loading.

Also, a lot of workflows for image processing and compositing do assume that 0 means zero, whether correctly or not (often incorrectly). So there are often assumptions that for 8-bit, 0u maps to 0.0f and 255 maps to 1.0f for things like masking or alpha: as soon as you have 0 values which become just over 0.0, you then have artifacts because some code somewhere is using a hard threshold of 0.0 to mask some other operation, and vice-versa for 1.0 with alpha, where suddenly because the 255 values are no longer 1.0f, you have very slightly see-through objects (often only visible in certain situations or when pixel-peeping) after pre-multiplication.

(Same thing can happen when 254 becomes 1.0f after +0.5 with masking).
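
A small sketch of the failure mode, assuming the bin-centre scheme decodes a code value v as (v + 0.5) / 256. That formula is my reading of the proposal rather than something quoted from it, and the function names are made up for illustration:

```python
# Two ways to decode an 8-bit code value v (0..255) to a float.

def decode_endpoints(v: int) -> float:
    """Conventional mapping: 0 -> 0.0 and 255 -> 1.0 exactly."""
    return v / 255.0

def decode_bin_centre(v: int) -> float:
    """Assumed bin-centre mapping: each code is the middle of a 1/256 bin."""
    return (v + 0.5) / 256.0

def premultiply(colour: float, alpha: float) -> float:
    return colour * alpha

for name, decode in (("endpoints", decode_endpoints),
                     ("bin centre", decode_bin_centre)):
    transparent = decode(0)
    opaque = decode(255)
    # A hard threshold like this is common in masking code.
    masked_out = transparent <= 0.0
    print(f"{name}: alpha(0)={transparent!r} masked_out={masked_out} "
          f"alpha(255)={opaque!r} premult(1.0)={premultiply(1.0, opaque)!r}")
```

With the bin-centre decode, a "fully transparent" pixel no longer passes the `<= 0.0` test and a "fully opaque" one premultiplies to slightly less than the original colour.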


I think more to the point, if 0 doesn't represent 0.0, and 255 doesn't represent 1.0, congratulations you've just lost your additive and multiplicative identities and most of the math used in colors falls apart.

The argument for 0-256 feels compelling when thinking about the physical display, but it seems like a very poor fit for any digital image processing or rendering.


> Remember how the 0 and 255 bins poked slightly beyond the [0,1] range’s edges? In the standard approach, the range of representable values is actually [−0.5/255, 255.5/255], meaning the bins are spaced further apart than strictly needed for [0,1] inputs

This is of course silly: the "range of representable values" of floating point colour components is [0,1] independent of quantization and how an invalid input would be quantized is irrelevant.

Looking at the actual "big picture" there are 256 representable values and (taking into account gamma correction, arbitrary ranges other than [0,1], deliberately nonuniform quantization bins, and other plausible complications) their correspondence to 256 floating point values should be regarded as a generic lookup table, abandoning all hope of using elegant and cheap formulas and making it obvious that encoding and decoding differently is not an option.
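
As a rough sketch of the lookup-table view: one 256-entry table defines decoding, and encoding is just "find the nearest table entry", so the two cannot disagree. The sRGB transfer curve is used here only as an example of a non-trivial mapping; the function names are illustrative assumptions:

```python
import bisect

def srgb_to_linear(c: float) -> float:
    """Standard sRGB decoding curve for c in [0, 1]."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4

# One table is the single source of truth for code value -> float.
DECODE_LUT = [srgb_to_linear(v / 255.0) for v in range(256)]

def decode(v: int) -> float:
    return DECODE_LUT[v]

def encode(x: float) -> int:
    """Return the code whose table entry is nearest to x (table is sorted)."""
    i = bisect.bisect_left(DECODE_LUT, x)
    if i == 0:
        return 0
    if i == len(DECODE_LUT):
        return len(DECODE_LUT) - 1
    # Pick whichever neighbour is closer.
    return i if DECODE_LUT[i] - x < x - DECODE_LUT[i - 1] else i - 1

# Every code value must survive a decode/encode round trip.
round_trip_ok = all(encode(decode(v)) == v for v in range(256))
print("round trip ok:", round_trip_ok)
```

Any other monotonic table (different gamma, different range, nonuniform bins) drops into `DECODE_LUT` without touching `encode`.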


good point - alpha is a notable exception, it is not luminance

If you're a beginner, or just want something which works quickly, sure.

However OIIO is far from perfect in all situations (having had to debug and fix issues with its mip-map generation filtering code in the past), so don't always assume that just because there's a mature open source library out there doing something that it's always perfect.


sure of course nothing is perfect and oiio has a lot of surface area / is still oss. thats good advice.

ive just seen a lot of "ai researchers" who are getting into professional image processing and are both beginners and want things quickly and so could do much worse than just starting from what they get out of oiio. especially for a lot of the non-obvious stuff (more of that in color handling than just the io stuff though)


Very likely, but isn't this post claiming that bijou64 is safer than LEB128 for the situation of adversarial varints?

Why are people down-voting this?

It's generally correct: if you're self-employed / a sole trader and operating outside IR35, there's no way HMRC can know how much you were paid as they don't have the info, so they can't know how much tax you owe.

In other situations for payroll / salary (like PAYE for example) they do have the info, as companies have to submit it, so generally people in those situations don't have to submit tax returns (unless they have significant capital gains).

I do think it's a bit annoying you have to declare tax on interest since 2016 if it's over £1000 - previously banks would take it out automatically, and this is still done in other countries (NZ for example).


Don’t let facts get in the way of a good narrative it seems.

Out of interest, what machine and model are you running it on?

I tried the qwen3.6-27b Q6_k GGUF in llama.cpp and LM Studio on my M2 MacBook Pro 32GB machine last week, and I barely get a token a second with either.

What sort of speed should I be expecting?

I tried some of the Llama 3 34b (nous-capybara?) models two years ago with llama.cpp, and I seem to remember getting a few tokens a second then, so not sure if I've got something completely mis-configured, or I just have unreasonable expectations.

Or maybe qwen 3.x is slower for some reason? (Is it mixture of experts?)

I'm not expecting it to be instant, but what I'm currently seeing is not really usable.


There are two flavors of Qwen 3.6:

- A 27B "dense" model

- A 35B "Mixture of Experts" model, which activates only 3B parameters for each token.

For your hardware, I strongly recommend `unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M`. I have an M1 Max with 32GB VRAM from 2021 that can read at ~300-500 tokens/sec and write at ~30 tokens/sec with llama-cpp's default settings, which is plenty fast. The 27B model can read ~70tok/sec and write ~5tok/sec.

The 35B MoE model technically takes slightly more memory but is much faster because it's doing 1/9th the work. It's not quite as "smart", but it's comparable.
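
A back-of-envelope sketch of where the "1/9th the work" speedup comes from, assuming token generation is limited purely by reading the active weights from memory once per token. The bandwidth and bits-per-weight constants are illustrative assumptions, not measurements, and real throughput lands well below this ceiling:

```python
# Upper bound on generation speed if every active weight is read once per token.
# Both constants are assumptions for illustration only.

BANDWIDTH_GB_S = 400.0    # assumed memory bandwidth
BITS_PER_WEIGHT = 4.8     # assumed average for a 4-bit K-quant

def max_tokens_per_second(active_params_billion: float) -> float:
    bytes_per_token = active_params_billion * 1e9 * BITS_PER_WEIGHT / 8
    return BANDWIDTH_GB_S * 1e9 / bytes_per_token

dense_27b = max_tokens_per_second(27.0)   # every weight is read per token
moe_a3b = max_tokens_per_second(3.0)      # only ~3B active weights per token

print(f"27B dense ceiling: {dense_27b:.1f} tok/s")
print(f"35B-A3B ceiling:   {moe_a3b:.1f} tok/s")
print(f"ratio: {moe_a3b / dense_27b:.1f}x")
```

The absolute numbers are only a ceiling; the ratio between the two is the useful part.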


For coding tasks 27B is reported to be much more effective, although you can probably only run 4-bit or 5-bit quants with this much memory.

Recommend https://www.reddit.com/r/LocalLLaMA/ as a great source for this type of discussion.


I played around with local LLMs on my M4 Max 64GB this weekend and this is exactly what I found. I put Opus 4.7 "head to head" on the same task as Qwen 3.6 and a few other local models. The 35B did not perform well IME - it needed a lot of handholding and even then the final result did not work until a few more tweaks, while Claude one shot the task. The 27B was much better and also one shot the task, but took about 55min as opposed to about 15min for Claude. The 27B is probably something that I could happily run for many use cases if I had some faster hardware... the main problem there seems to be that at larger context sizes, prompt decoding can take several minutes.

This matches my experience too. The little a3b model is quite capable for its size class, as is the 27B model, but it’s still an order of magnitude less effective than Claude on the “effectiveness / time” curve

Using omlx on the M1 max I get about 15tps from 27b

interesting! I might give omlx another chance, thank you

Thank you - I'll give that a go!

May I ask why the M instead of XL?

Obviously bigger != better but I don't know what the differences are.


These are dynamic quants, and they're basically just an indication of how far away from the desired quant it is allowed to go to achieve the goal. Generally, unsloth's toolchain moves quants up, rarely down.

* _0 and _1 do not use K quants and scale 32x32 blocks according to the original (B)F16 values; _0 scales the block using the original max and min values. _1 does this per row instead of per block.

* K quants do something similar, but split blocks into subblocks inside a superblock, where the superblock has min/max scaling, but the subblocks also have scaling in the range of the superblock's scaling and are stored using fewer bits.

* K's M, L, XL are just how aggressively the subblocks and their scaling factors are chosen. Generally, it puts a max on how far you can deviate from the chosen quant to maintain the desired quality, but also gives them a bigger budget to perform that excursion in. XL most aggressively tries to preserve the intended quality, while S does the least.

* Dynamic quant on top of this scales entire layers, full of blocks, according to how much they affect various measurements (such as KLD and perplexity).

That said, there is no reason K_S is even produced by anyone, same with Q_0, Q_1, and I_NL. People should no longer be using those. M is only meaningful if you're trying to restrict the upper bounds: K_XL can reach BF16 for some weights, but rarely; people think this has a speed implication for hardware that has native 8-bit in their tensor units (but it doesn't).

Unless you're specifically trying to cure a problem, stick with K_XL.
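
For anyone who wants a feel for the basic block-scaling idea behind all of these, here is a deliberately simplified sketch: one scale per block of weights plus a small integer per weight. It is not llama.cpp's actual storage layout; the block size, the symmetric -7..7 range, and the function names are assumptions made for illustration:

```python
import random

BLOCK_SIZE = 32   # weights per block (assumed, for illustration)
QMAX = 7          # signed 4-bit values used here: -7..7

def quantize_block(block):
    """Return (scale, codes): one float scale plus a small int per weight."""
    absmax = max(abs(w) for w in block)
    scale = absmax / QMAX if absmax > 0 else 1.0
    codes = [max(-QMAX, min(QMAX, round(w / scale))) for w in block]
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * q for q in codes]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.6f} max error={max_err:.6f} (bound {scale / 2:.6f})")
```

The error per weight is bounded by half the block's scale, so one outlier in a block makes every other weight in that block coarser. Finer-grained sub-block scales are one way to limit that.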


You seem to understand this stuff pretty well, any recommendations on resources (blogs, YouTube channels, whatever) for software engineers that want to keep up with this stuff on this kind of level?

A lot of the content about AI out there is kind of produced to the lowest common denominator. Basically a never ending scheme of get rich quick/passive income kinds of AI content.


Unsloth’s guides on getting various models running are great starting-off points for the “practitioner’s side” of things. Note that they include settings for llama-cpp, ollama, and other runtimes in addition to their own “unsloth studio” (their product seems like overkill imo).

If you’re curious about what a particular switch does, clone the llama-cpp repository to your computer and try asking your favorite pet rock prompts like “This is llama-cpp. Can you look at what the -ctk parameter does and explain to me?” Giving Claude/codex/whatever access to the actual code goes a long way, but it is just one opinion.

If you’d like to learn how transformer-based language modeling works in detail, I suggest starting with chapter 0 or 1 of https://arena-chapter0-fundamentals.streamlit.app/ depending on your skill level, then use that to work your way to reading research papers.

Graduate students who study these topics are generally as annoyed by the “get rich quick” style of advertising as you are, so the deeper you go toward academic research the quieter those voices tend to get, mercifully. That said, this is balanced by the unfortunate fact that top labs have strong posturing signals they try to send, so it can be hard to see which preprints actually have good ideas, which are trying to promote their group’s tech instead of doing science out of curiosity, and which have authors who’ve innocently deluded themselves into overfitting their own pet projects. Read widely but adversarially, test everything but hold fast to the good stuff, etc etc


Hey some of us are on hardware (gfx906 based Radeon MI50s with 32GB of stupidly fast VRAM and basically no compute) that runs inference significantly faster with Q_0 and Q_1 quants

Vega... unfortunately kinda sucks.

It's not amazing at compute (yet is a member of the GCN family, which I have been a fan of since its inception) and ended up being too expensive for perf/$ and perf/watt.

The only thing it did was make Nvidia rush Series 10 out the door and make it too good. Nvidia has been unable to live up to the gen-to-gen uplift Series 10 did, all because AMD made Nvidia blink.

Basically, you're 2 gens too early. CDNA2/gfx90a is the minimum you need to get any meaningful performance out of inference, or maybe CDNA1/gfx908 if you really don't need to quantize at all.

BTW, I did suggest this elsewhere in this HN story, but have you tried just disabling KV quant entirely? That is a huge speed uplift for compute-poor users.

Also, llama.cpp's support for gfx906 is probably never going to be as good as it is for other cards, and good ROCm support for cards before they rebooted the driver/stack team is probably never going to materialize. I don't see the point in hanging onto them.

Like, if I was in your place, replacing it with even a 9060xt, with half the RAM, would be a step up. They go for $450. People have been building dedicated inference machines with these and they've been amazing, just throwing 3 or 4 in, and scaling VRAM to meet needs.


I'd have to try the KV cache trick but folks get pretty competitive speeds with the current 31B/27B dense models e.g. https://www.reddit.com/r/LocalLLaMA/comments/1tc9j6u/mi50s_q...

If your hardware fits K_M but not K_XL, should you prefer going down to a lower quantization’s XL or sticking to the higher quant’s Q_M?

The correct answer should be "try it!"

But as models are starting to pack more information into fewer bits, some weights are just going to end up becoming super important and very sensitive to quant. So, I'd just move down a Q size, and continue with K_XL. Like, I'm betting Q3_K_XL will beat Q4_K_M on any given model in real world testing, even though it's ~20% smaller, but perform worse on benchmaxxing.

The only exception I could think of is quantizing small models: in my testing on Gemma E2B/E4B and Qwen 3.5 9B, quantizing at all was super noticeable... they can't spread the error across more weights.

Good news (at least for me), 24GB of VRAM is enough to store either of those in BF16 and then a ton of room for F16/F16 KV cache.


MTP recommended

on my M1 Max, MTP consistently lowers my performance! I’ve tried both llama-cpp’s recently landed MTP support (cloned and built Tuesday) as well as one of the other forks a few weeks ago. Suspect nobody’s done a comparison on hardware like mine.

I recommend sticking with the dense models for both Qwen and Gemma.

On testing I've done on same-quant apples to apples, with F16/F16 (ie, unquantized) kv cache, 35B-A3B underperforms against 27B on anything even remotely complex. But yes, 35B-A3B can be like 3-4x faster on my hardware.

By Qwen's own admission, on any meaningful benchmark (ie, ones that involve logic, math, or tool calling), 27B performs like 122B-A10B and 397B-A17B, but 35B-A3B is somewhere between 27B dense and 9B dense.

Also, MTP recently got merged in, so I'd suggest downloading Qwen 3.6 MTP (I assume you get it from unsloth) and updating your copy of llama.cpp, and adding `--spec-type draft-mtp --spec-draft-n-max 2` to your arguments.

https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/ https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/

Also, I recommend not quantizing kv cache, and if you do, only quantize v. Lowering model quant while also lowering context size to fit F16/F16 or F16/Q8_0 massively improves model performance for thinking models. Also, quantizing cache, either k or v, decreases speed by a lot on some hardware.

I have a 24gb 7900xtx, so I can fit >32k F16/F16 context with Qwen3.6-27B, but use unsloth's Q3_K_XL. This performs better than Q(4,5,6)_K_XL with v quantized.

Edit: Oh, and since I mentioned Gemma 4, my testing mirrors my Qwen 3.5/3.6 experiences, 26B-A4B performs worse than 31B, but is also way faster. llama.cpp doesn't support Gemma 4's MTP style yet, so both could get even faster.
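
To make the VRAM trade-off between model quant, context size, and KV cache type a bit more concrete, here is a rough size estimate. It assumes plain grouped-query attention in every layer, which may overestimate for hybrid architectures, and the model shape is hypothetical, NOT the real Qwen3.6-27B configuration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   k_bytes_per_elem, v_bytes_per_elem):
    """K and V each store n_kv_heads * head_dim values per layer per token."""
    per_token_elems = n_layers * n_kv_heads * head_dim
    return per_token_elems * n_ctx * (k_bytes_per_elem + v_bytes_per_elem)

F16 = 2.0          # bytes per element
Q8_0 = 34 / 32     # 32 int8 values plus one 2-byte scale per block

# Hypothetical model shape, for illustration only.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)

for label, k, v in (("F16/F16", F16, F16), ("F16/Q8_0", F16, Q8_0),
                    ("Q8_0/Q8_0", Q8_0, Q8_0)):
    gib = kv_cache_bytes(**shape, k_bytes_per_elem=k, v_bytes_per_elem=v) / 2**30
    print(f"{label}: {gib:.2f} GiB")
```

Plugging in a real model's layer count, KV head count, and head dimension shows how many GiB a smaller model quant has to free up to keep the cache at F16.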


    I tried the qwen3.6-27b Q6_k GGUF in llama.cpp
    and LM Studio on my M2 MacBook Pro 32GB machine 
    last week, and I barely get a token a second with either.
The fact that it was this slow makes me suspect it's a matter of insufficient free RAM. The entire model needs to fit into RAM (and stay there the entire time) for acceptable performance.

(not sure of exact diagnosis/fix, but definitely look in that direction if you're still having this issue when you give it another shot)

Also, there are two stages - prompt processing, and token generation. Prompt processing is notoriously slow on Apple Silicon unfortunately. If you have large context (which includes system prompts, lots of tools loaded by a harness like Claude Code, OpenCode, etc) it can take minutes for prompt processing before you see the first output token. On the bright side, the tokens are cached between turns, so subsequent turns won't be so bad.


You are using Q6 (6-bit) quantization; on my 32GB Mac mini I use Q4 and it is faster, but when I use it with OpenCode, I set up a task and go outside to walk for ten minutes. Smart, capable, and slow. Still, I love using local models.

EDIT: I run with context wired at 64K


The 27B model is dense, so is relatively slow. The 35B-A3B model is marginally weaker but being MoE is much faster - like ~4-8x faster in basic benchmarks on my M1 Max.

For comparison, I just ran a couple of quick benchmarks (default settings) with llama-bench:

Qwen3.6-35B-A3B at Q6_K_XL gave 858 t/s pp512 (prompt processing) and 43 t/s tg128 (token generation).

Qwen3.6-27B at Q4_K_XL gave 103 t/s pp512 and 8 t/s tg128.
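
To translate those llama-bench figures into wall-clock time, here is a rough sketch. The prompt and reply sizes are made up, and it assumes throughput stays constant with context length, which it doesn't in practice, so treat the results as optimistic:

```python
def response_seconds(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Time to process the prompt, then generate the reply."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps

# (prompt processing t/s, token generation t/s) as quoted above.
models = {
    "Qwen3.6-35B-A3B Q6_K_XL": (858.0, 43.0),
    "Qwen3.6-27B Q4_K_XL": (103.0, 8.0),
}
PROMPT_TOKENS = 20_000   # hypothetical agent-style context
OUTPUT_TOKENS = 500      # hypothetical reply length

totals = {}
for name, (pp, tg) in models.items():
    first_token = PROMPT_TOKENS / pp
    totals[name] = response_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, pp, tg)
    print(f"{name}: first token after ~{first_token:.0f}s, "
          f"done after ~{totals[name]:.0f}s")
```

With a large context the prompt processing term dominates for the dense model, which matches the "minutes before the first token" experience people describe.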


Have you tried enabling MTP? Those numbers are similar to what I was getting on my Strix Halo box, but configuring/enabling MTP doubled the TG speed of the 27B model (18-20 t/s now).

Thanks - I’m in the process. I’ve tried briefly, but so far it appears marginally slower. (Noting that llama-bench doesn’t support MTP yet so you’re reduced to running different prompts and eyeballing the log.)

So I’m assuming I’ve done something wrong along the way, but I’ve not had time yet to explore it.


Thanks for the info.

27B is the dense one. Try the Qwen3.6-35B-A3B variants for the MoE release. That's what I'm running on a Framework Desktop and I get ~50 tok/s plus or minus a few. The dense one is similarly slow for me -- not sure what to expect on your hardware from the MoE but it should probably be much faster.

Thanks!

Check out Unsloth Studio: it now provides MTP support, which gives 2x the token generation speed with no loss of accuracy: https://unsloth.ai/docs/models/qwen3.6#mtp-guide

I get 150t/s peak, 120t/s avg with Qwen3.6 27B Q4 on a 4090 on Linux, now that MTP has landed in llama.cpp.

> qwen3.6-27b Q6_k

That's the dense model, you probably want a mixture-of-experts (MoE) one.

Here's what you probably want instead: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF


Thanks!

My token throughput is much better using vLLM-mlx on my M2 ultra than llama.cpp. It might be worth a shot to give it a try.

you should be using dflash with that model, look it up

Yep, same here and agree.

Compilers have definitely got better though. Another issue in the past (maybe it still is to a degree, although compilers have got a lot better at this in the past 15 years; it used to be one of the things only Intel's ICC actually got right) was that if you wrapped the base-level '__m128' or 'float32x4_t' in a struct/union in order to provide some abstraction, the compiler would often lose track of this when passing the struct/union through functions (either by value or const ref). It would often end up 'spilling' (not entirely the correct terminology in this context, but...) the variable from registers, producing asm which uselessly loaded the variable again from a stack address further up the call stack when it didn't actually need to. So that was the situation even when using intrinsics within custom wrappers.

From 2011 to around 2013 ICC seemed to be the only compiler on amd64 which wouldn't do this. If you passed the actual '__m128' down the function call chain instead, clang and gcc would then do the right thing.


Part of that could be ABI constraints. There are some surprising calling convention differences between a vector and a struct or union with vectors in it, and they vary platform to platform. E.g. on ARM a struct with two 128-bit vectors will pass in two registers where on x86 it must pass via the stack.

Using __attribute__ to tweak calling conventions can often really clean this up, but that's just as obscure and non-portable as the problem it fixes. So you either end up writing weird non-portable code one way or weird non-portable code another... Code working with these types doesn't get to benefit from zero-cost abstraction to the degree we're used to with normal scalar code.


That's a constraint of the x86 32-bit ABI.

People invented x32 to fix this. Or just use amd64.


This was with amd64.

ICC was at the time the only compiler that would not do that.


Mainly by having view-dependent (i.e. changes with the camera angle) material reflectance (diffuse colour and specular highlight).

i.e. the colour (and possibly other surface properties) vary depending on their direction, which is (or at least can be) encoded spherically (as spherical harmonics).

The width/size of each point/splat is also not just a radius, it can be anisotropic, and have an orientation in space, so again, it can vary its size depending on orientation when rendered.

It has been mildly amusing watching the AI crowd learn about point clouds though, and use things the VFX industry was using in the early 00s (spherical harmonic encoded materials - we had light-dependent as well for relighting - points with direction and anisotropic widths, etc)...
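
For the curious, a minimal sketch of what "colour encoded as spherical harmonics" means in practice, using only degree 0 and degree 1 for a single colour channel. The coefficient ordering and sign convention are assumptions for this sketch (real implementations differ), and the function names are made up:

```python
import math

# Real spherical harmonic basis constants for degree 0 and degree 1.
SH_C0 = 0.5 * math.sqrt(1.0 / math.pi)
SH_C1 = math.sqrt(3.0 / (4.0 * math.pi))

def sh_colour(coeffs, direction):
    """Evaluate one colour channel from 4 SH coefficients for a unit direction.

    coeffs = [dc, c_y, c_z, c_x]; the ordering and signs are an assumed
    convention for this sketch.
    """
    x, y, z = direction
    return (SH_C0 * coeffs[0]
            + SH_C1 * (coeffs[1] * y + coeffs[2] * z + coeffs[3] * x))

def normalise(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

flat = [1.0, 0.0, 0.0, 0.0]     # DC only: same colour from every angle
shiny = [1.0, 0.0, 0.5, 0.0]    # extra z term: brighter along +z

for view in (normalise((0, 0, 1)), normalise((0, 0, -1)), normalise((1, 1, 0))):
    print(view, round(sh_colour(flat, view), 4), round(sh_colour(shiny, view), 4))
```

Higher degrees add more coefficients per channel and let the colour vary more sharply with direction, which is how a splat can fake a specular highlight.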


> spherical harmonic encoded materials

This in particular has been hilarious for the exact reason you mentioned. For anybody curious, here's a paper from 2008 about this technique:

https://www.ppsloan.org/publications/StupidSH36.pdf


Ah, so 3DGS is a Neural method?


There is a neural method for computing 3DGS from video or a series of photographs. Rendering 3DGS uses no neural networks as far as I know.


It's not a neural method. It's just differentiable rendering and backprop


creating a 3DGS scene does require machine learning, but no, technically neural networks are not involved!


> 4. Allowing you to bypass geo-restrictions on certain content.

In theory, but as someone who uses Mullvad in the UK on a day-to-day basis on my personal laptops (not my phone; I'm using it now), I'm afraid I've found quite a downside: because Mullvad's exit IPs (at least the UK ones, but also the French and Dutch ones I've tried) are known, at the very least to companies like Cloudflare and Akamai, several sites block access when using Mullvad, returning 403s.

Santander bank for example, I can't always (sometimes I can) connect to when using Mullvad, and sometimes have to turn it off, as I get 403 responses from the bank otherwise (using Firefox).

Sometimes using IPv6 in the Mullvad settings gets around this, but more and more recently I've found it doesn't, so there are sites where I'm having to stop using Mullvad to actually access them.

(I'm still a happy customer, and 1 to 3 are still true and why I use it otherwise).


>Santander bank for example, I can't always (sometimes I can) connect to when using Mullvad, and sometimes have to turn it off, as I get 403 responses from the bank otherwise

Rotating your VPN endpoint will resolve the issue. It might take two or three tries.


What some people are doing instead is using proxy vendors that have millions of IPs around the world, including residential ones.

