cj00's comments

cj00 · 2026-04-06T02:19:13 1775441953

Yeah, searching your history is so terrible too. I ended up making a custom tool that takes the also-horrible Takeout output and parses it into a SQLite DB. I end up relying on it when I remember some video I started watching weeks ago but can’t remember where it was anymore.
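For readers curious what such a pipeline might look like, here is a minimal sketch, not the commenter's actual tool. It assumes the Takeout export is a JSON array whose entries carry `title`, `titleUrl`, and `time` fields; the table name, schema, and function names are hypothetical.

```python
import json
import sqlite3

def load_watch_history(json_text, conn):
    """Insert Takeout-style watch-history entries into a SQLite table.

    Assumed (not confirmed by the thread): each entry is a dict with
    "title", "titleUrl", and "time" keys. Missing keys become NULL.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS watch_history ("
        "title TEXT, url TEXT, watched_at TEXT, "
        "UNIQUE(url, watched_at))"
    )
    rows = [
        (e.get("title"), e.get("titleUrl"), e.get("time"))
        for e in json.loads(json_text)
    ]
    # INSERT OR IGNORE skips rows that already exist, so importing a
    # newer export on top of an older one does not create duplicates.
    conn.executemany(
        "INSERT OR IGNORE INTO watch_history VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)

def search(conn, term):
    """Substring search over titles, newest first."""
    cur = conn.execute(
        "SELECT title, url, watched_at FROM watch_history "
        "WHERE title LIKE ? ORDER BY watched_at DESC",
        (f"%{term}%",),
    )
    return cur.fetchall()
```

Because the import is idempotent, re-running it on each fresh export would be one simple way to keep the database current.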

owlboy · 2026-04-06T02:55:15 1775444115

Do you automate your takeout so the DB is relatively fresh?

cj00 · 2026-04-01T01:44:26 1775007866

/buddy is live and I got a different result than in this app.

fatcullen · 2026-04-01T03:44:02 1775015042

Huh, weird. They must have changed the algorithm due to the leaks. That would be pretty easy: there's a constant seed variable, so they'd just need to change that. I figured they might. Too bad, sorry this didn't work out.

dtran · 2026-04-01T04:14:09 1775016849

Ah yea, I got an uncommon dragon instead of the rare duck. Did you get your legendary?

fatcullen · 2026-04-01T21:18:23 1775078303

Nope :| Still a ghost, but it's just common now, too bad

cj00 · 2026-03-23T15:08:42 1774278522

It’s 400B, but it’s a mixture of experts, so how many parameters are active at any time?

simonw · 2026-03-23T15:10:04 1774278604

Looks like it's Qwen3.5-397B-A17B so 17B active. https://github.com/Anemll/flash-moe/tree/iOS-App

thecopy · 2026-03-23T17:23:32 1774286612

Stupid question: can I run this on my 64GB/1TB Mac somehow easily? Or does this require custom coding? 4-bit is ~200GB.

EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App
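The ~200GB figure follows from simple arithmetic: parameter count times bits per weight, divided by 8. A rough sketch, assuming 1 GB = 1e9 bytes and ignoring quantization metadata, KV cache, and runtime overhead (the function name is illustrative):

```python
def weight_footprint_gb(total_params_billion, bits_per_weight):
    """Approximate weight storage in GB, weights only, no overhead."""
    return total_params_billion * bits_per_weight / 8

size = weight_footprint_gb(397, 4)
print(f"{size:.1f} GB")  # 198.5 GB
print(size > 64)         # True: the weights alone exceed 64GB of RAM
```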

Aurornis · 2026-03-23T18:16:17 1774289777

Running larger-than-RAM LLMs is an interesting trick, but it's not practical. The output would be extremely slow and your computer would be burning a lot of power to get there. The heavy quantizations and other tricks (like reducing the number of active experts) used in these demos severely degrade the quality.

With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B. I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.

kgeist · 2026-03-23T20:16:56 1774297016

>I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.

There are dynamic quants such as Unsloth which quantize only certain layers to Q4. Some layers are more sensitive to quantization than others. Smaller models are more sensitive to quantization than the larger ones. There are also different quantization algorithms, with different levels of degradation. So I think it's somewhat wrong to put "Q4" under one umbrella. It all depends.

Aurornis · 2026-03-23T20:24:10 1774297450

I should clarify that I'm referring generically to the types of quantizations used in local LLM inference, including those from Unsloth.

Nobody actually quantizes every layer to Q4 in a Q4 quant.

freedomben · 2026-03-23T18:36:31 1774290991

I've tried a number of experiments, and agree completely. If it doesn't fit in RAM, it's so slow as to be impractical and almost useless. If you're running things overnight, then maybe, but expect to wait a very long time for any answers.

zozbot234 · 2026-03-23T18:43:13 1774291393

Current local-AI frameworks do a bad job of supporting the doesn't-fit-in-RAM case, though. Especially when running combined CPU+GPU inference. If you aren't very careful about how you run these experiments, the framework loads all weights from disk into RAM only for the OS to swap them all out (instead of mmap-ing the weights in from an existing file, or doing something morally equivalent as with the original MacBook Pro experiment) which is quite wasteful!

This approach also makes less sense for discrete GPUs where VRAM is quite fast but scarce, and the GPU's PCIe link is a key bottleneck. I suppose it starts to make sense again once you're running the expert layers with CPU+RAM.
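To illustrate the mmap idea mentioned above, here is a minimal sketch using Python's standard `mmap` module. The flat little-endian float32 file layout and the function name are assumptions for illustration only; real frameworks use their own weight formats.

```python
import mmap
import os
import struct
import tempfile

def read_float32_slice(path, offset_floats, count):
    """Read `count` float32 values starting at `offset_floats` via mmap.

    The file is mapped rather than read in full, so only the pages
    backing the requested slice need to be faulted in from disk.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = offset_floats * 4  # 4 bytes per float32
            return struct.unpack_from(f"<{count}f", mm, start)

# Demo with a small stand-in "weights" file of 1000 floats.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(struct.pack("<1000f", *range(1000)))
print(read_float32_slice(tmp.name, 500, 3))  # (500.0, 501.0, 502.0)
os.remove(tmp.name)
```

With a mapping like this, the OS page cache decides what stays resident, and clean read-only pages can be dropped under memory pressure instead of being written to swap.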

anemll · 2026-03-23T18:39:39 1774291179

Yes, SSD speed is critical though. The repo has macOS builds for CLI and Desktop. It's early stages though. M4 Max gets 10-15 TPS on 400B depending on quantization. Compute is an issue too; a lot of code is PoC level.

jnovek · 2026-03-23T18:03:50 1774289030

I have a 64G/1T Studio with an M1 Ultra. You can probably run this model to say you’ve done it but it wouldn’t be very practical.

Also, I wouldn’t trust 3-bit quantization for anything real. I run a 5-bit qwen3.5-35b-A3B MoE model on my Studio for coding tasks, and even the 4-bit quant was more flaky (hallucinations, and sometimes it would think about running tool calls and just not run them, lol).

If you decide to give it a go, make sure to use the MLX version over the GGUF version! You’ll get a bit more speed out of it.

stingraycharles · 2026-03-24T06:29:04 1774333744

One expert is 17B, but more than one expert can be active at any time. I believe it’s actually more like 80B active.

zozbot234 · 2026-03-24T06:57:19 1774335439

I don't think this is correct. "Active parameters" is quite unambiguous: it means the sum of all active experts plus shared parameters.

fouc · 2026-03-24T11:41:49 1774352509

Looks like they meant “effective dense size”, which is the square root of total params × active params, so in this case sqrt(397 × 17) ≈ 82.
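The arithmetic can be checked in a couple of lines. This is only the rule of thumb as stated in the comment (a geometric mean of total and active parameters), not an exact law; the function name is illustrative.

```python
import math

def effective_dense_size(total_b, active_b):
    """Geometric mean of total and active parameter counts, in billions."""
    return math.sqrt(total_b * active_b)

print(round(effective_dense_size(397, 17), 1))  # 82.2
```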

zozbot234 · 2026-03-24T13:56:58 1774360618

But the claim that "one expert is 17B" is incorrect. Experts are picked with per-layer granularity (expert 1 for layer X may well be entirely unrelated to expert 1 for layer Y), and the individual layer-experts are tiny. The writeup for the original experiment is very clear on this.
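A toy sketch of per-layer routing may help here. Everything in it is hypothetical: random scores stand in for a learned router, and the layer counts, expert counts, and parameter sizes are made up rather than taken from the model discussed.

```python
import random

def route_token(num_layers, experts_per_layer, top_k, rng):
    """Pick top_k experts independently at every layer for one token."""
    chosen = []
    for _ in range(num_layers):
        # Random scores stand in for a learned router's output.
        scores = [rng.random() for _ in range(experts_per_layer)]
        ranked = sorted(
            range(experts_per_layer), key=lambda i: scores[i], reverse=True
        )
        chosen.append(sorted(ranked[:top_k]))
    return chosen

def active_params(shared, per_expert, num_layers, top_k):
    """Active count = shared params + every layer's selected experts."""
    return shared + num_layers * top_k * per_expert

routes = route_token(num_layers=4, experts_per_layer=8, top_k=2,
                     rng=random.Random(0))
for layer, experts in enumerate(routes):
    print(f"layer {layer}: experts {experts}")
```

Each layer makes its own selection, so there is no single "expert" spanning the whole network, and the active count is the sum of the small per-layer experts chosen plus the shared parameters.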

stingraycharles · 2026-03-24T17:43:20 1774374200

Ok I am by no means an expert on this and I immediately stand corrected. But as I understand it, in order to understand the amount of active memory that’s required, it’s more accurate to go by the ~82B number, right?

zozbot234 · 2026-03-24T18:36:09 1774377369

The ~82B figure is an attempt to compare performance to an equivalent dense model. The number of active parameters is given by the ~17B.

Hasslequest · 2026-03-23T17:59:45 1774288785

Still pretty good considering 17B is what one would run on a 16GB laptop at Q6 with reasonable headroom

anshumankmr · 2026-03-23T16:24:24 1774283064

Aren't most companies doing MoE at this point?

cj00 · 2025-10-20T17:23:35 1760981015

Yeah, networking issues cleared up for a few hours but now seem to be as bad as before.