I gave four local models a production A/B test to analyze — connect to Supabase, pull live experiment data, run Welch's t-test + chi-square, build charts, output a structured summary, and make a grounded ship/don't-ship recommendation.
Caveat: I don't really NEED an LLM to automate experiment analysis, nor do I think it's a good real-world LLM use case, but it made for a very interesting test of complex multi-step tool calling and hallucination resistance over a long procedural task. In short, these small models (35B parameters and under) are capable enough for such narrow agentic tasks.
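For readers who want to see what the two statistical tests in the task involve, here is a minimal standard-library sketch. It is not the benchmark's code: the function names and the sample numbers are made up for illustration, and a real analysis would more likely lean on a stats library. The chi-square p-value uses the closed form that holds only for a 2x2 table (1 degree of freedom).

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard errors of each group mean (sample variance / n).
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square (no continuity correction) for a 2x2 conversion table."""
    observed = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    total = n_a + n_b
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]
    chi2 = 0.0
    for row, n_row in zip(observed, (n_a, n_b)):
        for obs, col_total in zip(row, col_totals):
            expected = n_row * col_total / total
            chi2 += (obs - expected) ** 2 / expected
    # With 1 degree of freedom the p-value has a closed form via erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up example: 30/100 conversions in control vs 50/100 in treatment.
chi2, p = chi_square_2x2(30, 100, 50, 100)
print(f"chi2={chi2:.2f}, p={p:.4f}")
```

Welch's test is the variant that does not assume equal variances between the two arms, which is why it is the usual default for continuous experiment metrics.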
Results on my M4 Pro MacBook (48GB):
- Qwen 3.6 35B A3B (MoE): 100/100 — perfect
- Qwen 3.6 27B MTP: 90/100 — wrong completion rates
- Qwen 3.5 9B: 90/100 — same error as the 27B
- Qwopus 3.5 9B Coder (fine-tune of Qwen 3.5 9B): 60/100 first run, 80/100 on rerun — same prompt, different mistakes
A couple of interesting takeaways: even good models make the same mistakes a junior data scientist would, particularly around clickstream metric definitions, where you have to decide whether to use session-level or user-level data. And LLMs are as non-deterministic as ever: give them the same task several times and you get different results.
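To make the session-level versus user-level distinction concrete, here is a small sketch with made-up data. The row layout (one row per session, with a user ID and a completion flag) is an assumption for illustration, not the schema used in the benchmark.

```python
from collections import defaultdict

# Hypothetical clickstream rows, one per session: (user_id, session_id, completed)
events = [
    ("u1", "s1", False),
    ("u1", "s2", False),
    ("u1", "s3", True),
    ("u2", "s4", False),
    ("u3", "s5", True),
]


def session_completion_rate(rows):
    """Share of sessions that completed."""
    return sum(done for _, _, done in rows) / len(rows)


def user_completion_rate(rows):
    """Share of users with at least one completed session."""
    by_user = defaultdict(bool)
    for user, _, done in rows:
        by_user[user] = by_user[user] or done
    return sum(by_user.values()) / len(by_user)


print(session_completion_rate(events))  # denominator: 5 sessions
print(user_completion_rate(events))     # denominator: 3 users
```

The same events give two different "completion rates" because the denominators differ, so a model that silently picks the wrong level reports a number that looks plausible but answers a different question.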
The post includes the full benchmark prompt, scoring methodology, and a link to the live workbench.
I tested a few local and cloud models by asking them to build the same visual HTML animations: cherry blossom tree, solar system, ocean sunset, and wildflower meadow.
This is not meant to be a rigorous benchmark, nor is it a new concept. But it is a genuinely hard coding challenge that tests a model's capability, and the results are easy for a human to assess at a glance.
Qwen 3.6 27B (base and MTP) produced the best quality, but both are very slow on my 48GB M4 Pro MacBook. Qwen 3.6 35B A3B hits the sweet spot between speed and quality. I used the Pi coding agent as a minimal harness and llama.cpp as the most efficient inference backend.
The post includes the prompts, comparison videos, a small benchmark workbench, and a live gallery of the outputs.
Running a local AI model is still too intimidating for the regular person, and it doesn't yet have the polish and "it just works" experience of ChatGPT or Claude. In this guide, I try to simplify things as much as possible with a basic LM Studio setup. Newer small models such as Gemma 4 and Qwen 3.6 make it possible, for the first time, to get a useful setup going on a consumer laptop. This post is not for the advanced localmaxxing bros.
I've been testing local AI models on an M4 Pro with 48GB RAM for the past few weeks. Earlier in the year, small models that could run on my laptop felt like demos of Claude / Codex. The newer Gemma 4 and Qwen 3.6 releases are the first ones that felt useful enough for everyday research, coding assistance, and personal knowledge work.
The post is a practical snapshot of what changed for me: frontier AI pricing is going up, quality has been less predictable, and small local models are finally good enough to test seriously without buying GPUs.
It's time to test out what's "good enough" for personal use cases, so we can reduce reliance on high-cost, low-privacy options that we're all used to right now.
- simpler feature-based scoring using logistic regression instead of embeddings (something I understand better)
- external prompt datasets for broader validation
- a more transparent 2-axis system that seems to behave much better than the original
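As a rough illustration of what feature-based scoring with logistic regression can look like, here is a standard-library sketch. The two binary features, the labels, and the function names are all hypothetical; the actual features and training setup used in the project are not shown in this post.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def score(x, w, b):
    """Probability-like score in (0, 1) for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical features per prompt: [gives_context, states_constraints]
X = [[0, 0], [0, 1], [1, 0], [1, 1], [1, 1], [0, 0]]
y = [0, 0, 1, 1, 1, 0]  # made-up labels for illustration only
w, b = fit_logistic(X, y)
print(w, b, score([1, 1], w, b))
```

The appeal over embeddings is that each weight maps to a named feature, so it is possible to see why a given prompt received its score.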
It runs locally and doesn't upload your prompt data anywhere. Point your agent at the repo to verify that yourself.
Would especially love feedback from people who have worked on behavioral measurement, NLP evaluation, or human/AI interaction. I'm definitely not a domain expert. One of the main things I wanted to document here was the difference between "AI helped me ship a prototype fast" and "this is actually a sound measurement system."
The most interesting part is that the first version wasn’t just noisy — some of the personas were structurally meaningless because the axes were correlated and one outcome was unreachable. That feels like a good reminder that in measurement work, a simpler model with cleaner data can be more honest than a fancier one with embeddings.
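The two failure modes described here (correlated axes and an unreachable outcome) are cheap to check for. The sketch below uses made-up axis scores and a hypothetical 0.5 cutoff for "high" on each axis; it is one possible sanity check, not the validation the project actually runs.

```python
import math
from itertools import product


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def reachable_quadrants(xs, ys, threshold=0.5):
    """Return the (axis_1_high, axis_2_high) combinations seen in the data."""
    return {(x >= threshold, y >= threshold) for x, y in zip(xs, ys)}


# Made-up axis scores where the second axis mostly tracks the first.
axis_1 = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 0.7]
axis_2 = [0.2, 0.1, 0.3, 0.7, 0.9, 0.8, 0.4]

r = pearson(axis_1, axis_2)
all_quadrants = set(product([False, True], repeat=2))
missing = all_quadrants - reachable_quadrants(axis_1, axis_2)
print(f"correlation={r:.2f}, quadrants never observed: {missing}")
```

A high correlation means the two axes are largely measuring the same thing, and an empty quadrant means any persona assigned to it can never be produced by real data.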
This is fantastic! Thanks for sharing. Great step forward towards a pragmatic approach to truly leveraging agentic coding to increase productivity and not slop. Love this.
My workflow for building side projects and work tools with AI coding agents that actually survive past the first month. Covers model choices (Claude Opus, Codex, Qwen), a docs-first approach (PITCH/ARCHITECTURE/IMPLEMENTATION), guardrails, context management via slash commands, and what I stopped using (MCP servers, multi-agent teams, instruction files). Includes dogfooding results — what shipped and what broke.