It's much harder to RL out design taste because it's not self-grounding, and human labelers have no real skin in the game, so this (having a human with a vested outcome in the process directing a model's work) is the best way to get LLMs better at design/"taste"/aesthetic judgment themselves. We were working on the same thing 7 months ago and then I realized that winning over designers to do this would be a huge uphill battle setting up an inevitable fall from grace later on.
What makes me most suspicious of Claude Design is that when you disconnect and reconnect later, it loses context and nags you that the product doesn't work like that. Bullshit. It's at best an anti-abuse/implementation detail (to keep you from launching 10 at once and coming back to them later) or product shortcoming that just so happens to be optimized for keeping you from continuing your design in better tools than theirs for the inevitable followups.
It's great for one-shots, and it makes sense when you're trying to build a vertical product development stack like Anthropic, but I'm disappointed that it feels more like a tool optimized for keeping you in their product than for what you're working on. If a company other than Anthropic had shipped this, I don't think it would really have seen much adoption. It's not that hard to build a visual self-eval loop: use the Chrome DevTools Protocol to run headless Chrome and take screenshots, feed them into a judge LLM for feedback, and continue.
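For reference, the self-eval loop is roughly the following. This is a minimal sketch, not anyone's actual implementation: `render`, `judge`, and `revise` are hypothetical callables you would supply yourself (for example, `render` could drive headless Chrome over the DevTools Protocol and call `Page.captureScreenshot`), and the `Verdict` shape, score scale, and thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    score: float   # 0.0 (bad) to 1.0 (good), as reported by the judge (assumed scale)
    feedback: str  # free-text critique passed to the next revision


def self_eval_loop(
    html: str,
    render: Callable[[str], bytes],     # hypothetical: HTML -> screenshot bytes
    judge: Callable[[bytes], Verdict],  # hypothetical: screenshot -> critique
    revise: Callable[[str, str], str],  # hypothetical: (HTML, feedback) -> new HTML
    target: float = 0.9,
    max_rounds: int = 5,
) -> tuple[str, list[Verdict]]:
    """Screenshot, judge, revise; stop when the judge is satisfied or rounds run out."""
    history: list[Verdict] = []
    for _ in range(max_rounds):
        verdict = judge(render(html))
        history.append(verdict)
        if verdict.score >= target:
            break
        html = revise(html, verdict.feedback)
    return html, history
```

Passing the three steps in as functions keeps the loop independent of any particular browser driver or model API, so it can be exercised with stubs before wiring up the real thing.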
That said, AI trained on Actor-Critic with a tight human feedback loop definitely seems like the right approach to solving the problem, just not something I want to spend my time training for someone else unless I can do so with higher "entropy", i.e. high parallelism/optionality.
Where does the article mention Claude Design? It seems to me the author is using LLMs as a tool for iteration, given he is a designer.
Also, you're mentioning a lot of unrelated tech. DPO, PPO, actor-critic, visual self-eval loops, Anthropic's "vertical product development stack" may be interesting, but they are mostly orthogonal. The article's point is simply that a designer can now turn design proposals into working prototypes faster than with Figma.
Also, you mention what seems to be a random product bug about disconnect and reconnect that doesn't have anything to do with this workflow. It seems to me that you're post-rationalising some insights that are not really there.
Good to think things through and in public, not discouraging it. I hope this reads as constructive.
For certain workloads, it’s not cheaper to run a model on your own GPUs than to pay the $200/mo for Claude. For a large portion of what I work on, the bottleneck is my time, not tokens. You certainly could throw more tokens at it, but if you need it to work a certain way for certain reasons, and your plan/goals are beyond the scope of what the top-capability models can do, then throwing them at the problem just bogs you down in extra cruft or reviews/iteration that you could do more effectively as the primary driver of the work.
Sure, you can keep paying $200/mo to Anthropic forever, and accept heavy censorship on the types of tasks you can do (e.g. malware research), accept no privacy, and accept rate limiting and the requirement of internet access at all times.
Or buy $2400 of GPU today to get within 10% of Opus 4.6 on coding benchmarks. It pays for itself in 1 year, AND you can work with private code and data offline as you like with no censorship or restrictions.
The value proposition of Anthropic is comically bad to anyone who understands how to insert PCI-E cards into a motherboard and install Linux.
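The payback claim is simple arithmetic, sketched below. The $2400 and $200/mo figures come from the comment above; the optional `monthly_running_cost` (electricity and so on) and the $20 value used for it are assumptions added for illustration.

```python
def breakeven_months(
    hardware_cost: float,
    monthly_subscription: float,
    monthly_running_cost: float = 0.0,  # assumption: electricity etc., not in the original claim
) -> float:
    """Months until the hardware purchase equals the subscription fees it replaces."""
    saving = monthly_subscription - monthly_running_cost
    if saving <= 0:
        return float("inf")  # the hardware never pays for itself
    return hardware_cost / saving


print(breakeven_months(2400, 200))                # 12.0
print(round(breakeven_months(2400, 200, 20), 2))  # 13.33
```

The one-year figure only holds if running costs are ignored and the local model really substitutes for the subscription, which is the point the rest of the thread disputes.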
This is what I’ve been doing. I’m not even against external funding, I just see it as instrumental to the ultimate goal of building a sustainable business. Venture capital is basically a super high interest loan so it’s only something you want to take when and where it can be effectively deployed.
Most other founders/business owners and investors I know don’t see that as a controversial statement (that it makes no sense to see access to capital through a scarcity mindset, when it is factually accessible) but because most customers or potential employers aren’t one of those, it’s been a problem because this isn’t what you’re “supposed to do” and so they read into it from a social/legitimacy angle.
Regardless, it’s quite rewarding IMO and I highly suggest it to people with the means to pursue that path. I don’t see why people get so worked up about the whole VC thing; at the end of the day it’s just lending. Having dabbled in angel investing, on the one end you get a lot of people lining up for what they clearly perceive as unsecured loans or a social signal, and on the other so many people get caught up in the dynamic of dangling said unsecured loans (when I first started it felt like some of the investors reaching out to me were just doing it to boss me around or something, like sir you can clearly see I started this two months ago and you dm’d me on LinkedIn to chat, I don’t need your money).
IMO most good founders/investors are credibility-maxxing but because of the social dynamics and moral hazards inherent with spending OPM you get weird other behavior
Thanks for sharing this. This is the same as what I am thinking. I still see the gravitational pull of doing what seems like the supposed thing to do, but it makes sense to do what the model in our mind suggests makes the most sense, especially when reality is rapidly evolving and historical information may be outdated.
There are basically two tiers of "Chinese models" in this context, the "edge" sized ones with ~30B parameters or less, and the big ~1T models that can basically only run in the datacenter.
I don't think it's as simple as saying China's hosting is subsidized, they have generally cheaper electricity and labor costs than in the US and don't have access to the top tier models, and a large internal market where the big models are the best thing they can run with what they have. So obviously they max out on their top models (which are trained with their hardware market in mind, not ours) and get the economy of scale from that, and can run generally the same hardware for less money than in the US because
The edge models are very cheap to run and can do so on inexpensive hardware. They are something like 95% cheaper to run than Haiku, so the math is in their favor for certain batch workloads. Most people who do that just run the models for themselves without making them available on OpenRouter or whatever, because you can just provision a GPU node and use it as needed, and it's not that expensive to run this family of models.
Is your problem that you want to call Chinese models hosted in the US because you're worried about the data handling?
I obviously don't know the full economics of the Chinese-hosted models, but estimates[1] put the cost of hardware (servers + networking) at 70-80% of the total cost. Those things aren't meaningfully cheaper in China, so serving DeepSeek at 1/3 the cost of the cheapest US provider doesn't really compute unless it's heavily subsidized or we believe that Chinese engineers are just that much better at optimization.
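To make the arithmetic explicit: if hardware is 70-80% of total serving cost and is no cheaper in China, then even free electricity and labor cannot bring cost below 70-80% of the US figure. A small sketch follows; the function and its parameters are illustrative names, and it assumes price tracks cost.

```python
def relative_cost(hardware_share: float, other_cost_ratio: float) -> float:
    """Serving cost relative to a US provider (US = 1.0).

    hardware_share:   fraction of US cost that is hardware (0.7-0.8 per the estimate)
    other_cost_ratio: non-hardware costs (power, labor) relative to the US;
                      0.0 means free, 1.0 means identical
    Assumes hardware itself costs the same in both places.
    """
    return hardware_share + (1.0 - hardware_share) * other_cost_ratio


for share in (0.7, 0.8):
    floor = relative_cost(share, 0.0)  # best case: everything but hardware is free
    print(f"hardware share {share:.0%}: cost floor {floor:.0%} of US, vs. a price of ~33%")
```

Under these assumptions the floor is roughly double the observed price, so the gap has to come from somewhere else: subsidy, better optimization, or hardware that is in fact priced differently.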
Edge models, yes, they can be convenient to run batch jobs locally. I still would argue there's no economic benefit over paying for models. Haiku has a bad price/perf but others in that class are significantly cheaper in hosted APIs.
Doesn't matter what I think, the reality is that the majority of enterprises (where the real $ comes from) will not consider sending their data to China.
Hardware is arbitrarily priced, with the floor being as little money as it costs to make it, and the ceiling being how much competitors are willing to pay for it - the latter is much more of the driver of current pricing in the West than in China.
In a free market, the country would not matter, but Chinese models are often running on domestic hardware which does not directly compete with Nvidia GPUs and thus they can't get away charging as much for it.
I pretty strongly feel the opposite way. Granted, I have not used DeepSeek enough to “know” their model idiosyncrasies as well as Anthropic’s, so there is a partial skill issue. But I just find it really hard to justify using a less powerful model while I work.
The most I’ve ever spent in a month extra on API tokens for my own work is $200, and I pay for the $200/mo Claude. I use these models quite a lot, though not idly (I usually just walk around and do other stuff until I know how I’m going to approach the next set of problems). So it costs me about $3000/year to get as much as I want of the best model available. Already that seems low enough to not be worth stressing out too much about optimizing it, because it feels like an indisputable good value, and trying to save money with a less powerful model would be optimizing for a $1000-$2000 saving at the expense of a large portion of my work taking longer or being more frustrating and iterative.
That’s not a flex or anything. I get that in other countries $3000/yr is a lot of money for a software developer, and also that a lot of people would perhaps rationally be better off doing X% worse at work or spending Y% more time on tasks to save $Z, if their productivity improvements didn’t translate to more salary. Otherwise, if your performance has more upside, I really do think that the smartest models are better with the current pricing scheme. DeepSeek and the other Chinese models spend a LOT of time thinking, and tend to be much more jagged (benchmaxxed) in performance. How can dealing with that over an entire year be worth $2k?
The only situation I can think of where sacrificing my own time/performance to save on inference makes sense is batch compute (of course, $1k vs $100k is different from $30 vs $3k) or work where the tier 2 models have crossed the “good enough” threshold. But I think Opus is not even close to that threshold generally yet. As it gets smarter, I (and I think most others, probably) just try to do harder things faster and hit the next wall.
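One way to put a number on that trade-off is to ask how many hours of lost time per year cancel out the saving. A rough sketch: the $2000 figure is from the comment above, while the hourly values are hypothetical and only there to show the shape of the calculation.

```python
def breakeven_hours_per_year(annual_saving: float, hourly_value: float) -> float:
    """Hours of extra work per year at which a saving is fully cancelled out."""
    if hourly_value <= 0:
        raise ValueError("hourly_value must be positive")
    return annual_saving / hourly_value


# Hypothetical hourly values of a developer's time.
for rate in (25, 50, 100, 200):
    hours = breakeven_hours_per_year(2000, rate)
    print(f"${rate}/hr -> {hours:.0f} hours/year")
```

At the higher rates the budget is only a few minutes of extra friction per working day, while at the lower rates it is large enough that the cheaper model can be the rational choice.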
Not even SotA models are good enough to generate code (beyond functions or small, very simple modules) that I'd be happy shipping, so I've decided to just not have them do that. And given this, it has basically turned out that what's left is information gathering + analysis + design overview stuff.
I've just recently started trying out DeepSeek 4 Flash and I was very skeptical at first because I've had some really good experiences with GPT-5.{4,5}, and couldn't possibly believe that this model they charge nothing for could give me similar results, but it absolutely shreds through things and ends up giving me very good answers in almost no time. I also like that it doesn't really seem to have much personality, it's given me mostly just facts and data so far without any additions to the prompt by me.
In my own agent I also specifically prompt to remove flowery language, snark, etc., but I haven't tried that with models like GPT-5.x, which I've found have too much personality and try too hard to make it seem like I'm talking to a human.
I feel similarly. I'll gladly pay to use the most intelligent model I can find on the best harness I have. Sometimes this is GPT Pro, sometimes this is Opus.
I ask AI a lot of questions, not only about code but about my personal life, and I would be willing to pay very large sums to have the best quality output.
I think that's true for now, but eventually we will reach a point where a model is good enough (we're approaching that right now with frontier models) and there will be diminishing returns. I don't need a PhD-level genius to build me an analytics dashboard, for example, so why would I pay for a model with that level of intelligence when I can (eventually) self-host a good enough model and run queries for electricity cost + hardware?
I think we are approaching that now, with correct expectations. With frontier large models you can often one-shot tasks with vague prompts for stuff like creating CRUD APIs and dashboards around a simple data model, since it's such a solved problem now. With something like Qwen3.6 27B or 35B-A3B and a Strix Halo level computer or a MBP with 32GB or more of RAM, you may need to be more explicit, stay involved, and be a little more patient, but you can absolutely get work done with it or delegate tasks to it successfully.
My Framework Desktop does a lot of similar work as my Claude subscription at work (Cowork, chats) for 100W of power draw and a little patience waiting for a slow GPU with limited memory bandwidth to crunch the numbers. Agentic coding is obviously weaker but CRUD development and visualization dashboards are within reach, and I'm usually pleasantly surprised at its ability to self-manage devops.
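For a sense of scale, the electricity side of that is easy to estimate. A quick sketch: the 100W figure is from above, while the $0.15/kWh price and the duty cycles are assumptions, so plug in your own.

```python
def monthly_energy_cost(
    watts: float,
    price_per_kwh: float,        # assumption: your local electricity price
    hours_per_day: float = 24.0, # assumption: running flat out all day is the upper bound
    days: int = 30,
) -> float:
    """Electricity cost of running a machine at a constant power draw for a month."""
    kwh = watts * hours_per_day * days / 1000.0
    return kwh * price_per_kwh


always_on = monthly_energy_cost(100, 0.15)                  # 72 kWh per month
workday = monthly_energy_cost(100, 0.15, hours_per_day=8)   # 24 kWh per month
print(f"24/7: ${always_on:.2f}/mo, 8h/day: ${workday:.2f}/mo")
```

Even running around the clock, the power bill at that price is small next to a subscription, so the real costs are the hardware itself and the patience.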
I agree. My company pays for my tokens, so I use the best models I can. I'm more worried about the quality of the work and the speed of accomplishing tasks than about saving the most money on every token.
Now, if they come back and tell me I can't spend as much on tokens, I'll have to change my strategy. But everything I'm hearing so far is that we're going to be increasing our token spend this year and probably next year too. Not crazy increases, but maybe enough to keep using the latest models without being anxious about every prompt.
It's through my startup, so both I guess. Generally I find my bottleneck to be attention and focus, and the opportunity cost of not going back to work at my prior employers absolutely dwarfs the amount of money I spend on tools, so it's not hard for me to justify spending $200/mo on something I use every day that makes me more productive and generally removes bullshit from my life.
At my prior job there was still what felt like a strong enough correlation between my actual performance and my pay that I don't think I would have had a hard time justifying the expense there either; now I absolutely don't. With the current state of the models, it's baffling to me to hear about professional software developers planning their work around their $20/mo subscription's quotas.
Obviously it's more complicated than more tokens = more productive, but I see them less like SaaS and more like gasoline, where if I run out or need more to do what I'm doing, as long as I'm not being wasteful, I just buy more. Why would I waste a day walking 30 miles by foot when I can just pay $5 for gasoline and drive?
I do that for personal use too (although it's $2.4k/yr for me because I only have a Claude Max subscription). Outside of my hobby projects, Opus also manages my personal accounting, researches and organizes info (travel plans, what to buy and where to buy it, etc.), helps me reply to emails when I'm working in the kitchen, etc. I consider it well worth the price. Tbh I'm willing to pay more than what I currently do, but competition is good for consumers.
A formative moment for me was reading Richard Stallman's writing on the GNU website and seeing him quote [0] Rabbi Hillel [1]:
"If I am not for myself, who will be for me? If I am only for myself, what am I? And if not now, when?"
This inspired me to learn more about Rabbinic Judaism and its theology, and I found the language and analogies concerning the idea of "repairing the world" (which you referenced, but which I think most people wouldn't necessarily identify at first glance as a specific core doctrinal theme) particularly inspiring [2]. To me it's frankly beautiful and something I recommend anybody interested in metaphysics or ethics/morality look into; it also ties into the Kabbalah. IMO this aspect of Jewish theology deserves to be more widely known because it's something all of us can learn from.
I think there is a reasonable basis for taking a gamble that small models capable of fitting on a 32GB card will continue to advance over the next 5 years and eventually approach Gemini Flash 3.5 / Sonnet 4.6 levels of capabilities, which I would consider to be past the threshold of “probably worth the cost and hassle of running 24/7” if the upfront cost of the hardware was palatable.
My use case would primarily be in search, integration, and indexing other software projects with my own, as well as transcription/indexing of interesting video and audio content (eg Dwarkesh interviews) that I don’t have time to watch but want to easily search and apply to my projects, and search/indexing for useful information from things like Linux kernel and security mailing lists. Basically there is a lot of stuff that, if the cost were low enough, I would point a reasonably intelligent AI at to distill out useful information and apply it to my projects, or just cherry pick the interesting things out and surface them to me so I don’t have to wade through all the mundane stuff and man-made slop getting in the way.
>My use case would primarily be in search, integration, and indexing other software projects with my own, as well as transcription/indexing of interesting video and audio content (eg Dwarkesh interviews) that I don’t have time to watch but want to easily search and apply to my projects, and search/indexing for useful information from things like Linux kernel and security mailing lists. Basically there is a lot of stuff that, if the cost were low enough, I would point a reasonably intelligent AI at to distill out useful information and apply it to my projects, or just cherry pick the interesting things out and surface them to me so I don’t have to wade through all the mundane stuff and man-made slop getting in the way.
All of that feels like something that a $20 ChatGPT Pro subscription is for, maybe with slightly better tool use capabilities. There's no way that a $4000 GPU purchase would ever be worth it if all you're doing is running a handful of queries per day.
It would require much more than a couple of queries per day. I basically want to do bulk ingestion and search/evaluation/integration across tens of thousands of videos and software projects (if it were cheap enough and smart enough). It would basically mean setting up and operating a pretty large data ingestion and coding agent pipeline, which I would want to itself be mostly automated.
It’s ok if you don’t want to do the same kind of thing but I find it weird how dismissive so many people get about wanting to use LLMs for large projects, or how anybody who says they’re using them for these kinds of things (I’m doing similar for other stuff) gets challenged on what they’re doing it for.
In the long run cloud gaming is inevitable. It’s just more economically efficient for the cost of the hardware required to render graphics to be amortized across consumers, and for that hardware not to sit idle when unused, by colocating it with game assets in POPs.
Once enough gaming compute runs at the edge it also allows for more technically advanced games than would currently be economically feasible (but aren’t made mostly for lack of a market/adoption of cloud gaming and the resulting lack of technical know-how). So I think it will stick and probably end up winning over the holdouts, once the cost of rendering the games they want to play with consumer hardware becomes too large to stomach.
You could make the same economic argument for any SaaS, but the margins SaaS providers look for make it so that the only time it isn't cheaper to run your own software/hardware stack in place of SaaS is when the hardware requirements are very low, not high. SaaS makes sense economically when you take into account the admin, compliance, etc. costs... and the admin costs of a Nintendo Switch are pretty low.
Economic efficiency does not win the day because the free market is a myth. Cloud gaming is a technically worse solution because the latency floor is higher. It's a microeconomic disaster (rent vs buy, buy wins). The only reason it would become a thing is if the multinationals succeed in concentrating more wealth and power, which consumers aren't interested in supporting. It's a bad deal and consumers know it. They would have to be forced into it by having the consumer hardware market taken off the table (which is happening and the only possible avenue for a technical regression like cloud gaming to have a market).