Hacker News | dgreensp's comments

Where do you see information about the efficiency gains over AV1?

It's at the end on the conclusion slide @24:40:

https://www.youtube.com/watch?v=Se8E_SUlU3w&t=1480


This reveals a staggering level of incompetence, if that’s really all it is, and lack of transparency.

They don’t have ANY product-level quality tests that picked this up? Many users did their own tests and published them. It’s not hard. And these users’ complaints were initially dismissed.

I don’t think the high vs medium change is really on par with the others. That’s a setting you change in the UI, and depending on what you are doing, both effort levels are pretty capable, they just operate a bit differently. Unless I’m missing something and they are saying they were doing some kind of routing behind the scenes.

If they are constantly pushing major changes to the prompts and workings of the tool, without communicating about it, and without testing, it’s likely there are other bugs and quality-degrading changes beyond the ones in this article, which would make a lot of sense.


> If they are constantly pushing major changes to the prompts and workings of the tool, without communicating about it

These are all classic symptoms of vibe-induced AI velocitis, sold by AI-peddlers as the future of the industry under the guise of "productivity."

AI can help one generate a lot of code, but the poor engineers approving the deluge of changes are still using their old, unmodified, stock meat-brains. An individual change may look fine in isolation, but when it's interacting with hundreds or thousands of other changes landing the same week, things can go south quickly.

Expect more instability until users rebel, and/or CTOs and CIOs cry uncle. Amazon reportedly internally sounded the alarm after a couple of AI-tool-induced SEVs. The challenges at GitHub and the company insisting you don't call it Microslop are also rumored to be AI-related.


About 20 years ago I maintained a shop floor control client/server application. I asked my manager why we didn't have any independent Q/A. He said we didn't need any testers because we have 500 in the building.

Wild west days then.

Looks like we are back.


It is worse than that. People have been complaining for weeks and Anthropic’s message was basically “you are holding it wrong”. On top of that, this misconfiguration somehow makes CC consume many more tokens. How believable is all that?


Back implies we ever left.


Time is finite and regression testing always gets punted to the back of the line when humans are excited. This simply reveals a staggering level of humanity.


Software engineering is not a new field. Best practices on testing are mature now, and Anthropic has poached enough engineers from companies with a solid understanding of those practices.

Yet, their flagship product got three really bad changes shipped into it and only resolved after more than a month.

This raises another question: with all the industry-wide boasting about AI-driven productivity, why does the leading company in agentic coding take over a month to fix severe customer-reported issues?


> Why does it take the company that is probably the best at agentic coding more than a month to find and solve such large regressions, even with customers complaining about them?

My unfounded suspicion: because this is the tradeoff we're all facing and for the most part refusing to accept when transitioning over to LLM-driven coding. This is exactly how we're being trained to work by the strengths and limitations of this new technology.

We used to depend on maintaining a global if incomplete understanding of a whole system. That enabled us to know at a glance whether specs and tests and actual behavior made sense and guided our thinking, enabling us to know what to look at. With agentic coding, the brutal truth is that this is now a much less "efficient" approach and we'll ship more features per day by letting that go and relying on external signs of behavior like test suites and an agent's analysis with respect to a spec. It enables accomplishing lots of things we wouldn't have done before, often simply because it would be too much friction to integrate it properly -- write tests, check performance, adjust the conceptual understanding to minimize added complexity, whatever.

So in order to be effective with these new tools, we're naturally trained to let go of many of the things we formerly depended on to keep quality up. Mistakes that would have formerly been evidence of stupidity or laziness are now the price to pay for accelerated productivity, and they're traded off against the "mistakes" that we formerly made that were less visible, often because they were in the form of opportunity cost.

Simple example: say you're writing a simple CLI in Python. Formerly, you might take in a fixed sequence of positional arguments, or even if you did use argparse, you might not bother writing help strings for each one. Now because it's no harder, the command-line processing will be complete and flexible and the full `--help` message will cover everything. Instead, you might have a `--cache-dir=DIR` option that doesn't actually do anything because you didn't write a test for it and there's no visible behavioral change other than worse performance.
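
To make that concrete, here is a minimal sketch of such a CLI plus the one check that would catch a silently ignored `--cache-dir`. The tool name `fetch`, the `expensive` stand-in, and the on-disk layout are all made up for illustration; the point is only that the test asserts an observable effect of the flag, not just that the flag parses.

```python
import argparse
import tempfile
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical tool: names and options are illustrative only.
    parser = argparse.ArgumentParser(prog="fetch", description="Example tool.")
    parser.add_argument("key", help="item to compute")
    parser.add_argument("--cache-dir", metavar="DIR", default=None,
                        help="directory for cached results")
    return parser


def expensive(key: str) -> str:
    # Stand-in for slow work.
    return key.upper()


def run(argv: list[str]) -> str:
    args = build_parser().parse_args(argv)
    if args.cache_dir is None:
        return expensive(args.key)
    cache_file = Path(args.cache_dir) / f"{args.key}.txt"
    if cache_file.exists():
        return cache_file.read_text()
    result = expensive(args.key)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(result)
    return result


def cache_dir_has_effect() -> bool:
    """Return True if passing --cache-dir leaves a cache file on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        run(["hello", "--cache-dir", tmp])
        return (Path(tmp) / "hello.txt").exists()
```

If `run` parsed the option but never used it, `cache_dir_has_effect()` would return False even though `--help` would still look complete.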

Closely related, what do you do with user feedback and complaints? Formerly they might be one of your main signals. Now you've found that you need dependable, deterministic results in your test suite that the agent is executing or it doesn't help. User input is very very noisy. We're being trained away from that. There'll probably be a startup tomorrow that digests user input and boils out the noise to provide a robust enough signal to guide some monitoring agent, and it'll help some cases, and train us to be even worse at others.


> you might have a `--cache-dir=DIR` option that doesn't actually do anything

Working in enterprise software it's surprising how long an option that doesn't actually do anything can be missed. And that was before AI and having thousands of customers use it.

This same problem happens with documentation all the time. You end up with paragraphs or examples that simply don't reflect what the product actually does.


Where I work, options that don't do anything are seen as good engineering practice. You see, you can't break your user's scripts. Your CLI arguments are part of your stable API. If your tool used to have a cache_dir CLI option, and now no longer needs it, you still have to keep accepting cache_dir and treat it as a no-op until you are confident your users have migrated away from it.
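
A rough sketch of what that can look like with Python's argparse, assuming a hypothetical tool that used to take a cache directory: the option is still parsed (in both spellings), hidden from `--help`, and only triggers a deprecation warning. The function names here are made up.

```python
import argparse
import warnings


def build_legacy_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="tool")
    parser.add_argument("input")
    # Still accepted so existing scripts keep working; hidden from --help.
    parser.add_argument("--cache-dir", "--cache_dir", dest="cache_dir",
                        default=None, help=argparse.SUPPRESS)
    return parser


def parse_legacy_args(argv):
    args = build_legacy_parser().parse_args(argv)
    if args.cache_dir is not None:
        # The value is deliberately ignored: the option is a no-op.
        warnings.warn("--cache-dir is deprecated and has no effect",
                      DeprecationWarning, stacklevel=2)
    return args
```

The difference from the accidental case is that here the no-op is intentional, documented in code, and visible to callers through the warning.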


I've been working on this problem coming from the program synthesis school of thought over at https://promptless.ai (which you would have no clue just from looking at the website because it's targeted at tech writers).

I'm quite fond of the idea of incremental mutation of agent trajectories to move/embody some of the reasoning steps from LLM tokens into a program. Imagine you have a long agent transcript/trajectory and you have a magic wand to replace a run of messages with "and now I'll call this script which gives me exactly the information I need," then seeing if the rewritten trajectory is stable.

To give credit where it's due, it's an overly complicated restatement of what Manny Silva has been saying with docs-as-tests https://www.docsastests.com/. Once you describe some user flow to humans (your "docs"), you can "compile" or translate part or all of those steps into deterministic test programs that perform and validate state transitions. Ideally you compile an agent trajectory all the way.
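
As a toy illustration of the "compile a documented flow into a deterministic test" idea (this is not Doc Detective's API, just a sketch with made-up names), each documented step becomes an action plus a check on the resulting state:

```python
from dataclasses import dataclass
from typing import Callable

State = dict  # toy stand-in for real product state


@dataclass
class Step:
    description: str                 # the sentence from the user guide
    action: Callable[[State], None]  # performs the state transition
    check: Callable[[State], bool]   # validates the resulting state


def run_flow(steps: list[Step]) -> list[str]:
    """Run steps in order; return descriptions of steps whose check failed."""
    state: State = {}
    failures = []
    for step in steps:
        step.action(state)
        if not step.check(state):
            failures.append(step.description)
    return failures


# Hypothetical documented flow, hand-translated into steps.
SIGNUP_FLOW = [
    Step("Create an account",
         lambda s: s.update(user="ada"),
         lambda s: s.get("user") == "ada"),
    Step("Create a project",
         lambda s: s.update(projects=["demo"]),
         lambda s: "demo" in s.get("projects", [])),
]
```

In a real setup the actions would drive the product (HTTP calls, a browser, a CLI) rather than mutate a dict, and a failing step points back at the doc sentence it came from.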

So: working with coding agents, you've cranked up the defect rate in exchange for speed, so let's try testing all important flows. The first thing you try is: ok, I've got these user guides, I guess I'll have the agent follow along and try to do it. And that works! But it's a little expensive and slow.

So I go, ok I'll have the agent do it once, and if it finds a trajectory through a product that works, we can reflect on that transcript and make some helper scripts to automate some or all of those state transitions, then store these next to our docs.

And then you say, ok, if I ship a product change, can I have my coding agent update those testing scripts to save the expense and time of re-running the original follow-along? Also an obvious thing to do, and you can totally build it yourself with Claude Code in a GitHub Action. But I think there is a lot of complexity in how you go about doing this, and in what kind of incremental computation you can do to keep the LLM costs of all this under a couple hundred bucks a month for teams shipping 20 changes a day with 200 pages of docs.

The most polished open source "compiler/translator" I've seen exploring these ideas so far is Doc Detective (https://doc-detective.com) by Manny.


I am not sure this approach can take you very far.

In my experience, CC makes it very very easy to _add_ things, resulting in much more code / features.

CC can obviously read/understand a codebase much faster than we do, but this also has a limit (how much context we can feed into it) - I think your approach is in essence a bet that future models' ability to read/understand code (size of context) improves as fast or faster than the current models' ability to create new code.


Ouch. I guess this came across as "my approach". I haven't done enough agentic coding to feel like I know enough to have a worthwhile opinion, but at the moment I'm squarely in your camp. I don't believe it's going to work to let an agent loose expanding a teetering codebase with little to no concern for maintainability. We're going to have to painfully relearn the lessons of pre-AI coding, whatever that means with AI in the mix.


> Closely related, what do you do with user feedback and complaints? Formerly they might be one of your main signals. Now you've found that you need dependable, deterministic results in your test suite that the agent is executing or it doesn't help. User input is very very noisy.

I don't even use Claude and it has been rather clear to me, that their service has not been working properly for some time now.


> digests user input and boils out the noise to provide a robust enough signal to guide some monitoring agent

not to sound uncharitable but this seems like the absolute worst way to run a business; your customers are basically lab rats... why should they pay for anything in this scenario?


I just said someone's gonna build it, not that it's a good idea!

To be fair [to myself], this is scale-dependent. I work on a product with hundreds of millions of users. We're not going to be reading and pondering every bit of feedback we get. We have automation for stripping out some of the noise (eg the number of crash reports we get from bit flips due to faulty RAM is quite significant at this scale). We have lines of defense set up to screen things down -- though if you file a well-researched and documented bug, we'll pay attention. (We won't necessarily do what you want, but we'll pay attention.)

When I worked at a much smaller and earlier stage company, we begged our users for feedback. We begged potential users for feedback. We implemented some things purely to try to get someone excited enough that they would be motivated to give feedback.

Anthropic, OpenAI, Google? They have a lot of users.

Also, this automation would be in addition to the other channels by which you'd pay attention to feedback.

Also also, the ship has sailed. We're all lab rats now. We're randomly chosen to be A/B tested on. We are upgraded early as part of a staged rollout. We're region-locked. Geocoded. Tracked as part of the cohort that has bought formula or diapers recently. Maybe we live in the worst of all possible worlds?


> There'll probably be a startup tomorrow that digests user input and boils out the noise to provide a robust enough signal to guide some monitoring agent, and it'll help some cases, and train us to be even worse at others.

This sounds like Enterpret.


My theory is that most problem solvers are bad at solving problems, and most managers are bad at managing, and it doesn't matter how evolution created them: They'll make mistakes, they'll have finite time and energy, a finite context window, they'll lie and internally rewrite their own internal narratives as needed, and forget things, and drop balls, and they'll go in circles trying to find a bug they created but are too close to be able to see, and they're going to need a lot of external tooling to get through the day without forgetting anything, and constant reminders from others to get shit done. And this dynamic fundamentally creates peaks and valleys in productivity.

Wait, were we talking about humans or AI?

...

Everyone seems to be assuming either the humans or the AI has to be special. What if neither is?


models are great but models don't magically fix things. you need to set up systems to handle the output of code, you need to instrument metrics for the llm to listen to and flag. experimentation is a huge problem: with the huge output of code, how do you keep your business metrics clean and isolate issues? these are all hard challenges.

in response, most companies are explicitly trading velocity for quality, and finding out that quality is actually important at the end of the day. if you look at the roadmap it's just ship ship ship. eng is being told to 3x their output. quality in the llm coded world is tough and there's not much appetite for it right now.


> This simply reveals a staggering level of humanity.

Pretty embarrassing for an AI company. Surely AI should be doing their regression testing?


> This simply reveals a staggering level of humanity.

Wasn't AI supposed to solve all the drudgery? All those humans aided by cutting edge AI are still failing at these basic tasks? Then how good is that AI in the first place?


No, AI wasn't supposed to solve all that drudgery. The hypothesized AI singularity would, but an ordinary AI agent running an LLM is just a problem solving automaton with no will of its own, just like a fleshy brain solving computer problems is just a code monkey.


I would think that many of these defects should show up clearly in service-side analytics as well. For example, the bug that repeatedly re-cleared thinking for old sessions would cause a substantial drop in token cache hit rate for sessions > 1hr for the affected claude code versions. Session age & claude code version seeeem like obvious dimensions for analytics. But perhaps only in hindsight.
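
A rough sketch of that kind of breakdown, assuming a hypothetical per-request log with `version`, `session_age_s`, `cached_tokens` and `input_tokens` fields (these field names are invented, not anything Anthropic has described):

```python
from collections import defaultdict


def cache_hit_rate_by_segment(requests):
    """Token cache hit rate per (client version, session age bucket).

    `requests` is an iterable of dicts with the hypothetical keys
    'version', 'session_age_s', 'cached_tokens', 'input_tokens'.
    """
    totals = defaultdict(lambda: [0, 0])  # segment -> [cached, total]
    for r in requests:
        bucket = ">1h" if r["session_age_s"] > 3600 else "<=1h"
        segment = (r["version"], bucket)
        totals[segment][0] += r["cached_tokens"]
        totals[segment][1] += r["input_tokens"]
    return {
        segment: cached / total
        for segment, (cached, total) in totals.items()
        if total
    }
```

A regression like the one described would show up as one version's `">1h"` segment falling well below both its own `"<=1h"` segment and other versions' `">1h"` segments.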


They say that they did test but the coverage was not enough to pick it up, at least for the prompt change:

“After multiple weeks of internal testing and no regressions in the set of evaluations we ran, we felt confident about the change and shipped it alongside Opus 4.7 on April 16.

As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evaluations. One of these evaluations showed a 3% drop for both Opus 4.6 and 4.7. We immediately reverted the prompt as part of the April 20 release.”
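
For readers unfamiliar with the term, a leave-one-out ablation of the kind the quote describes can be sketched roughly as below. The `evaluate` callable is a stand-in for running a real evaluation suite, and `toy_evaluate` is a made-up scorer used only to show the mechanics; none of this is Anthropic's actual tooling.

```python
def ablate_lines(prompt_lines, evaluate):
    """Return [(removed_line, score_delta)] for leave-one-out ablations.

    `evaluate` takes a list of prompt lines and returns a numeric score.
    A positive delta means the score improved when the line was removed.
    """
    baseline = evaluate(prompt_lines)
    results = []
    for i, line in enumerate(prompt_lines):
        without = prompt_lines[:i] + prompt_lines[i + 1:]
        results.append((line, evaluate(without) - baseline))
    return results


def toy_evaluate(lines):
    # Toy scorer: pretends one particular line costs 3 points.
    return 100 - (3 if "LINE_B" in lines else 0)
```

Each ablation costs one full evaluation run, which is part of why the breadth of the evaluation set matters so much.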

Considering the number and scope of users they serve, I can sympathize with the difficulty. However, they should reimburse affected users at least partially instead of just announcing “our bad, sorry”. That would reduce the frustration.


Naively, one could assume that with AI it should be possible to create a long and broad list of test cases…


To me it reads more like they are struggling to scale with requests and are trying to find ways that hurt users the least.


You’re talking about their intentions. OP is talking about how they don’t test continuously / densely enough for quality. I think both can be true.


To give my best guess, I think that the change of default effort is unrelated to the major problems users encountered, but that it was put first and given prominence to cover up, at least a little, the huge failure of the other two.

The first thing you read, and it takes up a big part of the post, amounts to: not really a bug, but we changed a default without communicating it well, and users (their fault) did not notice. This is why they were "under the false impression" of a change.

Lots of people will stop reading after a few paragraphs.


Eh :) Let's not forget the humans on the other end of this.

One of them was a bug that didn't present itself until after an hour of usage.


Seems like that would be trivial to test?


Most bugs are trivial to test for after you know about them.


True, but when your cache configuration has exactly 2 TTLs and modalities, I don't think it's off base to expect them to test what happens in the cache hit/miss scenarios for each of those.

(I write this as someone who likes Claude Code, if that matters.)
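
The hit/miss matrix mentioned above is small enough to sketch. This is a generic TTL cache with an injectable clock, not Claude Code's cache; the two TTL values (5 minutes and 1 hour) are assumptions chosen for illustration.

```python
class TTLCache:
    """Minimal TTL cache with an injectable clock so tests need not sleep."""

    def __init__(self, ttl_s, clock):
        self.ttl_s = ttl_s
        self.clock = clock
        self._store = {}

    def put(self, key, value):
        self._store[key] = (value, self.clock())

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self.clock() - stored_at >= self.ttl_s:
            del self._store[key]  # expired: treat as a miss
            return None
        return value


def check_hit_and_miss(ttl_s):
    """Return (value just before expiry, value just after expiry)."""
    now = [0.0]
    cache = TTLCache(ttl_s, clock=lambda: now[0])
    cache.put("k", "v")
    now[0] = ttl_s - 1
    hit = cache.get("k")
    now[0] = ttl_s + 1
    miss = cache.get("k")
    return hit, miss


TTLS = (300, 3600)  # assumed values: 5 minutes and 1 hour
```

Calling `check_hit_and_miss` for every entry in `TTLS` covers both sides of each expiry boundary without waiting an hour of wall-clock time.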


Their in-house philosopher thinks Claude gets anxiety though


There were few systems like Claude in the past, so the testing rulebook is not really written yet. And it is far from obvious.


LLM evals are well established, are these not applicable here?


I find the irreverent tone refreshing, personally.

As a founder who built all my prototypes and side projects on Deno for two years, I personally think Deno’s execution was just horrible, and avoidably so. Head-scratchingly, bafflingly bad decision-making.

I was the first engineering hire at Meteor (2012-2016), and we made the mistake of thinking we could reinvent the whole app development ecosystem, and make money at it, so I have the benefit of that experience, but it is not really rocket science or some insight that I wouldn’t expect Ryan Dahl and team to have, in the 2020s.

They were stretched thin with too many projects, which they were always neglecting or rewriting, without a solid business case. They coupled together runtime, framework, linting, docs, hosting, and packaging, with almost all of these components being inferior to the usual tools. The package system became an absolute nightmare.

If the goal was to eventually replace Node and NPM with something where TypeScript was first-class, there was better security, etc, they could have done a classic “embrace and extend.”


Pivoting to Node support and, even more so, rewriting Deploy really hurt momentum on top of all those projects. Coming out swinging with 2.0 and then decreasing regions and rewriting the product that makes you money soon after was certainly a choice.

While they’ve been doing that, void 0 made a significantly better linter & formatter that can replace eslint, a perfect embrace and extend. Node’s improved and a lot of annoyances have been ironed out (at a user level at least); with each passing day the benefits of Deno are shrinking.

Fresh is close to abandonware despite being a framework that could be the middle ground between htmx and js framework insanity with even 1 man-day a fortnight dedicated to it.

JSR seems like it’s going nowhere and only exists to install @std.

This was the final straw for me, if they bounce back in a few years hell yeah I’m in but I’m begrudgingly back to node for now.


I agree with all this.

I came to Deno because I needed a break from Node/NPM. I don’t agree with all of Node’s decisions (particularly the ES module debacle), but Node/NPM have improved over the years.

A big problem with JSR was no private packages. All your code has to be open source. But JSR is the only way to get constraint solving in Deno, besides using NPM.


This piece starts off making it sound like the computer is pretty much doing all the work, while the human maybe weighs in on a matter of taste once in a while, if they like, but by the end, the list of what the LLM can actually do is really short. Implementing a sorting algorithm for you, perhaps, but not necessarily one without “egregious flaws,” and really you should still use a library for that. Replacing high-quality libraries of mature software, that have tests, etc, is obviously one of the poorer uses of vibe-slop coding.

It comes down to “adding code” that attempts to, or seems to, achieve something.


I always interpreted cathedral vs bazaar as being about the architecture of large things. Do you build to a master plan? Or does everyone do whatever they want? (Within some kind of framework, of course.) Like the cathedral of the Java SDKs vs the flea market of NPM.

This author seems to have some kind of attitude about organization in general—anything with people and process, that happens to exist around some project, that might require at least a small commitment to be a part of. Like complaining that a flea market has a form to sign.

The ability for people to functionally collaborate, with some kind of structure, is the key thing that enables building large things together.


By that logic, Microsoft’s brand means nothing when OpenOffice is free.


Microsoft is a robust business, with corporate contracts going back 40 years. There are going to be exceptions and winners, and Microsoft is probably a winner.


Have you noticed how rapidly Microsoft is setting its own brands on fire? Most recently by abandoning the brand "Microsoft Office" and renaming that product to "Microsoft 365 Copilot App" instead.


I also cannot understand how they mix up their brands so much, even people working in the MS ecosystem need to learn new brands every year.


A curly brace is multiple tokens? Even in models trained to read and write code? Even if true, I’m not sure how much that matters, but if it does, it can be fixed.

Imagine saying existing human languages like English are “inefficient” for LLMs so we need to invent a new language. The whole thing LLMs are good at is producing output that resembles their training data, right?


I ran into this, and there was a bizarre fix—I think having Adobe apps open in the background caused it, or something.


I saw some responses like this. I have zero Adobe apps on my Mac.


Upvoted because educational, despite the AI-ness and clickbait.

I’ve worked at orgs that used Postgres in production, but I’ve never been the one responsible for tuning/maintenance. I never knew that Postgres doesn’t merge pages or have a minimum page occupancy. I would have thought it’s not technically a B-tree if it doesn’t.


This is some of the best writing I've read in a while, and truly fascinating.

