Hacker News | geraneum's comments

> Frontier models are mostly past the point of human ability to discern whether they are actually better or worse than predecessors and competitors.

The model improvements being beyond human comprehension is one of the more ridiculous statements I’ve heard in the last couple of days about AI. We could reason about Higgs bosons and gravitational waves but have no ability to quantify or reason about the difference between Opus 4.7 and 4.8.


I definitely believe that you can discern differences between Opus 4.6, 4.7, and 4.8. I might also believe that you believe that you can discern improvements between Opus 4.6, 4.7, and 4.8. But conclusively, consistently, scientifically, and blindly discerning improvement is at this point restricted to problem domains that represent a vanishingly small amount of global token usage, like Erdős problems, superhuman evals, and the like. The idea that typical line-of-business use-cases have seen broad and measurable improvements since even Opus 4.5 but certainly 4.6 is mostly an illusion that confuses improvements in the harness for improvements in the model, as well as confuses "it's different" for "it's better".

To be clear, again, cannot stress this enough: I am NOT saying that the models have hit a limit. I am saying that the complexity of the problems most businesses throw at them has always had a limit. The models are now so intelligent that we have not, as of yet, adapted our business use-cases to make use of the new levels of intelligence. Maybe we will.


I see this sentiment occasionally brought up, and at the same time see what’s happening to GitHub where the majority of their distributions is not security or efficiency related (not saying it’s because of LLMs, we don’t know). The point is, these things matter beyond beautiful code. You lose trust and you lose customers and money.

> Feelings aren’t fact... much of it has to do with rapid inflation and "continued vibes".

Oh the lost irony.


Is it ironic? Or did you just read the comment incorrectly?

There’s this trend that tries to sell the idea that, if LLMs and agents have any shortcomings, instead of them getting better we should lower the standards. Focus on the “MTTR”. Is the code bad? Don’t read it. Don’t review it. Remove the bottleneck (the human in the loop). This narrative is all over the place.

This tech is quite useful, and I wish we focused on how to work with the tool better and improved our processes around it instead of treating the symptoms.


Aren't people working on both? I'm sure the AI labs are working on their end of it. People are building better agents. You can work on skills or tweaking your AGENTS.md.

There's no end to what you can do, but the question is how much time you devote to that versus your actual project.


I sometimes wonder if we’re losing HN for good.

> We can debate as to how successful we’ve been toward the two goals above

No, not really. These are separate questions from what the article posits. The argument is about how we use these tools, our approach as developers, and whether the results are going to be as rosy as advertised.


Yeah, because they are not autoregressively generated!

> PRs should be plan files, not code. Impl is trivial.

Doesn’t it bother you that the outcome of each PR is different every time you/CI “run it”?


No, because consider the pre-AI status quo where a human PR will come in like "Added tab support", maybe scribbles out some guiding ideas, maybe references some issue where we kinda hashed out some ideas of how it could work, and then we must derive all of the intentions/assumptions/decisions of the implementor from the PR's code changes.

Basically zero plan. Or rather, the "internal" plan that the human implementor used while writing the code is hidden from us because it's a mix of ideas they held in their head, jotted in some notes, existed in a sequence of commits that were lost when squashed into a PR, etc. There's zero reproducibility in the implementation.

So take my idea and pretend we still don't have AI yet: the main point is that we move to a pipeline where we work on a first-class plan first before we begin implementation. This gets us closer to reproducible implementation no matter who is implementing it.

It just so happens that now with implementation becoming automated, we have more attention and energy freed up to focus on this plan-based model.


I don’t know why, but whenever someone uses “insanely” or “shockingly” along with AI, I get the feeling that they’re a bot or are writing based on a guideline! No offense, btw, I’m not saying you’re a bot.

I'm prepared to excise the word "genuinely" from my vocabulary after working with Claude.

One of my biggest fears with using AI at work is that I will subconsciously start talking and writing like a bot, despite making conscious efforts to do the opposite. Just like how when you read a lot of books by one author, their style infects your own writing style.


You’re absolutely right!

Kidding, nah, no worries. I do worry people will become overly paranoid about bots as time passes.



The people you see in the TV are not actually in the TV box. It looks real until you try to shake one’s hand. It’s kind of the same thing with AI (reasoning and whatnot).

I don't think it matters if the reasoning is philosophically "real" if it can solve real problems.

If you read my analogy in the context of the article, it should be clearer what I meant.

I think it would be even more clear if you just write what you mean.
