efromvt's comments

Been working on optimizing CLIs for cheap agent use and figuring out how to build integrated agentic features that aren’t a full chat interface. Agent UX optimization is kind of fun! Much more testable than human UX, though it’ll be interesting to see how much generalizes across model families.

Been doing this to improve/simplify the grammar for Trilogy[1], a streamlined SQL language - I’ve been planning a redo of one feature and it’s nice to be able to rapidly get feedback on various syntax success rates. Also been particularly useful to optimize error messages, which should help people too.

[1] https://trilogydata.dev/


Fantastic to have PyO3/Maturin guides too - the rust/python/typescript turducken I’ve always wanted.

Repetition of "belt-and-suspenders" kills me with Opus, especially because it always means the model is suppressing something I would want to be an actual failure.

I think the perception is that it is not 'only marginally better'; whether or not you specifically agree, that perceived quality gap lets them differentiate on price.

I'd further say that there are probably enough rational actors running evals out there that the 'marginally better' is not pure vibes for the cases where people are spending lots of money, but I only have direct line of sight to some of those eval suites. Maybe everyone is irrational and Anthropic is exploiting that!


I think you can sympathize with the safety motives while still thinking it was a dumb implementation choice to degrade silently? I actually have faith in them getting the guardrail triggers pretty good, but the consensus seems to be that they’re not there yet.

I think it is clear given the stakes why you would not want to make your guardrails probe-able/invertible.

The openrouter provider flakiness with deepseek was infuriating, but I’m happy in hindsight because direct deepseek has been very pleasant. Shocked by how low spend is.

I do slightly prefer 5.5 for complex work but Claude quota usage has gotten infinitely better since the dark days a few months back - has gone from being infuriating to something I pretty much don’t have to worry about with it as a daily driver. (In fact, hitting GPT weekly quotas is more annoying now). Understand if people are still scarred by the issues + poor comms around them, though.

That's good to hear. It was legitimately unusable back when 4.7 was released, so I had no choice at the time. I'm sure I'll ping pong back again at some point.

I'd be very curious about the bottleneck breakdown in most current software dev - I suspect inference is far from the bottleneck in most things I do, though driving it to 0 would still be nice. I do agree that if it was 0 we'd probably change development approaches to reduce the new bottlenecks more, but it'll take full-process innovation to really get something near-instant.

(I should go measure this now, I'm curious)


Deepseek cost/performance is incredible. That said, I still feel like for agentic coding we haven't plateaued (I slightly prefer GPT 5.5 to Claude for complex stuff, to be honest), and so the extra price is absolutely worth it to push you over the 'impossible' to 'feasible' bar on complex tasks. Once you're in a domain that Deepseek can handle and that requires volume, though, I would almost always default to it now.

For evals in particular (tuning workflows that agents are using), effectively not having to worry about price is an incredible multiplier - getting statistically significant signal is not cheap otherwise.
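
To put a rough number on why that signal is expensive, here is a minimal sketch of the standard two-proportion sample-size approximation. The function name, the 70% vs 75% success rates, and the alpha/power defaults are all illustrative assumptions, not figures from any real eval suite.

```python
import math
from statistics import NormalDist


def trials_per_variant(p_a: float, p_b: float,
                       alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate agent runs needed per variant to tell success rate
    p_a from p_b with a two-sided two-proportion z-test.

    Hypothetical helper for illustration; alpha and power are the
    conventional defaults, not values taken from the comment above.
    """
    if p_a == p_b:
        raise ValueError("success rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p_a + p_b) / 2                        # pooled rate under the null
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    ) ** 2
    return math.ceil(numerator / (p_a - p_b) ** 2)


if __name__ == "__main__":
    # Assumed example: does a prompt/syntax tweak move success from 70% to 75%?
    print(trials_per_variant(0.70, 0.75))
```

Under those assumptions a five-point improvement needs on the order of 1,250 runs per variant, and halving the gap roughly quadruples that, which is where per-run price starts to dominate.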


There's always a bulk insert, but I wouldn't say every engine has always had a reasonable way to bulk load truly large data... parquet really helped with interop, but before that, when your best option was a CSV and bcp, life was not fun.
