Most framework vendors don’t have an incentive to make things less obscure. The agent framework is free/open source and they make money primarily from selling observability products for agents. Even if they don’t intentionally obscure things, they just don’t have the motivation to optimize that part.
I'm personally interested in this problem, and it's quite an active research area right now.
My feeling is that the research is converging on what the paper claims: that the combination of the two is the right way to do it, and that what makes the difference is how you combine them in the harness you build.
At the recent AID-Wild / ACM CAIS 2026 workshop, there were plenty of examples of that among the accepted papers.
A great example is AI-PROPELLER: Warehouse-Scale Interprocedural Code Layout Optimization with AlphaEvolve. It uses AlphaEvolve and Vizier to evolve compiler code-layout heuristics. (https://arxiv.org/abs/2606.00131)
The combination approach jibes well with my use of the models in a number of areas. I guide models to use best-in-class algorithmic approaches as available. (E.g. using constraint solvers for a particular problem where pure Monte Carlo rarely gives "in-bounds" data.)
I find it odd that frontier models often don't suggest the most powerful methods for crushing problems, but it may be that the training data doesn't actually have "good enough" experts on the problems I encounter. If the experts don't know about the best ways to solve the problem, they'll get dinged in training for even trying.
Do you enumerate the options of the algorithms to the models? I've tried to do "algorithmic discovery" with these systems, e.g. openevolve, and to be honest the models didn't really focus on that part.
Instead they focused more on optimizing the algorithm that was already implemented. Maybe it's an artifact of the problem I was throwing at them (I was asking them to optimize the implementation of select_k in Arrow, which currently uses a max-heap streaming algorithm).
I've started documenting my journey with this here: https://www.kostasp.net/posts/16-ai-experiments-apache-arrow
in case you want to take a look. Any advice would be highly appreciated, I'm looking for more inspiration on how to torture myself with that stuff.
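For anyone not familiar with the max-heap streaming approach to select_k mentioned above, here is a minimal Python sketch of the general technique. This is not Arrow's implementation; the function name `select_k_smallest` is made up for illustration. It keeps only the k best values seen so far in a bounded heap, so memory stays at O(k) and each new value costs at most O(log k).

```python
import heapq
from typing import Iterable, List


def select_k_smallest(values: Iterable[float], k: int) -> List[float]:
    """Return the k smallest values in ascending order (hypothetical helper)."""
    if k <= 0:
        return []
    # heapq is a min-heap, so negated values are stored to emulate a max-heap:
    # -heap[0] is then the largest of the k values currently kept.
    heap: List[float] = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:
            # v beats the current worst kept value, so swap it in.
            heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap)


print(select_k_smallest([5, 1, 9, 3, 7, 2], 3))  # [1, 2, 3]
```

Tuning this loop is the "optimize the existing implementation" path. Swapping in a different family of algorithm, such as a partition-based selection like quickselect, is the kind of change I would call algorithmic discovery.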
This is really neat. I’m working on something similar but for data artifacts, not just code. It’s very encouraging to see that this kind of tooling helps both humans and models; that was what made me start working on it.
Thanks! The data artifacts angle is really interesting. In some ways the problem is even harder there because data pipelines have less explicit structure than code, I guess.
The artifacts themselves have more structure, but diffing is hard because of size: what exactly do you show in the diff? Row-level? Summary statistics? How do you keep it from getting slow on bigger datasets?
Then there are plots saved as images which have basically no structure at all exposed.
Row level and summary stats are both diffs over values that can tell you that something changed but not whether the *meaning* has changed. What I'm working on is providing more information on how the meaning changes.
The questions I'd like to answer with the diffing are more like: will the grain go from one-row-per-user to one-row-per-user-per-day, will a key stop being unique, will a join start fanning out and quietly double a measure, will something additive become non-additive.
This diff is over structure, but that structure is latent in the transformation that produces it. To make things harder, if a declarative language is being used (e.g. SQL), the code doesn't even describe how things get done, only what the output should be.
What I've ended up doing is recovering the structure from the code by analyzing it and then using *cheap* profiling rather than a full row compare.
As an example, my equivalent impact sub-command output would be something like this: "this change makes account_id non-unique three models downstream"
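To make the kind of check described above concrete, here is a rough Python sketch of the profiling half only: testing whether a key stays unique and whether a join fans out. It is not my tool's code; `is_unique`, `max_fanout`, and `grain_report` are hypothetical names, and the rows are toy data. The code-analysis half, which recovers the structure from the SQL itself, is not shown.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence

Row = Dict[str, Hashable]


def is_unique(rows: Sequence[Row], key_cols: Sequence[str]) -> bool:
    """True if no combination of key_cols appears on more than one row."""
    counts = Counter(tuple(r[c] for c in key_cols) for r in rows)
    return all(n == 1 for n in counts.values())


def max_fanout(left: Sequence[Row], right: Sequence[Row], on: str) -> int:
    """Largest number of right rows matched by any single left row."""
    right_counts = Counter(r[on] for r in right)
    return max((right_counts.get(l[on], 0) for l in left), default=0)


def grain_report(before: Sequence[Row], after: Sequence[Row],
                 key_cols: Sequence[str]) -> List[str]:
    """Describe structural changes instead of value-level differences."""
    findings = []
    if is_unique(before, key_cols) and not is_unique(after, key_cols):
        findings.append(f"{', '.join(key_cols)} stops being unique")
    return findings


# Toy data: one row per account, then joined to one row per account per day.
accounts = [{"account_id": 1, "revenue": 10}, {"account_id": 2, "revenue": 20}]
days = [{"account_id": 1, "day": "d1"}, {"account_id": 1, "day": "d2"},
        {"account_id": 2, "day": "d1"}]
joined = [{**a, **d} for a in accounts for d in days
          if a["account_id"] == d["account_id"]]

print(grain_report(accounts, joined, ["account_id"]))
print(max_fanout(accounts, days, "account_id"))
# The additive measure is quietly inflated by the fan-out:
print(sum(r["revenue"] for r in accounts), sum(r["revenue"] for r in joined))
```

Checks like these only need key counts, so they can run on a sample or on aggregated counts rather than a full row-by-row compare.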
There is still no good "data diff" tool that I can run on, say, a big pile of CSV or Parquet. Something with DVC integration would be especially welcome.
I would imagine that's because at the scales where most folks use Parquet files, you’re generally no longer really thinking in terms of individual diffs to your data (and Parquet also implies some level of batch processing, vs. e.g. a DB).
We have some custom data diff tools at my ultracorp that provide a browsable interface, but the customer tends to be more operations folk than engineers or DS etc who would be more familiar with actual version control concepts. But these work against the data store and not on something like csv or parquet.
Sorta? Maybe I'm weird. I tend to use Parquet files inside my project instead of reading directly from and writing directly to our data warehouse. That lets me cut out a lot of overhead spent on just waiting for data to flow over the network, and also as a side benefit lets me track everything with DVC, which itself has a lot of benefits like being able to summon all project data with `dvc pull`.
I consider that a completely distinct use case from, say, Iceberg tables in S3.
Curious to see when a post from OpenAI will appear with the corrected theory or something. This seems to be an ideal scenario for them to go after another scientific case. They have the theory, they have the experimental proof that it’s wrong, exactly what you need for an agentic loop to do its work.
Or maybe what works in math doesn’t work with chemistry?
I don’t think the flex here is the amount of code alone. Their goal is to show that AI can improve productivity; the number of lines is just the proxy for that. This article is a marketing piece after all.
Now someone can argue that lines of code are not a good proxy for engineering productivity, but I wouldn’t be surprised if the audience they target with this content is not the HN commenters of this thread.
Correct on the first part, partially correct on the second. LOC is a bad metric, but it is at least a legible one. Lots of people working on better ways to measure Software Productivity!
IIUC, the most basic version is when you have a log where every entry has both “date added” and “effective date,” so you can add stuff to the log retroactively. For example, “the user just informed us yesterday that they moved last year” -> address date added=yesterday, date effective=last year
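A minimal Python sketch of that two-date log, assuming an in-memory list and made-up names (`AddressEntry`, `address_as_of`) and addresses. The point it illustrates is that a lookup takes two dates: the date you are asking about and the date that bounds what you knew.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class AddressEntry:
    address: str
    effective: date  # when the fact became true in the real world
    added: date      # when it was written to the log


def address_as_of(log: List[AddressEntry], effective_on: date,
                  known_on: date) -> Optional[str]:
    """Address valid on effective_on, using only entries recorded by known_on."""
    candidates = [e for e in log
                  if e.added <= known_on and e.effective <= effective_on]
    if not candidates:
        return None
    # The latest effective date wins; ties go to the most recently added entry.
    return max(candidates, key=lambda e: (e.effective, e.added)).address


log = [
    AddressEntry("1 Old St", effective=date(2020, 1, 1), added=date(2020, 1, 1)),
    # Reported late: the move happened in mid-2024 but was logged in March 2025.
    AddressEntry("2 New Ave", effective=date(2024, 6, 1), added=date(2025, 3, 9)),
]

# Same question about December 2024, asked with two different states of knowledge.
print(address_as_of(log, date(2024, 12, 1), known_on=date(2025, 1, 1)))   # 1 Old St
print(address_as_of(log, date(2024, 12, 1), known_on=date(2025, 3, 10)))  # 2 New Ave
```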
I have a similar setup in Orgzly (kinda in Emacs too, but it's buggy and not as useful there) where a note has a "created time" property that's always automatically applied. And then there's the "closed time" applied when I set the note's state to "done", which I sometimes modify depending on what the note is for and thus what "done" means.
SQL, JS, Excel are really hard to substitute because of how widely used they are. Even when something new comes along that's objectively better, so far it has always failed to gain traction because of this reality.
I wonder though, is such a dialect better for agents? Have you tried to measure if an agent performs better expressing queries in such a language instead of SQL?
Claude had no problem translating SQL into Prela, and because you have fine-grained control over the query plan (a Prela query is a plan), it was able to optimize queries to be very fast.
I'm more curious about going from text to Prela instead of going from text to SQL and measuring any difference in the performance there. On one hand, models have been trained on a lot of SQL; on the other hand, they are really good at mathematical reasoning too, so thinking in Prela might be a natural fit for them.
Yes, maybe not the language itself, but the ideas behind it. Tarski's Algebra of Relations is actually a better model for modern column stores than the standard relational algebra, because a column is a binary relation from the primary key into its value.
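A toy Python illustration of the "column as binary relation" idea, assuming each column is stored as a set of (primary key, value) pairs. The helper names and data are made up, and this is not how Prela or any real column store represents data. It only shows that two columns can be related to each other with the converse and composition operations of the algebra of relations, without ever building a row.

```python
from typing import Hashable, Set, Tuple

# A binary relation is a set of (left, right) pairs.
Relation = Set[Tuple[Hashable, Hashable]]


def converse(r: Relation) -> Relation:
    """Swap the two sides of every pair."""
    return {(b, a) for a, b in r}


def compose(r: Relation, s: Relation) -> Relation:
    """Relate a to c whenever some b has (a, b) in r and (b, c) in s."""
    return {(a, c) for a, b in r for b2, c in s if b == b2}


# Two columns of a hypothetical table, each a relation from primary key to value.
name: Relation = {(1, "ann"), (2, "bob")}
city: Relation = {(1, "Oslo"), (2, "Rome")}

# "Which city goes with which name?" is the converse of name composed with city.
name_to_city = compose(converse(name), city)
print(sorted(name_to_city))  # [('ann', 'Oslo'), ('bob', 'Rome')]
```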
It would be pretty easy to put a DuckDB data source into this code.
It might be pretty easy to use overloading to get special case implementations that form SQL queries progressively until the results need to be materialized as something like a dataframe for the function code to work on.
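A rough Python sketch of what that overloading could look like, with hypothetical class names (`LazyTable`, `Col`, `Pred`) that are not taken from the project being discussed. Indexing and comparison operators only accumulate pieces of a query; nothing is sent to a database until `to_sql()` is called at the point where the results need to be materialized. Identifiers are interpolated without escaping, so treat it as an illustration only.

```python
from dataclasses import dataclass
from typing import Tuple


def _lit(v: object) -> str:
    """Render a Python value as a SQL literal (strings get quoted)."""
    if isinstance(v, str):
        return "'" + v.replace("'", "''") + "'"
    return str(v)


@dataclass(frozen=True)
class Pred:
    sql: str

    def __and__(self, other: "Pred") -> "Pred":
        return Pred(f"({self.sql}) AND ({other.sql})")


@dataclass(frozen=True)
class Col:
    name: str

    # Comparisons build predicates instead of evaluating anything.
    def __gt__(self, v: object) -> Pred:
        return Pred(f"{self.name} > {_lit(v)}")

    def __lt__(self, v: object) -> Pred:
        return Pred(f"{self.name} < {_lit(v)}")


@dataclass(frozen=True)
class LazyTable:
    name: str
    columns: Tuple[str, ...] = ()
    predicates: Tuple[Pred, ...] = ()

    def __getitem__(self, item):
        if isinstance(item, Pred):   # t[pred] adds a filter
            return LazyTable(self.name, self.columns, self.predicates + (item,))
        if isinstance(item, str):    # t["col"] returns a column reference
            return Col(item)
        return LazyTable(self.name, tuple(item), self.predicates)  # projection

    def to_sql(self) -> str:
        """Emit the accumulated query; this is the materialization boundary."""
        cols = ", ".join(self.columns) or "*"
        sql = f"SELECT {cols} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p.sql})" for p in self.predicates)
        return sql


t = LazyTable("orders")
q = t[t["amount"] > 100][["id", "amount"]]
print(q.to_sql())  # SELECT id, amount FROM orders WHERE (amount > 100)
```

In a real version, the string from `to_sql()` would be handed to an engine such as DuckDB and the result fetched as a dataframe, so that only the function code that truly needs rows pays for materialization.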
Replicating the Postgres WAL to S3 and Iceberg reliably is a hard problem but it’s not accurate to say that no ETL is needed here.
Maybe you can say it’s more of an ELT pattern, but anyone who’s interested in using this for realistic analytics will have to transform the data at some point.
If an org is early enough to think that they can use a solution like this and just get in DuckDB and start spitting out reports, they will be in for a really bad experience.
Please educate people to do the right thing and realize the scope of the work they are facing, it might feel that it hurts your growth in the short term but it will benefit you greatly in the mid-long term as a vendor.
IDK, AWS Zero ETL from Aurora into Redshift really helped us at some point. You're right that data transformation is very limited, if not impossible. But having data in an analytical store, being able to experiment with queries, understand what is wrong with your OLTP schema and then build ETL is way better than doing an upfront design.
Of course it is. What you describe is one of the reasons that ELT became popular, if you couple it with a variant type and schema on read, you have a very powerful and flexible architecture.
But there’s no free lunch: building and maintaining data infrastructure that is reliable requires work. Many companies don’t realise that when they start their analytical journey, and aggressive marketing doesn’t help. That’s the point I was trying to make.
I don’t disagree, just placing emphasis on a different aspect.
In an ideal world there is a tool that moves your schema into an analytical store “as is” with a single click. Then the same tool lets you add arbitrary transformations of the data. Surprisingly I have not come across such a tool. It is either “one click to move your data” or “any transformation you want” but only after a significant upfront investment :(
I think I didn’t articulate myself very well in my reply. I actually wanted to say that I agree with you and emphasise again the need to educate users about the complexity of these projects.
What you describe has been pitched by many different products for different parts of the data platform. Fivetran for example claims to do that for the extraction and loading part, good old Informatica was offering the ETL in a graphical interface etc.
The problem that many teams ended up having is the explosion of the tooling needed by data teams.
The comments are definitely not worth reading. It’s a very sad thread; you literally had to go through all of them to find one that wasn’t about hate and stated some facts about the issues with the code.
I found them worth reading because the following set of thoughts came up:
- programmers had problems with delivering quality long before LLMs
- a lot of research and tooling went into that, bringing us {Git, libraries, VSCode, reviews, …} but the human factor stayed the same (and more pronounced imho than in other fields of engineering)
- LLMs democratized programming, enhancing a few while lowering the floor to no-skill programming
- the tools and practices created for the quality problems from the past turn out to be wholly incapable of maintaining quality in the present
The main problem behind this is that those delivering the QA tools of the past are central in the AI race. Old school engineering would separate these concerns.