I think there's been a lot of fear-mongering on this topic and "the inevitable L...

m_ke · on Dec 16, 2024

Those models will be here within a year.

Long context is practically a solved problem and there's a ton of work now on test time reasoning motivated by o1 showing that it's not that hard to RL a model into superhuman performance as long as the task is easy / cheap to validate (and there's works showing that if you can define the problem you can use an LLM to validate against your criteria).

angoragoats · on Dec 16, 2024

I intentionally glossed over a lot in my first comment, but I should clarify that I don't believe that increased context size or RL is sufficient to solve the problem I'm talking about.

Also "as long as the task is easy / cheap to validate" is a problematic statement if we're talking about the replacement of senior software engineers, because problem definition and development of validation criteria are core to the duties of a senior software engineer.

All of this is to say: I could be completely wrong, but I'll believe it when I see it. As I said elsewhere in the comments to another poster, if your points could be expressed in easily testable yes/no propositions with a timeframe attached, I'd likely be willing to bet real money against them.

m_ke · on Dec 16, 2024

Sorry I wasn't clear enough, the cheap to validate part is only needed to train a large base model that can handle writing individual functions / fix bugs. Planning a whole project, breaking it down into steps and executing each one is not something that current LLMs struggle at.

Here's a recipe for a human level LLM software engineer:

1. Pretrain an LLM on as much code and text as you can (done already)

2. Fine tune it on synthetic code specific tasks like: (a) given a function, hide the body, make the model implement it and validate that it's functionally equivalent to the target function (output matching), can also have an objective to optimize the runtime of the implementation (b) introduce bugs in existing code and make the LLM fix it, (c) make LLM make up problems, write tests / spec for it, then have it attempt to implement it many times until it comes up with a method that passes the tests, (d-z) a lot of other similar tasks that use linters, parsers, AST modifications, compilers, unit tests, specs validated by LLMs, profilers to check that the produced code is valid

3. Distill this success / failure criteria validator to a value function that can predict probability of success at each token to give immediate reward instead of requiring full roll out, then optimize the LLM on that.

4. At test time use this final LLM to produce multiple versions until one passes the criteria, for the cost of an hour of a software engineer you can have an LLM produce millions of different implementations.

See papers like: https://arxiv.org/abs/2409.15254 or slides from NeurIPS that I mentioned here https://news.ycombinator.com/item?id=42431382

angoragoats · on Dec 16, 2024

> At test time use this final LLM to produce multiple versions until one passes the criteria, for the cost of an hour of a software engineer you can have an LLM produce millions of different implementations.

If you're saying that it takes one software engineer one hour to produce comprehensive criteria that would allow this whole pipeline to work for a non-trivial software engineering task, this is where we violently disagree.

For this reason, I don't believe I'll be convinced by any additional citations or research, only by an actual demonstration of this working end-to-end with minimal human involvement (or at least, meaningfully less human involvement than it would take to just have engineers do the work).

edit: Put another way, what you describe here looks to me to be throwing a huge number of "virtual" low-skilled junior developers at the task and optimizing until you can be confident that one of them will produce a good-enough result. My contention is that this is not a valid methodology for reproducing/replacing the work of senior software engineers.

m_ke · on Dec 16, 2024

That's not what I'm saying at all. I'm saying that there's a trend showing that you can improve LLM performance significantly by having it generate multiple responses until it produces one that meets some criteria.

As an example, huggigface just posted an article showing this for math, where with some sampling you can get a 3B model to outperform a 70B one: https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling...

Formalizing the criteria is not as hard as you're making it out to be. You can have an LLM listen to a conversation with the "customer", ask follow up questions and define a clear spec just like a normal engineer. If you doubt it open up chatGPT, tell it you're working on X and ask it to ask you clarifying questions, then come up with a few proposal plans and then tell it which plan to follow.

angoragoats · on Dec 16, 2024

> That's not what I'm saying at all. I'm saying that there's a trend showing that you can improve LLM performance significantly by having it generate multiple responses until it produces one that meets some criteria.

I apologize for misinterpreting what you were saying -- I was clearly taking "for the cost of an hour of a software engineer" to mean something that you didn't intend.

> As an example, huggigface just posted an article showing this for math, where with some sampling you can get a 3B model to outperform a 70B one

This is not relevant to our discussion. Again, I'm reasonably sure that I'm not going to be convinced by any research demonstrating that X new tech can increase Y metric by Z%.

> Formalizing the criteria is not as hard as you're making it out to be. You can have an LLM listen to a conversation with the "customer", ask follow up questions and define a clear spec just like a normal engineer. If you doubt it open up chatGPT, tell it you're working on X and ask it to ask you clarifying questions, then come up with a few proposal plans and then tell it which plan to follow.

This is much more relevant to our discussion. Do you honestly feel this is an accurate representation of how you'd define the requirements for the pipeline you outlined in your post above? Keep in mind that we're talking about having LLMs work on already-existing large codebases, and I conceded earlier that writing boilerplate/base code for a brand new project is something that LLMs are already quite good at.

Have you worked as a software engineer for a long time? I don't want to assume anything, but all of your points thus far read to me like they're coming from a place of not having worked in software much.

m_ke · on Dec 16, 2024

> Have you worked as a software engineer for a long time? I don't want to assume anything, but all of your points thus far read to me like they're coming from a place of not having worked in software much.

Yes I've been a software engineer working in deep learning for over 10 years, including as an early employee at a leading computer vision company and a founder / CTO of another startup that built multiple large products that ended up getting acquired.

> I apologize for misinterpreting what you were saying -- I was clearly taking "for the cost of an hour of a software engineer" to mean something that you didn't intend.

I meant that unlike a software engineer, the LLM can do a lot more iterations on the problem given the same budget. So if your boss comes and says build me new dashboard page it can generate 1000s of iterations and use a human aligned reward model to rank them based on which one your boss might like best. (that's what the test time compute / sampling at inference does).

> This is not relevant to our discussion. Again, I'm reasonably sure that I'm not going to be convinced by any research demonstrating that X new tech can increase Y metric by Z%.

These are not just research papers, people are reproducing these results all over the place. Another example from a few minutes ago: https://x.com/DimitrisPapail/status/1868710703793873144

> This is much more relevant to our discussion. Do you honestly feel this is an accurate representation of how you'd define the requirements for the pipeline you outlined in your post above? Keep in mind that we're talking about having LLMs work on already-existing large codebases,

I'm saying this will be solved pretty soon, working with large codebases doesn't work well right now because last years models had shorter context and were not trained to deal with anything longer than a few thousand tokens. Training these models is expensive so all of the coding assistant tools like cursor / devin are sitting around and waiting for the next iteration of models from Anthropic / OpenAI / Google to fix this issue. We will most likely have announcements of new long context LLMs in the next 1-2 weeks from Google / OpenAI / Deepseek / Qwen that will make major improvements on large code bases.

I'd also add that we probably don't want huge sprawling code bases, when the cost of a small custom app that solves just your problem goes to 0 we'll have way more tiny apps / microservices that are much easier to maintain and replace when needed.

angoragoats · on Dec 16, 2024

> These are not just research papers, people are reproducing these results all over the place. Another example from a few minutes ago: https://x.com/DimitrisPapail/status/1868710703793873144

Maybe I'm not making myself clear, but when I said "demonstrating that X new tech can increase Y metric by Z%" that of course included reproduction of results. Again, this is not relevant to what I'm saying.

I'll repeat some of what I've said in several posts above, but hopefully I can be clearer about my position: while LLMs can generate code, I don't believe they can satisfactorily replace the work of a senior software engineer. I believe this because I don't think there's any viable path from (A) an LLM generates some code to (B) a well-designed, complete, maintainable system is produced that can be arbitrarily improved and extended, with meaningfully lower human time required. I believe this holds true no matter how powerful the LLM in (A) gets, how much it's trained, how long its context is, etc, which is why showing me research or coding benchmarks or huggingface links or some random twitter post is likely not going to change my mind.

> I'd also add that we probably don't want huge sprawling code bases

That's nice, but the reality is that there are lots of monoliths out there, including new ones being built every day. Microservices, while solving some of the problems that monoliths introduce, also have their own problems. Again, your claims reek of inexperience.

Edit: forgot the most important point, which is that you sort of dodged my question about whether you really think that "ask ChatGPT" is sufficient to generate requirements or validation criteria.