If the output is substantially similar to GPL’d training data it may be infringi...

pen2l · on July 11, 2024

Suspend all knowledge of copyright law as it exists today for a moment and approach this hypothetical from first principles: a lot of GPL copyleft data is used to make an AI tool that, when asked, can itself recreate code similar to what was input. The creator of that AI tool also reaps all the profits without giving a single penny, or even recognition of the value it guzzled from the GPL data it was trained on, to the creators of the original copyleft code. Is this fair? What do your scruples tell you?

No, of course not. We should probably revisit copyright law, given that it was written at a time when no one foresaw modern AI tools, their capabilities, and their effects on creators and societies.

zarzavat · on July 12, 2024

Have you used Copilot? It is generally not creating code similar to GPL code; it is creating code similar to the surrounding context file.

Transformers predict the most likely next token, and the most likely next token is usually related to the surrounding context.

So yes, it can create code similar to GPL code, but it can only do that consistently when the GPL code is included in the context. So don’t do that.
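To make the "most likely next token follows the context" point concrete, here is a toy sketch. It is not a transformer: it is a plain bigram counter, and `predict_next`, `context_weight`, and the sample token lists are all made up for illustration. The only thing it shows is that once the open file is weighted heavily, the prediction tracks that file rather than the training corpus.

```python
from collections import Counter, defaultdict


def bigram_counts(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(prev_token, training_tokens, context_tokens, context_weight=5):
    """Return the highest-scoring next token, or None if nothing is known.

    context_weight is a hypothetical knob: how much more one occurrence in
    the surrounding file counts than one occurrence in the training corpus.
    """
    train = bigram_counts(training_tokens)
    ctx = bigram_counts(context_tokens)
    scores = Counter()
    for tok, n in train[prev_token].items():
        scores[tok] += n
    for tok, n in ctx[prev_token].items():
        scores[tok] += context_weight * n
    if not scores:
        return None
    return scores.most_common(1)[0][0]


# Stand-in for "training data": `return result` is the most common pattern.
training = "return result ; return result ; return value ;".split()
# Stand-in for "the file you have open": it uses `return total`.
context = "return total ;".split()

no_context_guess = predict_next("return", training, [])
with_context_guess = predict_next("return", training, context)
```

With an empty context the guess falls back to the corpus favourite; with the context supplied, the single in-file occurrence outweighs it.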

CuriousSkeptic · on July 11, 2024

The GPL was never about money, recognition, or even about the creators at all. Copyleft was created “to promote computer user freedom”.

Free Software already views all proprietary software as inherently immoral. So there is no need to take a detour through what went into making the software to reach that conclusion from that angle.

kimixa · on July 11, 2024

Indeed, that's why I said:

>"How much change is enough" has always been a gray area for courts and humans to decide.

But Copilot has been shown to generate chunks of sufficient size and specificity that, to me as a layman, it very much feels like "copied GPL code". My boss agrees too: we have a blanket ban on generative AI tools at work because they're not considered worth the risk.

londons_explore · on July 11, 2024

> has been shown to generate chunks of sufficient size and specificity

Only when given chunks of copyrighted code as input. I don't think anyone has demonstrated big chunks of copyrighted code in the output when copyrighted code isn't present in the query/context.

In fact, I suspect Microsoft specifically filters the output for that.
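For what such a filter could look like in principle, here is a rough sketch. It is purely hypothetical and says nothing about how any real product does it: `token_ngrams`, `shares_long_run`, the whitespace tokenizer, and the window size `n=8` are all assumptions picked for illustration. The idea is just to flag output that shares a long verbatim run of tokens with known source text.

```python
def token_ngrams(text, n):
    """Return the set of all runs of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shares_long_run(output, known_sources, n=8):
    """True if output shares any run of n consecutive tokens with a known source.

    n is a hypothetical threshold: smaller values flag more (including
    boilerplate everyone writes), larger values only catch long copies.
    """
    out = token_ngrams(output, n)
    return any(out & token_ngrams(src, n) for src in known_sources)


# Made-up "known" snippet and two made-up generations.
known = ["for i in range ( n ) : total += values [ i ]"]
copied = "x = 0 for i in range ( n ) : total += 1"
unrelated = "print ( 'hello' )"

copied_flagged = shares_long_run(copied, known)
unrelated_flagged = shares_long_run(unrelated, known)
```

A real filter would need a proper tokenizer and an index over a huge corpus rather than a list scan, but the "how long a verbatim run is too long" threshold is the same gray area discussed upthread.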