If the output is substantially similar to GPL’d training data it may be infringing. Nobody disputes this.
However, copyright isn’t cooties. If the output is not similar, then it is not infringing regardless of how much GPL’d training data was used to generate it.
Suspend all knowledge of copyright law as it exists today for a moment and approach this hypothetical from first principles: a lot of GPL copyleft data is used in the making of an AI tool that, when asked, can itself recreate code similar to what was input... also, the creator of that AI tool will reap all the profits without giving a single penny, or even recognition of the value it guzzled from the GPL data it was trained on, to the creators of the original copyleft code. Is this fair? What do your scruples tell you?
No, of course not. We should probably revisit copyright law, given that it was written at a time when no one foresaw modern AI tools, their capabilities, and their effects on creators and societies.
The GPL was never about money, recognition, or even about the creators at all. Copyleft was created “to promote computer user freedom”.
Free Software already views all proprietary software as inherently immoral. So there is no need to take a detour through what went into making the software to reach that conclusion.
>"How much change is enough" has always been a gray area for courts and humans to decide.
But Copilot has been shown to generate chunks of sufficient size and specificity that, as a layman, it very much feels like "copied GPL code". And my boss agrees too - we have a blanket ban on generative AI tools in our work because it's not considered worth the risk.
> has been shown to generate chunks of sufficient size and specificity
Only when given chunks of copyrighted code as input. I don't think anyone has demonstrated big chunks of copyrighted code in the output when copyrighted code isn't present in the query/context.
In fact, I suspect Microsoft specifically filters the output for that.