Hehe, related to this, someone created a "book4" dataset and put it on torrent websites. I don't think it's being used in any major LLMs, but the future intersection of the "piracy" community with AI is going to be exciting.
Watching the cyberpunk world that all of my favorite literature predicted slowly come to our world is fun indeed.
Basically everyone. You, and me, and Elon Musk, and EMPRESS, and my uncle who works for Nintendo. They're just hoping that AI training turns out to be legally exempt from copyright.
Copyright violation isn't just when you can output 100% exact copies of books. And don't forget, they also violated copyright internally billions of times during training. If any of us had been caught making copies of corporate-owned content for AI training use five years ago, we'd be in for zillion-dollar lawsuits that would make any grandma who downloaded a song from Napster blush.
There is a very good argument to be made that training AI is fair use, as it is both transformative and does not compete with the original work. This has yet to be tested in court.
Because they wouldn't catch me. But if they did, especially if they caught me making a copy of every CD at the CD store as a backup, especially if they caught me making a copy of every bootleg CD I could get my hands on (as a backup), I'd be in big trouble.
Did you know a lot of LLM training data is scraped from illegal pirate libraries such as Anna's Archive?
> Autoencoder family
> Note: Only 65536 features available. Activations shown on The Pile (uncopyrighted) instead of our internal training dataset.
So, the Pile is uncopyrighted, but the internal training dataset is copyrighted? Copyrighted by whom?
Huh?