
This is interesting:

> Autoencoder family

> Note: Only 65536 features available. Activations shown on The Pile (uncopyrighted) instead of our internal training dataset.

So, the Pile is uncopyrighted, but the internal training dataset is copyrighted? Copyrighted by whom?

Huh?



> Copyrighted by whom?

By people who would get angry if they could definitively prove their stuff was in OpenAI's training set.


Hehe, related to this, someone created a "book4" dataset and put it on torrent websites. I don't think it's being used in any major LLMs, but the future intersection of the "piracy" community with AI is going to be exciting.

Watching the cyberpunk world that all of my favorite literature predicted slowly come to our world is fun indeed.


I think you mean @sillysaurus' books3, not books4?



Basically everyone. You, and me, and Elon Musk, and EMPRESS, and my uncle who works for Nintendo. They're just hoping that AI training legally ignores copyright.


When you can ask an AI for an entire book with no errors in the output… god that would be a huge token model


Copyright violation isn't just when you can output 100% exact copies of books. And don't forget, they also violated copyright internally billions of times during training. If any of us had been caught making copies of corporate-owned content for AI training use five years ago, we'd be in for zillion-dollar lawsuits that would make any grandma who downloaded a song from Napster blush.


There is a very good argument to be made that training AI is fair use, as it is both transformative and does not compete with the original work. This has yet to be tested in court.


If you copy your CD as a backup with no intent to resell it, no one would waste time suing you.


Because they wouldn't catch me. But if they did, especially if they caught me making a copy of every CD at the CD store as a backup, especially if they caught me making a copy of every bootleg CD I could get my hands on (as a backup), I'd be in big trouble.

Did you know a lot of LLM training data is scraped from illegal pirate libraries such as Anna's Archive?



