This is interesting: > Autoencoder family > Note: Only 65536 features available....

Arcsech · on June 6, 2024

> Copyrighted by whom?

By people who would get angry if they could definitively prove their stuff was in OpenAI's training set.

Der_Einzige · on June 6, 2024

Hehe, related to this, someone created a "book4" dataset and put it on torrent websites. I don't think it's being used in any major LLMs, but the future intersection of the "piracy" community with AI is going to be exciting.

Watching the cyberpunk world that all of my favorite literature predicted slowly come to our world is fun indeed.

swyx · on June 6, 2024

I think you mean @sillysaurus' books3, not books4?

Philpax · on June 7, 2024

https://news.ycombinator.com/item?id=40405443

immibis · on June 6, 2024

Basically everyone. You, and me, and Elon Musk, and EMPRESS, and my uncle who works for Nintendo. They're just hoping that AI training legally ignores copyright.

mensetmanusman · on June 6, 2024

When you can ask an AI for an entire book with no errors in the output… god that would be a huge token model

immibis · on June 7, 2024

Copyright violation isn't just when you can output 100% exact copies of books. And don't forget, they also violated copyright internally billions of times during training. If any of us had been caught making copies of corporate-owned content for AI training use five years ago, we'd be in for zillion-dollar lawsuits that would make any grandma who downloaded a song from Napster blush.

Karunamon · on June 7, 2024

There is a very good argument to be made that training AI is fair use, as it is both transformative and does not compete with the original work. This has yet to be tested in court.

mensetmanusman · on June 7, 2024

If you copy your CD as a backup with no intent to resell it, no one would waste time suing you.

immibis · on June 7, 2024

Because they wouldn't catch me. But if they did, especially if they caught me making a copy of every CD at the CD store as a backup, especially if they caught me making a copy of every bootleg CD I could get my hands on (as a backup), I'd be in big trouble.

Did you know a lot of LLM training data is scraped from illegal pirate libraries such as Anna's Archive?