Your company’s data isn’t nearly as valuable for training models as you might think.
Certainly not as valuable as the revenue you can make from companies that would instantly cancel their Copilot 365 subscriptions if they heard any hint of data being used for training without permission.
> Convincing people that you don’t train on their data remains one of the hardest problems:
we attempted to protect our valuable data with copyright
they disregarded these terms, trained on it anyway and claim wholesale reproduction of our work is "fair use"
why wouldn't they do the same with Teams/SharePoint/Word/everything on Azure?
because the contract with a company 10000x our size says they won't? HAHAHAHAHA
the only way to protect your data from entities that have previously disregarded terms in this way is to not let them get their dirty hands on it in the first place
Did you read https://simonwillison.net/2023/Dec/14/ai-trust-crisis/ ? Because your comment here is a textbook example of what I was talking about there, right up to the bit where you say "you can't trust them because they've already shown they'll train on unlicensed scraped copyrighted data" (a very reasonable point to argue).
This exactly. If Microsoft had created its own fine-tuned-MSFT-data LLM and seen vastly better results on internal tasks, then they’d be publishing papers about it, and also packaging that up & selling to customers.