Ask HN: Do big tech companies train large language models on our personal data?

smoldesu · on Feb 15, 2023

I don't see why they would. The average American's personal data is less coherent than the median written article on the internet. Unless you have evidence of it knowing much more than it should, it's doubtful your personal data is worth anything besides advertising entropy.

blurbleblurble · on Feb 15, 2023

Maybe to query for intent? Or to test out ad engagement strategies? Or for example: to build a realistic chatbot?

smoldesu · on Feb 15, 2023

If they are doing any of that, they're wasting a whole lot of processing power. Even old text transformers like GPT-2 and BERT are capable of running offline without any of that nonsense. More open models like GPT-Neo can be fully audited to prove that there is no personal data in their training stack.

There might be merit to what you're saying, but again, you've presented no proof of this. Common logic and the currently available technology suggest that's unnecessary.

blurbleblurble · on Feb 15, 2023

You're right that I have no proof, and I'm not suggesting that they are. I'm wondering if anyone else has heard of anything like this happening, especially internally for research purposes.