Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

For all the Model Cards and License notices, I find it interesting there is not much information on the contents of the dataset used for training. Specifically, if it contains data subject to Copyright restrictions. Or did I miss that?


Yeah, its an unspoken but rampant thing in the llm community. Basically no one respects licenses for training data.

I'd say the majority of instruct tunes, for instance, use OpenAI output (which is against their TOS).

But its all just research! So who cares! Or at least, that seems to be the mood.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: