These models can still be trained on data that they’re allowed to use, but I think what we’re seeing is that the better LLM services are probably trained on shocking amounts of private data, whereas the less performant ones probably don’t use stolen data.
Textbooks are a big one that I suspect we’ll see a set of lawsuits over, particularly because they seem to be some of the most valuable training data.