Yeah, this has been confusing me a bit. I'm not complaining by ANY means, but why does it suddenly feel like everyone cares about data privacy in LLM contexts, way more than previous attitudes to allowing data to sit on a bunch of random SaaS products?
I assume because of the assumption that the AI companies will train off of your data, causing it to leak? But I thought all these services had enterprise tiers where they'll promise not to do that?
Again, I'm not complaining, it's good to see people caring about where their data goes. Just interesting that they care now, but not before. (In some ways LLMs should be one of the safer services, since they don't even really need to store any data, they can delete it after the query or conversation is over.)
Laundering of data through training makes it a more complicated case than a simple data theft or copyright infringement.
Leaks could be accidental, e.g. due to an employee logging in to their free-as-in-labor personal account instead of a no-training Enterprise account.
It's safer to have a complete ban on providers that may collect data for training.
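A complete ban is also easier to enforce in tooling than a per-account policy. As a minimal sketch (the provider hostnames and allowlist here are hypothetical, not any real service), a thin gateway can simply refuse requests to any provider that isn't contractually no-training, so an employee can't accidentally route data through a free personal account:

```python
# Minimal sketch of a provider allowlist gate (hypothetical hostnames).
# Idea: deny by default; only providers with a contractual no-training
# guarantee are reachable at all.

NO_TRAINING_PROVIDERS = {
    "enterprise-llm.example.com",  # covered by a no-training contract
}

def check_endpoint(host: str) -> bool:
    """Return True only if the host is on the no-training allowlist."""
    return host in NO_TRAINING_PROVIDERS

def send_prompt(host: str, prompt: str) -> str:
    if not check_endpoint(host):
        raise PermissionError(f"blocked: {host} is not a no-training provider")
    # ... the actual HTTPS call to the provider would go here ...
    return "ok"
```

The point is the default-deny shape, not the lookup: a per-account "use the Enterprise login" rule fails open on human error, while an egress allowlist fails closed.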
Their entire business model is based on taking other people's stuff. I can't imagine someone would willingly drown with the sinking ship just because they promised they would, when the entire cargo hold is filled with lifeboats.
Being caught doing that would be wildly harmful to their business - billions of dollars harmful, especially given the contracts they sign with their customers. The brand damage would be unimaginably expensive too.
There is no world in which training on customer data without permission would be worth it for AWS.
? One single random document, maybe, but in aggregate? I understood some parties were trying to scrape indiscriminately - the "big data" way. And if some of that input is sensitive and ends up stored somewhere in the NN, it may come out in an output - in theory...
Admittedly I never researched the details of this potential phenomenon - that anything personal might be memorized (not just George III but Random Randy) - but it seems possible.
There's a pretty common misconception that training LLMs is about loading in as much data as possible no matter the source.
That might have been true a few years ago but today the top AI labs are all focusing on quality: they're trying to find the best possible sources of high quality tokens, not randomly dumping in anything they can obtain.
> Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.
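For a rough idea of what "focusing on quality" can look like mechanically, here is a toy sketch of the kind of heuristic pre-filter applied to web text before training. The thresholds and features are made up for illustration; real pipelines typically use trained quality classifiers rather than hand-set rules:

```python
# Toy quality pre-filter for web text (illustrative heuristics only;
# real labs score documents with trained classifiers, not thresholds).

def quality_score(text: str) -> float:
    words = text.split()
    if not words:
        return 0.0
    # Share of alphabetic/whitespace characters: penalizes pages that
    # are mostly markup, symbols, or boilerplate debris.
    alpha = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    # Average word length in a sane range suggests natural prose.
    avg_len = sum(len(w) for w in words) / len(words)
    length_ok = 1.0 if 3.0 <= avg_len <= 10.0 else 0.5
    # Very short pages rarely carry much educational content.
    size_ok = 1.0 if len(words) >= 50 else 0.3
    return alpha * length_ok * size_ok

def keep(text: str, threshold: float = 0.6) -> bool:
    """Decide whether a document makes it into the training set."""
    return quality_score(text) >= threshold
```

Even a crude filter like this rejects the "random and terrible" end of Common Crawl; the labs' actual versions just do the same triage with much better scoring functions.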
Obviously the training data should preferably be high quality - but there you run into a (pseudo-)problem with "copyright" (pseudo, as I have insisted elsewhere, citing the right to have read whatever sits in any public library).
If there is some advantage to quantity, though, then achieving high quality raises questions about tradeoffs and workflows - sources whose authors are "free participants" could let odd data seep in.
And whether such data may be reflected in outputs remains an open question (probably tackled by work I have not read... Ars longa, vita brevis).
In Scandinavia, financial-related servers must be in the country! That always sounded like a sane approach. The whole business of putting your data on SaaS or AWS just seems like the same "let's shift the responsibility to a big player".
Any important data should NOT be on devices that are NOT physically within our jurisdiction.