Yeah, this has been confusing me a bit. I'm not complaining by ANY means, but why does it suddenly feel like everyone cares about data privacy in LLM contexts, way more than previous attitudes to allowing data to sit on a bunch of random SaaS products?
I assume because of the assumption that the AI companies will train off of your data, causing it to leak? But I thought all these services had enterprise tiers where they'll promise not to do that?
Again, I'm not complaining, it's good to see people caring about where their data goes. Just interesting that they care now, but not before. (In some ways LLMs should be one of the safer services, since they don't even really need to store any data, they can delete it after the query or conversation is over.)
Laundering of data through training makes it a more complicated case than a simple data theft or copyright infringement.
Leaks could be accidental, e.g. due to an employee logging in to their free-as-in-labor personal account instead of a no-training Enterprise account.
It's safer to have a complete ban on providers that may collect data for training.
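A complete ban is also easier to enforce in tooling than a per-account policy. As a minimal sketch (the provider hostnames and allowlist here are hypothetical, not any real service), a thin gateway can simply refuse requests to any provider that isn't contractually no-training, so an employee can't accidentally route data through a free personal account:

```python
# Minimal sketch of a provider allowlist gate (hypothetical hostnames).
# Idea: deny by default; only providers with a contractual no-training
# guarantee are reachable at all.

NO_TRAINING_PROVIDERS = {
    "enterprise-llm.example.com",  # covered by a no-training contract
}

def check_endpoint(host: str) -> bool:
    """Return True only if the host is on the no-training allowlist."""
    return host in NO_TRAINING_PROVIDERS

def send_prompt(host: str, prompt: str) -> str:
    if not check_endpoint(host):
        raise PermissionError(f"blocked: {host} is not a no-training provider")
    # ... the actual HTTPS call to the provider would go here ...
    return "ok"
```

The point is the default-deny shape, not the lookup: a per-account "use the Enterprise login" rule fails open on human error, while an egress allowlist fails closed.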
Their entire business model is based on taking other people's stuff. I can't imagine someone would willingly drown with the sinking ship just because they promised they would, when the entire cargo hold is filled with lifeboats.
Being caught doing that would be wildly harmful to their business - billions of dollars harmful, especially given the contracts they sign with their customers. The brand damage would be unimaginably expensive too.
There is no world in which training on customer data without permission would be worth it for AWS.
? One single random document, maybe, but in aggregate? I understood some parties were trying to scrape indiscriminately - the "big data" way. And if some of that input is sensitive and ends up stored somewhere in the NN, it may come out in an output - in theory...
Admittedly I never researched the details of this potential phenomenon - that anything personal might be memorized (not just George III but Random Randy) - but it seems possible.
There's a pretty common misconception that training LLMs is about loading in as much data as possible no matter the source.
That might have been true a few years ago but today the top AI labs are all focusing on quality: they're trying to find the best possible sources of high quality tokens, not randomly dumping in anything they can obtain.
> Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.
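For a rough idea of what "focusing on quality" can look like mechanically, here is a toy sketch of the kind of heuristic pre-filter applied to web text before training. The thresholds and features are made up for illustration; real pipelines typically use trained quality classifiers rather than hand-set rules:

```python
# Toy quality pre-filter for web text (illustrative heuristics only;
# real labs score documents with trained classifiers, not thresholds).

def quality_score(text: str) -> float:
    words = text.split()
    if not words:
        return 0.0
    # Share of alphabetic/whitespace characters: penalizes pages that
    # are mostly markup, symbols, or boilerplate debris.
    alpha = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    # Average word length in a sane range suggests natural prose.
    avg_len = sum(len(w) for w in words) / len(words)
    length_ok = 1.0 if 3.0 <= avg_len <= 10.0 else 0.5
    # Very short pages rarely carry much educational content.
    size_ok = 1.0 if len(words) >= 50 else 0.3
    return alpha * length_ok * size_ok

def keep(text: str, threshold: float = 0.6) -> bool:
    """Decide whether a document makes it into the training set."""
    return quality_score(text) >= threshold
```

Even a crude filter like this rejects the "random and terrible" end of Common Crawl; the labs' actual versions just do the same triage with much better scoring functions.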
Obviously the training data should preferably be high quality - but there you run into a (pseudo-)problem with "copyright" (pseudo, as I have insisted elsewhere, citing the right to have read whatever sits in any public library).
If there is some advantage to quantity, though, then achieving high quality raises questions about tradeoffs and workflows - sources whose authors are "free participants" could let odd data seep in.
And whether such data may be reflected in outputs remains an open question (probably tackled by work I have not read... Ars longa, vita brevis).
In Scandinavia, financial-related servers must be in the country! That always sounded like a sane approach. The whole business of putting your data on SaaS or AWS just seems like the same "let's shift the responsibility to a big player".
Any important data should NOT be on devices that are NOT physically within our jurisdiction.