
We need to consider the practicality of unlearning methods in real-world applications and their legal acceptance.

Given current technology and the advancements still needed to make unlearning feasible, there should probably be some kind of acceptable time-to-unlearn agreement that allows organizations to retrain or tune the model so that its responses no longer draw on the to-be-unlearned copyrighted content.

Ultimately, legal acceptance of unlearning may come down to deleting the offending data from the training set. It may be very challenging to otherwise prove, legally, through the proposed unlearning techniques that the model does not produce any kind of response involving the private data.

The actual data set contains the private data that violates privacy or copyright, and the model was trained on it, period. This means unlearning must involve deleting the documents/data in question and retraining.


> some kind of acceptable time-to-unlearn agreement

Why put the burden on end users? I think the technology should allow for unlearning and even "never learn about me in any future or derivative models".


No technology can guarantee 100% unlearning; the only 100% guarantee is when the data is deleted before the model is retrained. Legally, even 99.99% accuracy may not be acceptable; only 100% will do.


> the only 100% guarantee is when the data is deleted before the model is retrained

That’s not even a guarantee. A model can hallucinate information about anyone, and by sheer luck some of those hallucinations will be correct. And as a consequence of forging (see section 2.2.1) you’d never be able to prove whether the data was in the training set or not.


Or rather some legal fiction that you can pretend is 100%. You can never achieve a real 100% in practice, after all. Eg the random initialisation of weights might already encode all the 'bad' stuff you don't want. Extremely unlikely, but not strictly 0%.

The law cuts off at some point, and declares it 100%.


All this is technically correct, but it also means this technology is absolutely not ready to be used for anything remotely involving humans or end user data.


Why? We use random data in lots of applications, and there's always the theoretical probability that it could 'spell something naughty'.


A model's ability to unlearn information, or having its training environment configured so that something is never learned in the first place... that's not exactly the same as "oops, we logged your IP by accident".

A company is liable even if it has accidentally retained or failed to delete personal information. That's why we have a lot of standards and compliance regulations to ensure a bare minimum of practices and checks are performed. There is also the Cyber Resilience Act coming up.

If your tool is used by/for humans, you need beyond 100% certitude about exactly what happens with their data and how it can be deleted and updated.


You can never even get to 100% certainty, let alone 'beyond' that.

Google can't even get 100% certainty that they eg deleted a photo you uploaded. No AI involved. They can get an impressive number of 9s in their 99.9..%, but never 100%.

So this complaint, when taken to the absolute like you want to take it, says nothing about machine learning at all. It's far too general.


The technology is on par with a Markov chain that's grown a little too much. It has no notion of "you", not in the conventional sense at least. Putting the infrastructure in place to allow people (and things) to be blacklisted from training is all you can really do, and even then it's a massive effort. The current models are not trained in such a way that you can do this without starting over from scratch.
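
To make the analogy concrete, here is a minimal toy sketch (plain Python, entirely made-up data) of a word-level Markov chain. It only stores transition counts between adjacent tokens, so there is no per-person record you could simply delete; "forgetting" someone means rebuilding the counts from a scrubbed corpus, which is the same retrain-from-scratch problem writ small.

    import random
    from collections import defaultdict

    # Toy corpus; a real model would be trained on billions of tokens.
    corpus = "alice met bob . bob emailed alice . alice wrote a book .".split()

    # Transition counts: token -> {next_token: count}. There is no per-person
    # record here; "alice" exists only as counts entangled with other tokens.
    transitions = defaultdict(lambda: defaultdict(int))
    for cur, nxt in zip(corpus, corpus[1:]):
        transitions[cur][nxt] += 1

    def generate(start, length=6):
        out, cur = [start], start
        for _ in range(length):
            nxts = transitions.get(cur)
            if not nxts:
                break
            cur = random.choices(list(nxts), weights=list(nxts.values()))[0]
            out.append(cur)
        return " ".join(out)

    print(generate("alice"))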


That’s hardly accurate. Deep learning, among other things, is another type of lossy compression algorithm.

It doesn’t have a 1:1 mapping of each bit of information it’s been trained on, but you can very much extract a subset of that data. Which is why it’s easy to get DALL-E to recreate the Mona Lisa: variations of that image show up repeatedly in its training corpus.


Well then, maybe we shouldn't use the technology.


> We need to consider the practicality of unlearning methods in real-world applications and their legal acceptance.

> there should probably be some kind of acceptable time-to-unlearn agreement

A very important distinction is between data storage and data use/dissemination. Your comment hints at "use the current model until a retrained one is available and validated", which is an extremely dangerous idea.

Remember the old days of music albums distributed on physical media. Suppose a publisher creates a mix, stocks shelves with the album, and it becomes known that one of the tracks is not properly licensed. It would be expected to take some time to execute a distribution shutdown: distribute the order, clear the shelves, etc. However, the time needed for another production run with a modified tracklist would be entirely the problem of the publisher in question.

The time-to-unlearn window should only depend on the practicality of stopping information dissemination, not on getting an updated source ready. Otherwise companies will simply wait for the model to be retrained on a single 1080 and call it a day, which would effectively nullify the law.


How to deal with "unlearning" is the problem of the org running the illegal models. If I have submitted a GDPR deletion request, you had better honor it. If it turns out you stole copyrighted content, you should get punished for that. No one cares how much it might cost you to retrain your models. You put yourself in that situation to begin with.


Exactly, I think that is where it leads eventually. And that is what my original comment meant as well: "delete it" rather than applying some further techniques to "unlearn it", unless you can claim the unlearning is 100% accurate.


> No one cares how much it might cost you to retrain your models.

Playing tough? But it's misguided. "No one cares how much it might cost you to fix the damn internet"

If you wanted to retroactively fix facts, even if that could be achieved on a trained model, the information would still come back by way of RAG or web search. But we don't ask pure LLMs for facts and news unless we are stupid.

If someone wanted to pirate content, it would be easier to use Google search or torrents than generative AI. It would be faster, cheaper and higher quality. AIs are slow, expensive, rate-limited and lossy. AI providers have built-in checks to prevent copyright infringement.

If someone wanted to build something dangerous, it would be easier to hire a specialist than to ChatGPT their way into it. Everything LLMs know is also on Google Search. Achieve security by cleaning up the internet first.

The answer to all AI data issues - PII, copyright, dangerous information - comes back to the issue of Google search offering links to them and websites hosting this information online. You can't fix AI without fixing the internet.


What do you mean playing tough? These are existing laws that should be enforced. The number of people's lives ruined by the American government because they were deemed copyright infringers is insane. The US has made it clear that copyright infringement is unacceptable.

We now have a new class of criminals infringing on copyright on a grand scale via their models, and they seem desperate to avoid prosecution, hence all this bullshit.


1. You are assuming just training a model on copyrighted material is a violation. It is not. It may be under certain conditions but not by default.

2. Why should we aim for harsh punitive punishments just because it was done so in the past?


> 1. You are assuming just training a model on copyrighted material is a violation. It is not. It may be under certain conditions but not by default.

Using copyrighted content for commercial purposes should be a violation if it's not already considered to be one. No different from playing copyrighted songs in your restaurant without paying a licensing fee.

> 2. Why should we aim for harsh punitive punishments just because it was done so in the past?

I'd be fine with abolishing, or overhauling, the copyright system. This double standard of harsh penalties for consumers/small companies but not for big tech is bullshit, though.


> Using copyrighted content for commercial purposes should be a violation

so reading a book and using its contents to help you in your job would be a violation too, based on your logic


A business cannot read a book, and your machine learning model is not given human rights.


> A business cannot read a book

Assume the human read the book as part of their job. Is that using copyrighted material for commercial purposes?

If that doesn't count then I'm not sure why you brought up "commercial purposes" at all.

> This double standard of harsh penalties for consumers/small companies but not for big tech is bullshit, though.

Consumers and small companies get away with small copyright violations all the time. And those are still bigger than having your image be one of millions in a training set.


> Assume the human

Humans have rights. They get to do things that businesses, and machine learning models, or general automation, don't.

Just like you can sit in a library and tell people the contents of books when they ask, but if you go ahead and upload everything you get bullied into suicide by the US government[1]

> Consumers and small companies get away with small copyright violations all the time

Yeah, because people don't notice so they don't care. Everyone knows what these bigtech criminals are doing.

[1] https://en.wikipedia.org/wiki/Aaron_Swartz


> Humans have rights. They get to do things that businesses, and machine learning models, or general automation, don't.

So is that a yes to my question?

If humans are allowed to do it for commercial purposes, and it's entirely about human versus machine, then why did you say "Using copyrighted content for commercial purposes should be a violation" in the first place?

> Just like you can sit in a library and tell people the contents of books when they ask,

You know there's a huge difference between describing a book and uploading the entire contents verbatim, right?

If "tell the contents" means reading the book out loud, that becomes illegal as soon as enough people are listening to make it a public performance.

> but if you go ahead and upload everything you get bullied into suicide by the US government[1]

They did that to a human... So I've totally lost track of what your point is now.


> and it's entirely about human versus machine

It's not. Those were what's called examples. There is of course more to it. Stop trying to pigeonhole a complex discussion into a few talking points. There are many reasons why what OpenAI did is bad, and I gave you a few examples.


I'm not trying to be reductive or nitpick your example, I was trying to understand your original statement and I still don't understand it.

There's a reason I keep asking a very generic "why did you bring it up": it's because I'm not trying to pigeonhole.

But if it's not worth explaining at this point and the conversation should be over, that's okay.


A business is... made of people.


I posted my thoughts on another thread too: https://news.ycombinator.com/item?id=40231332

- Postgres documentation is one of the best-maintained database documentations. This also means that developers and committers ensure documentation changes go in with every relevant patch.

- Talk about bugs in Postgres compared to MySQL, Oracle, or other databases: bugs are comparatively fewer and generally rare, even if you are supporting Postgres services as a vendor with lots of customers. The reason is the effort a strong team of developers puts into not accepting anything and everything; there are strict best practices, reviews, discussions, tests, and a lot more that make it difficult for a patch or a feature to make it into a release.

- Ultimately, the easier it is for a patch to be accepted, the more bugs there will be.

I love Postgres the way it is today; it is still the DBMS of the year and developers' most loved database.

I wish we had more contributors, committers, developers, and also users and companies supporting Postgres, so that the time to push a feature gets faster and reasonably easier with more support.


Theoretically this sounds great. I would worry about scalability issues in the Bayesian learning model's practical implementation when dealing with the vast parameter space and data requirements of state-of-the-art models like GPT-3 and beyond.

Would love to see practical implementations on large-scale datasets and in varied contexts. I liked the use of Dirichlet distributions to approximate any prior over multinomial distributions.
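
For anyone who hasn't met that building block: a minimal sketch (toy NumPy example, not the paper's implementation) of why Dirichlet priors over multinomials are so convenient is that the posterior update is just addition of counts, and mixtures of such Dirichlets can then approximate more general priors.

    import numpy as np

    # Dirichlet prior over a 3-outcome multinomial (concentration parameters);
    # all-ones means a uniform prior over the probability simplex.
    alpha_prior = np.array([1.0, 1.0, 1.0])

    # Hypothetical observed counts for the three outcomes.
    counts = np.array([10, 2, 5])

    # Conjugacy: Dirichlet(alpha) prior + multinomial counts
    # -> Dirichlet(alpha + counts) posterior, no integration needed.
    alpha_post = alpha_prior + counts

    # Posterior mean estimate of the category probabilities.
    print(alpha_post / alpha_post.sum())   # ~ [0.55, 0.15, 0.30]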


There is possibly a need for a more unified standard across different implementations, particularly from a software development and API design perspective.

When parsing and manipulating JSON data, the syntactical discrepancies and behavioural differences between various libraries might need a common specification for interoperability.

Features like type-aware queries or schema validation may be very helpful.
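
As one concrete example of the kind of discrepancy I mean (a toy sketch using only Python's stdlib json module): the JSON spec leaves duplicate object keys effectively undefined, so libraries disagree on whether to keep the first value, the last value, or reject the document, which is exactly where a common specification would help.

    import json

    doc = '{"id": 1, "id": 2}'

    # Python's stdlib silently keeps the last duplicate key; other libraries
    # and languages keep the first, or reject the document outright.
    print(json.loads(doc))  # {'id': 2}

    # Surfacing the discrepancy by inspecting the raw key/value pairs.
    def reject_duplicates(pairs):
        keys = [k for k, _ in pairs]
        if len(keys) != len(set(keys)):
            raise ValueError("duplicate keys: %r" % keys)
        return dict(pairs)

    json.loads(doc, object_pairs_hook=reject_duplicates)  # raises ValueError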


A few things to note

- Postgres documentation is one of the best-maintained database documentations. This also means that developers and committers ensure documentation changes go in with every relevant patch.

- Talk about bugs in Postgres compared to MySQL, Oracle, or other databases: bugs are comparatively fewer and generally rare, even if you are supporting Postgres services as a vendor with lots of customers. The reason is the effort a strong team of developers puts into not accepting anything and everything; there are strict best practices, reviews, discussions, tests, and a lot more that make it difficult for a patch to make it into a release.

- Ultimately, the easier it is for a patch to be accepted, the more bugs there will be.

I love Postgres the way it is today; it is still the DBMS of the year and developers' most loved database.

I wish we had more contributors, committers, developers, and also users and companies supporting Postgres, so that the time to push a feature gets faster and reasonably easier with more support.


Is this a coincidence? Or a pattern leading to an outlier?


GitHub: https://github.com/HexaCluster/pgdsat

Inviting contributors and users to try it out.


As expected, PostgreSQL is the DBMS of the Year 2023: https://db-engines.com/en/blog_post/106

It is incredible to see its popularity only growing.


It is a simple use case, but it can be extended to more powerful AI chatbots.


I believe that great examples always arise from open source projects. Good design and modular code play a great role in increasing the ability to customize or add features, and also in increasing collaboration from more volunteers. Developer life gets more interesting, with massively improved code quality, when some tough decisions are taken much earlier.


There are great examples in open source, but my main issue is that open source projects are usually frameworks. Frameworks solve a lower-level problem than end-user application code. I often see excess abstraction and unneeded flexibility in business code that just makes the problem domain harder to understand, without ever providing value.

