
How do you prove there is no PII in the ML model?

It has been proven countless times that it's possible to extract training data from models. I can't see how you can prove the opposite, except, maybe, with federated learning (but even then, you need a good "ratio" of noise).
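
To make that noise "ratio" concrete, here is a rough sketch of the usual mechanism, differential-privacy-style gradient perturbation; the function names and values below are illustrative assumptions, not any vendor's actual setup:

    import numpy as np

    def privatize_gradients(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
        """Clip each example's gradient, then add Gaussian noise before averaging.

        The noise_multiplier relative to clip_norm is the "ratio" that bounds how
        much any single training record can influence the model, and hence how
        hard it is to extract that record later.
        """
        clipped = []
        for g in per_example_grads:
            norm = np.linalg.norm(g)
            clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
        summed = np.sum(clipped, axis=0)
        noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
        return (summed + noise) / len(per_example_grads)

    # Toy usage: 8 fake per-example gradients of dimension 4.
    grads = [np.random.randn(4) for _ in range(8)]
    print(privatize_gradients(grads))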



I suppose the models might theoretically be at risk if the training data can be extracted, but I don’t think this will happen in practice because it’s so far removed from current GDPR practice. Someone would have to prove their protected information is inside a model before they’d have a chance. Beyond that, I am not a lawyer.


Of course you can’t prove that some data cannot be de-anonymized unless there are duplicate entries. However, GDPR explicitly encourages anonymization, or “pseudonymization”, which suggests that reasonable attempts to keep data generic are considered lawful under this particular law. People have already pointed out that GDPR’s language here is too vague and makes bad assumptions about how identifying a combination of quasi-identifiers can be.


GDPR encourages pseudonymization as a best practice, but also draws a sharp distinction between anonymous and pseudonymous data. Pseudonymous data is still personal data and subject to all other obligations under GDPR. Any data that's pseudonymous would still be subject to the deletion order.


I shouldn’t have mentioned pseudonymization; that wasn’t my point. It doesn’t change the fact that the law is vague and to some degree contradicts itself by suggesting that data can be anonymous. There is a real overlap between anonymized data and personally identifiable data. The way the GDPR is written, it would be extremely difficult to prosecute someone for a breach of data they had taken best-practice steps to anonymize. The law wasn’t written to handle ML-based de-anonymization. It also doesn’t help that if you Google PII, the hundreds and hundreds of examples are things like name and address, nothing remotely close to anonymous-yet-identifiable data.


> How do you prove there is no PII in the ML model?

Is "innocent until proven guilty" not a maxim in European justice?


They have just been found guilty; that's what the ruling is, and the outcome of the ruling is that they must delete data derived from the data in question. The ML models took that data as input, so I think it's fair to say that if they want to argue the models do not derive from it despite that, they should bear the burden of proof.


It's not possible to discuss the legality of something until a judge has said we're allowed to discuss it? What?

So I can murder someone, and say "Innocent until proven guilty", and forbid anyone from discussing whether I'm a murderer, until I'm actually found guilty?

But OK, it sounds like you're nitpicking my words, so let me rephrase the comment you're replying to.

"Considering that we have dozens of research papers showing that public models contain PII, how can we trust that FAANG's private models doesn't without auditing? It sounds safe to assume it does contain PII"


> So I can murder someone, and say "Innocent until proven guilty", and forbid anyone from discussing whether I'm a murderer, until I'm actually found guilty?

In many European countries it's in fact against the law to publish the name of a suspect until a court has found them guilty. And in some it's even illegal to publish the name at all.


That's not what I said. I said that it would presumably require a trial to prove that the ML models contain PII, as opposed to the government being able to assume they do and demanding the company prove they don't to some arbitrary standard.


Generally not in administrative law. Executive authorities (e.g. the tax office) make a decision and you can appeal to an administrative court, but you have to prove why the decision was wrong.


OK, that's interesting. Thank you.


If you choose to handle PII, you have to keep track of where it ends up. Feeding PII into a black box and pretending it isn't there anymore, without taking reasonable precautions, doesn't seem like it should be an option, especially with something like ML that is known to leak its input. If you don't know, the safe assumption should be that the ML model can leak PII and should be destroyed along with the training data.
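
For anyone wondering what "known to leak its input" looks like in practice, here is a toy, illustrative sketch of a loss-threshold membership-inference test; the synthetic data and the deliberately overfit model are my own assumptions for the demo, not anyone's production system:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def make_data(n):
        X = rng.normal(size=(n, 5))
        # Noisy labels: the model cannot learn them perfectly, so an overfit
        # model ends up memorizing individual training records.
        y = (X[:, 0] + rng.normal(scale=1.0, size=n) > 0).astype(int)
        return X, y

    X_train, y_train = make_data(300)
    X_out, y_out = make_data(300)   # records that were never in training

    model = DecisionTreeClassifier().fit(X_train, y_train)   # deliberately overfit

    def per_record_loss(model, X, y):
        p = model.predict_proba(X)[np.arange(len(y)), y]
        return -np.log(np.clip(p, 1e-12, 1.0))

    # Training records get near-zero loss (memorized); unseen records do not.
    # Thresholding this gap is a basic membership-inference attack.
    print("mean loss, training records:", per_record_loss(model, X_train, y_train).mean())
    print("mean loss, unseen records  :", per_record_loss(model, X_out, y_out).mean())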


It is; however, in this case innocence means having a complete paper trail of your data processing as defined under GDPR. Not having such a paper trail is one of the things the IAB was found guilty of in this ruling.


Not in money laundering.



