The data has already been sold off to the real customers (i.e., not you and me) [1]. You can (and should) request a deletion, but the damage has already been done.
This is false; we've sold data with PII to no one. Or, at most, it is misleading: the page you linked to even says, "It is selling de-identified, aggregate data for research, if you give them consent."
To what extent, and by what method, is it "de-identified"? Plenty of such schemes are easy to circumvent, especially with a large enough pool of data. Given the nature of genetics in particular, positively identifying a single individual can be used to unmask whole families. And depending on the anonymization, this is a task very well suited to 'AI'.
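To make that concrete, here is a toy sketch of the classic linkage attack, in the style of Sweeney's famous re-identification of a governor's medical record: join the "de-identified" release against a public roster on quasi-identifiers. All records below are invented.

    # Toy linkage attack: join a "de-identified" release against a
    # public roster on quasi-identifiers (ZIP, birth date, sex).

    deidentified_release = [
        {"pseudo_id": "a91f", "zip": "60614", "dob": "1984-03-02",
         "sex": "F", "genotype": "..."},
        {"pseudo_id": "7c22", "zip": "60614", "dob": "1979-11-20",
         "sex": "M", "genotype": "..."},
    ]
    public_roster = [  # e.g. a voter file or a marketing list
        {"name": "Jane Roe", "zip": "60614", "dob": "1984-03-02", "sex": "F"},
        {"name": "John Doe", "zip": "60657", "dob": "1990-06-15", "sex": "M"},
    ]

    def reidentify(release, roster):
        by_key = {}
        for person in roster:
            key = (person["zip"], person["dob"], person["sex"])
            by_key.setdefault(key, []).append(person["name"])
        hits = []
        for rec in release:
            names = by_key.get((rec["zip"], rec["dob"], rec["sex"]), [])
            if len(names) == 1:  # a unique combination re-identifies
                hits.append((rec["pseudo_id"], names[0]))
        return hits

    print(reidentify(deidentified_release, public_roster))
    # [('a91f', 'Jane Roe')]

The more auxiliary data an adversary holds, the more combinations become unique, which is why a large pool of data makes this easier, not harder.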
Basically, if you imagine this as a table of "user's name, date of birth, and address" keys mapping to genomic and other data, the key was replaced with a random identifier that could not be trivially joined to recover the user name, date of birth, and address.
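A minimal sketch of that scheme (not our actual pipeline; field names and details are made up):

    import secrets

    # Minimal sketch of the key replacement described above. The
    # token -> PII table is kept separate from anything shared.

    pii_vault = {}

    def pseudonymize(record):
        token = secrets.token_hex(8)  # random, not derived from the PII
        pii_vault[token] = {k: record.pop(k)
                            for k in ("name", "dob", "address")}
        record["participant_id"] = token
        return record

    shared = pseudonymize({
        "name": "Jane Roe", "dob": "1984-03-02", "address": "123 Main St",
        "genotype": "...",
    })
    print(shared)  # no direct identifiers; joining back needs the vault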
These systems are not robust against motivated and capitalized adversaries.
I can go to a data broker and purchase access to de-identified EMR data for most of the U.S. population. There are much more useful de-identified datasets around than ours for anyone motivated to attempt re-identification. That data is all bought and sold without anyone's consent, and it is all fine under HIPAA.
I wasn't trying to convince anybody otherwise. I find the noise about 23&Me's data pretty uninteresting. I published my own genome (through PGP) for anybody to download, and I know that people have identified me from my post https://news.ycombinator.com/item?id=7641201 and other comments.
That's more or less what I expected. Ah well, the odds that this becomes something of significance to most people seem remote, but either way you can't unring the bell.
Here "de-identified" means stripped of PII (name, address, phone number, email, etc). You are correct that genetic information is intrinsically identifiABLE (in the sense that it is stable and uniquely distinguishing for individuals). When we've shared individual-level data with a partner, it was with consent of the participants involved, and under a contract that prohibits re-identification.
I would not argue with you that it is "selling your data". But I also think there are meaningful differences in harm level among kinds of "selling your data": fully identified data has more potential for harm than de-identified data, where you have to assume an adversary willing to violate contracts and/or the law to learn about particular individuals.
There is considerable confusion about the distinction between aggregated data and de-identified individual-level data. I would say that I don't consider sufficiently aggregated data to be "your data" in a particularly meaningful personal sense of "your", even though there are still some re-identification risks from these types of datasets.
I was contesting the statement that "The data has already been sold... [and] the damage is already done", which I still think is highly misleading.
I don't think 23andme has been casual or callous with people's data; they are probably a step above the average firm that handles this sort of data. The consent process is well-documented.
My complaint about 23&Me has always been Anne Wojcicki's naivete about the utility of genomic data for health treatment, as well as whether her company needed to work with the government (she wrote a useful retrospective that helped shed light: https://hbr.org/2020/09/23andmes-ceo-on-the-struggle-to-get-...).
Most of us who worked in genomics at the time were sort of dumbfounded by her approach and wanted to know what magic she had that let her get as far as she did with the company and its product.
I don't have any problem with the family history side of the product; that's how my dad found out that he had a number of unexpected children (IVF through donated sperm) who were able to connect with him years after their conception. And I really wish disease genetics had turned out to be far more straightforward as I've long been fascinated with how complex phenotypes arise from genomes.
The article linked at the top of this thread was specifically about aggregated data sharing, not individual-level data sharing. The consent document you're linking to did not exist when that article was written; our general research consent only covers aggregated data sharing. It was only in 2018 that we added the second-level consent for individual-level data sharing.
I think we've generally been pretty careful to present only scientifically well supported results, which has not helped the perceived utility of our health product. There are certainly valid arguments to be had about the business model.
Indeed, but here "re-identification" generally means the sort of attack where you have an aggregated genomic dataset, you already have access to full genomic data for a target individual, and you use the aggregated dataset to infer something about that target that you didn't know, like whether or not they participated in that study. Not to entirely minimize this sort of attack, but the NIH decided it was a sufficiently low risk that most of the sorts of datasets it applies to (like GWAS) are routinely shared with no access controls.
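For the curious, this is the attack described by Homer et al. (2008), the paper that led NIH to temporarily pull aggregate GWAS data behind access controls. A crude simulated sketch of the idea, with all data invented (real attacks need many more SNPs and proper test statistics):

    import random

    random.seed(0)
    N_SNPS, N_STUDY = 4000, 400

    # reference-population allele frequencies (simulated)
    ref_freq = [random.uniform(0.05, 0.95) for _ in range(N_SNPS)]

    def draw_genotype(freqs):
        # per-SNP allele dosage / 2: one of 0.0, 0.5, 1.0
        return [(int(random.random() < f) + int(random.random() < f)) / 2
                for f in freqs]

    target = draw_genotype(ref_freq)
    others = [draw_genotype(ref_freq) for _ in range(N_STUDY - 1)]

    def published_freqs(members):
        # what an aggregated study release would report per SNP
        return [sum(g[j] for g in members) / len(members)
                for j in range(N_SNPS)]

    freq_with = published_freqs(others + [target])  # target participated
    freq_without = published_freqs(others)          # target did not

    def membership_score(y, f_study, f_ref):
        # positive: the study frequencies sit systematically closer to
        # the target than the reference does -> evidence of membership
        return sum(abs(yj - fr) - abs(yj - fs)
                   for yj, fs, fr in zip(y, f_study, f_ref))

    print("target in study: ",
          round(membership_score(target, freq_with, ref_freq), 2))
    print("target not in it:",
          round(membership_score(target, freq_without, ref_freq), 2))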
Are you asking about methods to improve the privacy of aggregated datasets? They seem not to be super popular with people in the field, I think because they sharply curtail how data can be used compared to having access to datasets with no strong privacy guarantees. The maybe more impactful recent shift is toward "trusted research environments", where you get to work with a particular dataset only in a controlled setting with actively monitored egress.
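For context on what those methods look like, the best-known formal one is differential privacy: each released aggregate gets noise calibrated so that no single participant shifts it much. A minimal Laplace-mechanism sketch (epsilon, cohort sizes, and frequencies all invented) makes the utility cost obvious:

    import math
    import random

    random.seed(1)

    def dp_allele_frequency(alt_allele_count, n_people, epsilon):
        # each person contributes at most 2 alleles, so the count has
        # sensitivity 2 and the Laplace noise scale is 2 / epsilon
        u = random.random() - 0.5
        noise = -(2 / epsilon) * math.copysign(1, u) * math.log(1 - 2 * abs(u))
        noisy = min(max(alt_allele_count + noise, 0), 2 * n_people)
        return noisy / (2 * n_people)

    # small cohorts get badly distorted answers, which is a big part of
    # why these methods are unpopular with working geneticists
    for n in (100, 1000, 10000):
        est = dp_allele_frequency(int(0.30 * 2 * n), n, epsilon=0.1)
        print(f"n={n:>5}: true freq 0.300, dp release {est:.3f}")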
Homomorphic encryption enables standard GWAS workflows (not just summary stats) while "sharing" all genotypes and phenotypes. Richard Mott and colleagues have a paper on this method.
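I can't speak to the specific scheme in that paper, but the core trick (additive homomorphism) is easy to sketch with a toy Paillier implementation: ciphertexts multiply, plaintexts add, so a server can total genotype dosages it can never read. The key size and data below are purely illustrative and wildly insecure.

    import math
    import random

    random.seed(7)

    def is_prime(n, rounds=20):
        # Miller-Rabin probabilistic primality test
        if n < 2:
            return False
        for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
            if n % p == 0:
                return n == p
        d, r = n - 1, 0
        while d % 2 == 0:
            d, r = d // 2, r + 1
        for _ in range(rounds):
            x = pow(random.randrange(2, n - 1), d, n)
            if x in (1, n - 1):
                continue
            for _ in range(r - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False
        return True

    def rand_prime(bits):
        while True:
            p = random.getrandbits(bits) | (1 << (bits - 1)) | 1
            if is_prime(p):
                return p

    def keygen(bits=128):
        p, q = rand_prime(bits // 2), rand_prime(bits // 2)
        while p == q:
            q = rand_prime(bits // 2)
        n = p * q
        lam = math.lcm(p - 1, q - 1)
        return n, (lam, pow(lam, -1, n))  # public n, private (lam, mu)

    def encrypt(n, m):
        r = random.randrange(1, n)
        return (1 + m * n) * pow(r, n, n * n) % (n * n)

    def decrypt(n, priv, c):
        lam, mu = priv
        return (pow(c, lam, n * n) - 1) // n * mu % n

    dosages = [0, 1, 2, 2, 1, 0, 1]        # per-person genotype dosages
    n, priv = keygen()
    total = 1                               # valid encryption of 0 (r = 1)
    for c in (encrypt(n, d) for d in dosages):
        total = total * c % (n * n)         # ciphertext product = plaintext sum
    print(decrypt(n, priv, total), "==", sum(dosages))

The well-known catch is cost: computing on ciphertexts is orders of magnitude slower than on plaintext, which is part of why trusted research environments are the more common compromise in practice.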
[1] https://gizmodo.com/23andme-is-selling-your-data-but-not-how...