Using nothing more than a simple vial of saliva, millions of people have created DNA profiles on genealogy websites.
But this wealth of information is effectively inaccessible to genetics researchers, with the sites painstakingly safeguarding their databases, fearful of a leak that could cost them dearly in terms of credibility.
This problem of access is one that Bonnie Berger, a professor of mathematics at the Massachusetts Institute of Technology, and her colleagues think they can solve, with a new cryptographic system to protect the information.
"We're currently at a stalemate in sharing all this genomic data," Berger told AFP. "It's really hard for researchers to get any of their data, so they're not really helping science."
"No one can gain access to help them find the link between genetic variations and disease," she said. "But just think what could happen if we could leverage the millions of genomes out there."
The new cryptographic method, described Thursday in the U.S. journal Science, was developed in connection with the search for drug candidates in datasets from pharmaceutical companies.
In earlier work, the researchers showed that the concept could be applied to DNA profiles.
Labs are constantly looking to identify links between millions of drug compounds and the tens of thousands of proteins in the human body, to identify good candidates for certain drugs.
But they don't want their competitors to know what they are working on. Often, their drug compounds are patented and secret. So they don't share much.
SECRET SHARING
With the researchers' new scalable technique, the first based on a secure "neural network," Berger explained, labs could share their sensitive data by dividing it among several servers, which would then run computations to find new links based on the pooled data as a whole.
But no entity would be able to access the initial inputs, which might include proprietary information -- provided the entities do not collude with one another.
Each entity would get results based on its contributions.
Berger says their technique is based on a cryptographic framework called "secret sharing."
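To give a sense of how secret sharing can work in general, here is a minimal Python sketch of one common variant, additive secret sharing. It is an illustration under stated assumptions, not the researchers' actual protocol: the modulus `PRIME`, the function names `share`, `add_shared` and `reconstruct`, the three-server setup and the sample values are all hypothetical. Each lab splits its private number into random-looking pieces, one per server; the servers add the pieces they hold, and only the combined result is ever reconstructed.

```python
import secrets

# Hypothetical modulus chosen for illustration: the Mersenne prime 2**61 - 1.
PRIME = 2**61 - 1


def share(value, n_servers):
    """Split `value` into n_servers additive shares modulo PRIME.

    Any n_servers - 1 of the shares are uniformly random, so they reveal
    nothing about `value` unless all servers pool their shares.
    """
    shares = [secrets.randbelow(PRIME) for _ in range(n_servers - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def add_shared(a_shares, b_shares):
    """Each server adds the two shares it holds, without seeing the inputs."""
    return [(a + b) % PRIME for a, b in zip(a_shares, b_shares)]


def reconstruct(shares):
    """Combine all shares to recover the underlying value."""
    return sum(shares) % PRIME


# Two labs (hypothetical values) each split a private number across 3 servers.
lab_a_value = 42
lab_b_value = 100
a_shares = share(lab_a_value, 3)
b_shares = share(lab_b_value, 3)

# The servers compute on shares; only the final sum is reconstructed.
sum_shares = add_shared(a_shares, b_shares)
total = reconstruct(sum_shares)
print(total)  # 142
```

In this sketch, the inputs stay hidden only as long as the servers do not combine their shares, which mirrors the no-collusion condition described above. Addition is the easy case; more involved computations, such as training a neural network on shared data, would need additional machinery not shown here.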
The researchers introduced new optimization and artificial intelligence techniques to be able to handle the millions of chemical compounds or genomes that need to be analyzed.
"We can do something that was absolutely not possible before," the MIT professor said, noting that existing cryptographic methods involve unwieldy large-scale computer calculations and communications costs.
They also only work for thousands of data points, not millions.
The same technique could allow the major genealogy websites, like Ancestry.com and 23andMe, to open their databases to researchers and pool them.
Ancestry has more than 10 million registered profiles, while 23andMe has more than five million.
Berger told AFP she had been in contact with both companies about her findings.
Ancestry, 23andMe, MyHeritage and others offer physical, genealogical and sometimes even medical data -- such as a history of cancer in the family. It is this information that researchers want to match against certain genetic variations.
23andMe has taken a step in this direction, via a partnership with pharmaceuticals group GlaxoSmithKline (GSK). A 23andMe spokesman told AFP that scientific collaborations have led to the publication of about 100 research articles.
But the company only offers researchers a statistical summary of the results, in this format: "30 percent of males aged 20-35 have reported being diagnosed with X disease and have Y variants/mutations in common."
And user participation is on a voluntary basis, which limits the scope of the findings.
PRIVACY CONCERNS
The intersection of genetics and genealogy has made headlines in the United States. Last week, a new study showed that half of all Americans could be identified from relatives' DNA samples found in GEDmatch, a free website.
This technique has been a boon for U.S. police forces, which have used it to identify suspects in cold cases dating back decades, such as the "Golden State Killer," who is blamed for 12 murders and more than 50 rapes starting in the mid-1970s.
It can also be used by people looking for their biological parents.
But what happens if the data falls into the wrong hands? Hackers could potentially exploit the information for nefarious ends. Or what if insurance companies and others used it to discriminate against customers?
Benjamin Berkman, a bioethics researcher at the National Institutes of Health, told AFP there is "not really evidence of systemic discrimination," but noted that "doesn't mean that it couldn't become a problem."
"People are very worried about genomic privacy. It's something that they cite as a reason why they're not getting genetic testing, or they're not enrolling in research," Berkman said.