A utility maximizing and privacy preserving approach for protecting kinship in genomic databases
Embargo Release Date: 2020-03-01
Please cite this item using this persistent URL: http://hdl.handle.net/11693/32934
Okan, Öznur Taştan
Rapid and low-cost sequencing of genomic data enables widespread use of genomic information in research studies and personalized customer applications, where people share their genomic data in public databases. Although the identities of the participants are anonymized in these databases, sensitive information about individuals can still be inferred if the stored data is not shared in a privacy-preserving manner. Proper handling of kinship information is one such issue that needs to be addressed to avoid exposure of privacy-sensitive information. In this work, we show that kinship relationships can be inferred using only the publicly available single nucleotide polymorphism (SNP) data of anonymized individuals. We present two scenarios that result in privacy leakage: one based on the genomic similarity of the individuals, the other on the outlier allele pair counts of the family members. In the proposed models, we assume that the family members join the database sequentially, and we systematically identify minimal portions of data to withhold as new participants are added to the database. Choosing the proper positions to hide is cast as an optimization problem, in which the number of positions to mask is minimized subject to several privacy constraints that ensure the kinship information between any pair of family members is not leaked. We evaluate the proposed technique on real genomic data of two different families of size five by considering different sequential arrival orders for the family members. Results indicate that concurrent sharing of data pertaining to a parent and an offspring results in a high risk of privacy leakage, whereas sharing data from more distant relatives together is often safer. We also show that different arrival orders of the members can lead to different levels of privacy risk and that the utility of the shared data can vary.
Adopting the proposed method would allow genomic data to be shared without compromising kinship privacy in future research studies and public genomic services.
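To make the similarity-based scenario concrete, the following is a minimal sketch and not the method of this work. It assumes genotypes are encoded as minor-allele counts (0, 1 or 2) per SNP, uses a simple identity-by-state score as the similarity measure, and substitutes a greedy heuristic for the optimization problem described above. The function names `ibs_similarity` and `greedy_mask`, the encoding, and the threshold are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed encoding: minor-allele count per SNP (0, 1 or 2); None marks a masked position.
Genotype = List[Optional[int]]


def ibs_similarity(a: Genotype, b: Genotype) -> float:
    """Mean identity-by-state score over positions visible in both genotypes."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return 0.0
    # Each position contributes 1 - |x - y| / 2, the fraction of alleles in common.
    return sum(1 - abs(x - y) / 2 for x, y in shared) / len(shared)


def greedy_mask(new: Genotype, existing: Genotype,
                threshold: float) -> Tuple[Genotype, List[int]]:
    """Hide positions of `new` until its similarity to `existing` is at most `threshold`.

    Greedy heuristic (an assumption, not an exact minimization): positions where
    the two genotypes agree most are hidden first.
    """
    masked = list(new)
    candidates = [i for i, (x, y) in enumerate(zip(new, existing))
                  if x is not None and y is not None]
    # Smallest allele-count difference first, i.e. highest similarity contribution.
    candidates.sort(key=lambda i: abs(new[i] - existing[i]))
    hidden: List[int] = []
    for i in candidates:
        if ibs_similarity(masked, existing) <= threshold:
            break
        masked[i] = None
        hidden.append(i)
    return masked, hidden
```

Calling `greedy_mask(child, parent, threshold)` when a new family member arrives returns the genotype to publish and the indices that were withheld. An exact formulation would instead minimize the number of hidden positions subject to all pairwise privacy constraints, for example as an integer program.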