Genome reconstruction in beacons using summary statistics
Abstract
Genomic data-sharing beacons are designed to safeguard individual privacy while promoting scientific discovery, yet they remain critically vulnerable to genome reconstruction attacks that leverage publicly released summary statistics. This thesis systematically advances the understanding and effectiveness of these attacks and challenges the assumption that releasing simple allele frequencies (AFs) is secure. The fundamental flaw is the beacon protocol's failure to account for linkage disequilibrium (LD), the statistical correlation between single-nucleotide polymorphisms (SNPs), which allows a malicious party to infer individual-level data by combining summary statistics. Our foundational contribution established the feasibility of this threat with a two-stage optimization-based algorithm that uses public LD and AFs, achieving an F1-score of 70% and confirming the inherent privacy risk. Building on this, the thesis introduces a more powerful methodology: a single-stage joint optimization framework that unifies the objectives of SNP correlation and allele frequency alignment. This formulation raises reconstruction performance to an average F1-score of 71.4% and also yields substantial computational savings: reconstructing 2,000 SNPs across 100 individuals requires 7.4 hours instead of 10 hours, a 26% reduction in runtime. Together, these results show that genome reconstruction attacks against beacon protocols are becoming more practical and sophisticated, underscoring the urgent need for robust, adaptive, and correlation-aware defense mechanisms to protect the integrity and privacy of genomic data infrastructure.
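To make the idea of a single joint objective concrete, the following toy sketch combines an allele-frequency term and an SNP-correlation term into one loss and minimizes it with a greedy bit-flip search. It is an illustration under stated assumptions, not the algorithm developed in the thesis: the binary genotype encoding, the use of Pearson correlation as a stand-in for LD, the weight `lam`, the local-search strategy, and all function names (`summary_stats`, `joint_loss`, `reconstruct`, `f1_score`) are hypothetical choices made here for brevity.

```python
import random
from itertools import combinations


def column(G, j):
    """Return SNP column j of genotype matrix G (rows = individuals)."""
    return [row[j] for row in G]


def pearson(x, y):
    """Pearson correlation; returns 0.0 by convention if a column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


def summary_stats(G):
    """Allele frequencies and pairwise correlation matrix of a 0/1 matrix."""
    n, m = len(G), len(G[0])
    af = [sum(column(G, j)) / n for j in range(m)]
    R = [[pearson(column(G, j), column(G, k)) for k in range(m)] for j in range(m)]
    return af, R


def joint_loss(G, af, R, lam=1.0):
    """Assumed joint objective: AF mismatch plus lam times correlation mismatch."""
    n, m = len(G), len(af)
    af_term = sum((sum(column(G, j)) / n - af[j]) ** 2 for j in range(m))
    ld_term = sum(
        (pearson(column(G, j), column(G, k)) - R[j][k]) ** 2
        for j, k in combinations(range(m), 2)
    )
    return af_term + lam * ld_term


def reconstruct(af, R, n, sweeps=20, lam=1.0, seed=0):
    """Greedy bit-flip search: keep a flip only if it lowers the joint loss."""
    rng = random.Random(seed)
    m = len(af)
    # Initialize each entry independently from the published allele frequency.
    G = [[1 if rng.random() < af[j] else 0 for j in range(m)] for _ in range(n)]
    best = joint_loss(G, af, R, lam)
    for _ in range(sweeps):
        improved = False
        for i in range(n):
            for j in range(m):
                G[i][j] ^= 1  # tentatively flip one genotype
                cand = joint_loss(G, af, R, lam)
                if cand < best:
                    best, improved = cand, True
                else:
                    G[i][j] ^= 1  # revert the flip
        if not improved:
            break  # local minimum reached
    return G, best


def f1_score(truth, guess):
    """F1 over minor-allele entries, assuming rows are already matched."""
    pairs = [(t, g) for tr, gr in zip(truth, guess) for t, g in zip(tr, gr)]
    tp = sum(1 for t, g in pairs if t == 1 and g == 1)
    fp = sum(1 for t, g in pairs if t == 0 and g == 1)
    fn = sum(1 for t, g in pairs if t == 1 and g == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


if __name__ == "__main__":
    # Made-up toy data: 6 individuals, 3 SNPs.
    truth = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 1, 0], [1, 0, 0]]
    af, R = summary_stats(truth)
    G, loss = reconstruct(af, R, n=len(truth))
    print("final joint loss:", loss)
```

Because both terms are evaluated in the same loss, each candidate change is accepted or rejected on its combined effect on frequencies and correlations, rather than fitting one statistic first and the other afterwards. Summary statistics do not determine the order of individuals, so any F1 comparison against ground truth in a sketch like this presupposes that reconstructed rows have been matched to true rows. The exhaustive re-evaluation of the loss after every flip is only workable at toy scale.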