SPADIS: selecting predictive and diverse SNPs in GWAS
Author
Yılmaz, Serhan
Advisor
Çiçek, A. Ercüment
Date
2018-08
Publisher
Bilkent University
Language
English
Type
Thesis
Abstract
Phenotypic heritability of complex traits and diseases is seldom explained by individual
genetic variants identified in genome-wide association studies (GWAS).
Many methods have been developed to select a subset of variant loci, which are
associated with or predictive of the phenotype. Selecting connected Single Nucleotide
Polymorphisms (SNPs) on SNP-SNP networks has been proven successful
in finding biologically interpretable and predictive SNPs. However, we argue that
the connectedness constraint favors selecting redundant features that affect similar
biological processes and therefore does not necessarily yield better predictive
performance. To this end, we propose a novel method called SPADIS that favors
the selection of remotely located SNPs in order to account for their complementary
effects in explaining a phenotype. SPADIS selects a diverse set of loci on a
SNP-SNP network. This is achieved by maximizing a submodular set function
with a greedy algorithm that ensures a constant factor (1 − 1/e) approximation
to the optimal solution. We compare SPADIS to the state-of-the-art method
SConES, on a dataset of Arabidopsis thaliana with continuous
flowering time
phenotypes. SPADIS has better average phenotype prediction performance in 15
out of 17 phenotypes when the same number of SNPs are selected and provides
consistent improvements across multiple networks and settings on average. Moreover,
it identifies more candidate genes and runs faster. We also investigate the
use of Hi-C data to construct SNP-SNP networks in the context of the SNP selection
problem for the first time, which yields improvements in regression performance
across all methods.
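
The greedy selection idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the SPADIS objective, so the code below assumes a stand-in: a weighted-coverage function, which is monotone and submodular and therefore admits the (1 − 1/e) greedy guarantee under a cardinality constraint. The names greedy_select, scores and neighbors, and the toy six-SNP path network, are hypothetical and not taken from the thesis.

```python
def greedy_select(scores, neighbors, k):
    """Greedily pick up to k SNPs under a weighted-coverage objective.

    Assumed stand-in objective (not the SPADIS objective): f(S) is the total
    score of all SNPs that are in S or adjacent to a member of S. Picking two
    adjacent SNPs covers overlapping regions, so the second pick gains little,
    which pushes the selection toward remotely located SNPs.
    """
    covered = set()
    selected = []
    for _ in range(min(k, len(scores))):
        best, best_gain = None, 0.0
        for v in range(len(scores)):
            if v in selected:
                continue
            region = {v} | neighbors[v]
            # Marginal gain: score of the part of the region not yet covered.
            gain = sum(scores[u] for u in region - covered)
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:  # no remaining SNP adds anything
            break
        selected.append(best)
        covered |= {best} | neighbors[best]
    return selected


# Toy path network 0-1-2-3-4-5 with hypothetical association scores.
scores = [1.0, 3.0, 2.5, 1.0, 2.0, 1.5]
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(greedy_select(scores, neighbors, 2))  # [1, 4]
```

In this toy example, ranking by score alone would pick the adjacent SNPs 1 and 2, whereas the greedy coverage rule picks SNPs 1 and 4, which lie in different parts of the network.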