Algorithms for structural variation discovery using multiple sequence signatures
Author
Söylev, Arda
Advisor
Alkan, Can
Date
2018-09
Publisher
Bilkent University
Language
English
Type
Thesis
Abstract
Genomic variations including single nucleotide polymorphisms (SNPs), small
INDELs and structural variations (SVs) are known to have significant phenotypic
effects on individuals. Among them, SVs, which alter more than 50 nucleotides
of DNA, are the major source of complex genetic diseases such as Crohn's disease,
schizophrenia and autism. Additionally, the total number of nucleotides affected
by SVs is substantially higher than the number affected by SNPs (3.5 Mbp for
SNPs versus 15-20 Mbp for SVs). Today,
we are able to perform whole genome sequencing (WGS) using high-throughput
sequencing (HTS) technology to discover these modifications far
faster, more cheaply and more accurately than before. However, as demonstrated
in the 1000 Genomes Project, HTS technology still has significant limitations.
The major problem is that the short read lengths (<250 bp) produced by current
sequencing platforms, combined with the large amounts of repeats in most genomes,
make it very challenging to unambiguously map reads and accurately characterize
genomic variants. Thus, most of the existing SV discovery tools focus on
detecting relatively simple types of SVs such as insertions, deletions, and short
inversions. However, other types of SVs, including complex ones, are of crucial
importance, and several have been associated with genomic disorders. To better
understand the contribution of these SVs to the human genome, we need new approaches
to accurately discover and genotype such variants. Accurate algorithms are
therefore still needed to fully characterize a broader spectrum of
SVs and thereby improve the calling accuracy of simpler variants.
Here we introduce TARDIS, which harbors novel algorithms to accurately characterize
various types of SVs including deletions, novel sequence insertions, inversions,
transposon insertions, nuclear mitochondrial insertions, tandem duplications
and interspersed segmental duplications in direct or inverted orientations
using short-read whole genome sequencing datasets. Within our framework, we
use multiple sequence signatures, including read pair, read depth and
split read, to capture different kinds of evidence and increase our SV
prediction accuracy. Additionally, we are able to analyze more than one possible
mapping location of each read to overcome the problems associated with the repetitive
nature of genomes. Recently, due to the limitations of short-read sequencing technology,
newer library preparation techniques have emerged, and the 10x Genomics platform is one
of them. This technique can obtain long-range contiguity information and is
regarded as a cost-effective alternative to long-read sequencing. We
extended TARDIS to utilize the Linked-Read information of 10x Genomics
to overcome some of the constraints of short-read sequencing technology.
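As an illustration of the read pair signature mentioned above, the following minimal Python sketch labels a single paired-end alignment by its mate orientation and mapped span. It reflects the textbook interpretation of discordant pairs for a standard forward-reverse library and is not the TARDIS implementation; the `ReadPair` type, the `classify_pair` function and the span thresholds are hypothetical names and values introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """One paired-end alignment (hypothetical, simplified record)."""
    chrom1: str
    pos1: int      # leftmost mapped position of the first mate
    strand1: str   # '+' or '-'
    chrom2: str
    pos2: int      # leftmost mapped position of the second mate
    strand2: str   # '+' or '-'


def classify_pair(pair: ReadPair, min_span: int, max_span: int) -> str:
    """Return a coarse read pair signature label.

    Assumes a forward-reverse library, so a concordant pair has the
    left mate on '+' and the right mate on '-', with a span between
    min_span and max_span. The span here is the distance between the
    two leftmost positions, which is a simplification.
    """
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal"

    # Order the mates by mapped position so orientation can be read left to right.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    span = right_pos - left_pos

    if left_strand == right_strand:
        return "inversion"            # both mates on the same strand
    if left_strand == "-" and right_strand == "+":
        return "tandem_duplication"   # everted (outward-facing) mates
    if span > max_span:
        return "deletion"             # mates map farther apart than expected
    if span < min_span:
        return "insertion"            # mates map closer together than expected
    return "concordant"
```

A caller would typically cluster many pairs that share a label and overlapping coordinates before reporting a variant, since a single discordant pair is weak evidence on its own.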
We evaluated the prediction performance of our algorithms through several
experiments using both simulated and real data sets. In the simulation experiments,
TARDIS achieved 97.67% sensitivity with a false discovery rate of only 1.12%.
For experiments that involve real data, we used two haploid genomes (CHM1
and CHM13) and one diploid human genome (NA12878) from the Illumina Platinum
Genomes set. Comparison of our results with orthogonal PacBio call sets from
the same genomes revealed higher accuracy for TARDIS than for state-of-the-art
methods. Furthermore, we showed a surprisingly low false discovery rate for our
predictions of tandem, direct interspersed and inverted interspersed segmental
duplications on CHM1 (less than 5% for the top 50 predictions). The
algorithms we describe here are the first to predict insertion location and the
various types of new segmental duplications using HTS data.
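For readers unfamiliar with the two metrics reported above, the sketch below shows how sensitivity and false discovery rate are conventionally computed from counts of true positives, false positives and false negatives. The function names and the counts in the usage note are hypothetical and are not taken from the thesis experiments.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real variants that were predicted: TP / (TP + FN)."""
    total_real = true_pos + false_neg
    if total_real == 0:
        raise ValueError("no real variants in the truth set")
    return true_pos / total_real


def false_discovery_rate(true_pos: int, false_pos: int) -> float:
    """Fraction of predictions that are wrong: FP / (TP + FP)."""
    total_called = true_pos + false_pos
    if total_called == 0:
        raise ValueError("no variants were predicted")
    return false_pos / total_called
```

With hypothetical counts of 90 true positives, 10 false negatives and 10 false positives, `sensitivity(90, 10)` gives 0.9 and `false_discovery_rate(90, 10)` gives 0.1.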