Accelerating the understanding of life's code through better algorithms and hardware design
Author
Alser, Mohammed H. K.
Advisor
Alkan, Can
Date
2018-08
Publisher
Bilkent University
Language
English
Type
Thesis
Item Usage Stats
922 views
352 downloads
Abstract
Our understanding of human genomes today is affected by the ability of modern
computing technology to quickly and accurately determine an individual's entire
genome. Over the past decade, high-throughput sequencing (HTS) technologies
have opened the door to remarkable biomedical discoveries through their ability to
generate hundreds of millions to billions of DNA segments per run along with a
substantial reduction in time and cost. However, this
flood of sequencing data
continues to overwhelm the processing capacity of existing algorithms and hardware.
To analyze a patient's genome, each of these segments - called reads - must
be mapped to a reference genome based on the similarity between a read and
"candidate" locations in that reference genome. The similarity measurement,
called alignment, formulated as an approximate string matching problem, is the
computational bottleneck because: (1) it is implemented using quadratic-time
dynamic programming algorithms, and (2) the majority of candidate locations
in the reference genome do not align with a given read due to high dissimilarity.
Calculating the alignment of such incorrect candidate locations consumes an
overwhelming majority of a modern read mapper's execution time. Therefore, it
is crucial to develop a fast and effective filter that can detect incorrect candidate
locations and eliminate them before invoking computationally costly alignment
algorithms.
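To make the cost of alignment concrete, the following is a minimal sketch of the textbook dynamic programming recurrence for Levenshtein (edit) distance. It is not the implementation used by any aligner named in this abstract; the function names `edit_distance` and `aligns` are chosen here for illustration only. The nested loops visit every (read position, reference position) pair, which is the source of the quadratic running time.

```python
def edit_distance(read: str, ref: str) -> int:
    """Levenshtein distance via the classic O(len(read) * len(ref)) recurrence."""
    # prev[j] holds the distance between the read prefix processed so far
    # and the first j characters of the reference segment.
    prev = list(range(len(ref) + 1))
    for i, a in enumerate(read, start=1):
        curr = [i]  # distance from read[:i] to the empty string
        for j, b in enumerate(ref, start=1):
            cost = 0 if a == b else 1
            curr.append(min(
                prev[j] + 1,         # delete a from the read
                curr[j - 1] + 1,     # insert b into the read
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def aligns(read: str, ref: str, threshold: int) -> bool:
    """A candidate location is accepted if its edit distance is within the threshold."""
    return edit_distance(read, ref) <= threshold
```

For a read and a candidate segment of 100 bases each, this sketch fills roughly 10,000 table cells per candidate, which illustrates why rejecting dissimilar candidates before this step can pay off.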
In this thesis, we introduce four new algorithms that function as a pre-alignment
step and aim to filter out most incorrect candidate locations. We
call our algorithms GateKeeper, Slider, MAGNET, and SneakySnake. The first
key idea of our proposed pre-alignment filters is to provide high filtering accuracy
by correctly detecting all similar segments shared between two sequences.
The second key idea is to exploit the massively parallel architecture of modern
FPGAs for accelerating our four proposed filtering algorithms. We also develop
an efficient CPU implementation of the SneakySnake algorithm for commodity
desktops and servers, which are largely available to bioinformaticians without the
hassle of handling hardware complexity. We evaluate the benefits and downsides
of our pre-alignment filtering approach in detail using 12 real datasets across different
read lengths and edit distance thresholds. In our evaluation, we demonstrate
that our hardware pre-alignment filters show two to three orders of magnitude
speedup over their equivalent CPU implementations. We also demonstrate that
integrating our hardware pre-alignment filters with state-of-the-art read aligners
reduces the aligners' execution time by up to 21.5x. Finally, we show that
an efficient CPU implementation of pre-alignment filtering still provides significant
benefits. We show that SneakySnake on average reduces the execution time of
the best-performing CPU-based read aligners, Edlib and Parasail, by up to 43x
and 57.9x, respectively. The key conclusion of this thesis is that developing a fast
and efficient filtering heuristic, together with a better understanding of its accuracy,
leads to a significant reduction in read alignment's execution time
without sacrificing any of the aligner's capabilities. We hope and believe that our
new architectures and algorithms will catalyze their adoption in existing and future
genome analysis pipelines.
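The filter-then-align idea described above can be sketched with a deliberately simple stand-in filter. The code below is not GateKeeper, Slider, MAGNET, or SneakySnake, whose details are not given in this abstract; it uses a base-count lower bound on edit distance, chosen here only because it is easy to verify. Each edit operation can correct at most one surplus base and one missing base, so the larger of the two totals never exceeds the true edit distance, and a candidate rejected by this bound could not have aligned within the threshold. The names `count_lower_bound`, `passes_filter`, and `map_read`, and the `align` callback, are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def count_lower_bound(read: str, ref: str) -> int:
    """Lower bound on edit distance from per-base count differences."""
    a, b = Counter(read), Counter(ref)
    surplus = sum((a - b).values())  # bases the read has in excess
    deficit = sum((b - a).values())  # bases the read is missing
    return max(surplus, deficit)


def passes_filter(read: str, ref: str, threshold: int) -> bool:
    """Keep a candidate only if the lower bound does not already exceed the threshold."""
    return count_lower_bound(read, ref) <= threshold


def map_read(
    read: str,
    candidates: Iterable[str],
    threshold: int,
    align: Callable[[str, str], int],
) -> List[Tuple[str, int]]:
    """Run the costly aligner only on candidates that survive the filter."""
    hits = []
    for ref in candidates:
        if not passes_filter(read, ref, threshold):
            continue  # rejected cheaply, no alignment computed
        distance = align(read, ref)
        if distance <= threshold:
            hits.append((ref, distance))
    return hits
```

Because the bound is loose, this stand-in lets many dissimilar candidates through; the point of the sketch is only the pipeline shape, in which a cheap check runs before the expensive alignment and never discards a true match.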
Keywords
Read Mapping
Approximate String Matching
Read Alignment
Levenshtein Distance
String Algorithms
Edit Distance
Fast Pre-Alignment Filter
Field-Programmable Gate Arrays (FPGA)