Accelerating the understanding of life's code through better algorithms and hardware design

buir.advisor: Alkan, Can
dc.contributor.author: Alser, Mohammed H. K.
dc.date.accessioned: 2018-09-14T08:26:54Z
dc.date.available: 2018-09-14T08:26:54Z
dc.date.copyright: 2018-06
dc.date.issued: 2018-06
dc.date.submitted: 2018-09-13
dc.description: Cataloged from PDF version of article. [en_US]
dc.description: Thesis (Ph.D.): Bilkent University, Department of Computer Engineering, İhsan Doğramacı Bilkent University, 2018. [en_US]
dc.description: Includes bibliographical references (leaves 119-133). [en_US]
dc.description.abstract: Our understanding of human genomes today is affected by the ability of modern computing technology to quickly and accurately determine an individual's entire genome. Over the past decade, high throughput sequencing (HTS) technologies have opened the door to remarkable biomedical discoveries through their ability to generate hundreds of millions to billions of DNA segments per run along with a substantial reduction in time and cost. However, this flood of sequencing data continues to overwhelm the processing capacity of existing algorithms and hardware. To analyze a patient's genome, each of these segments, called reads, must be mapped to a reference genome based on the similarity between a read and "candidate" locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (1) it is implemented using quadratic-time dynamic programming algorithms, and (2) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper's execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms. In this thesis, we introduce four new algorithms that function as a pre-alignment step and aim to filter out most incorrect candidate locations. We call our algorithms GateKeeper, Slider, MAGNET, and SneakySnake. The first key idea of our proposed pre-alignment filters is to provide high filtering accuracy by correctly detecting all similar segments shared between two sequences. The second key idea is to exploit the massively parallel architecture of modern FPGAs for accelerating our four proposed filtering algorithms.
We also develop an efficient CPU implementation of the SneakySnake algorithm for commodity desktops and servers, which are largely available to bioinformaticians without the hassle of handling hardware complexity. We evaluate the benefits and downsides of our pre-alignment filtering approach in detail using 12 real datasets across different read lengths and edit distance thresholds. In our evaluation, we demonstrate that our hardware pre-alignment filters show two to three orders of magnitude speedup over their equivalent CPU implementations. We also demonstrate that integrating our hardware pre-alignment filters with state-of-the-art read aligners reduces the aligners' execution time by up to 21.5x. Finally, we show that an efficient CPU implementation of pre-alignment filtering still provides significant benefits. We show that SneakySnake on average reduces the execution time of the best performing CPU-based read aligners, Edlib and Parasail, by up to 43x and 57.9x, respectively. The key conclusion of this thesis is that developing a fast and efficient filtering heuristic and developing a better understanding of its accuracy together lead to a significant reduction in read alignment's execution time, without sacrificing any of the aligner's capabilities. We hope and believe that our new architectures and algorithms catalyze their adoption in existing and future genome analysis pipelines. [en_US]
dc.description.provenance: Submitted by Betül Özen (ozen@bilkent.edu.tr) on 2018-09-14T08:26:54Z No. of bitstreams: 1 Thesis_Mohammed_Alser_Final.pdf: 10538699 bytes, checksum: 16b0560acee6be5b65a6677e162b0310 (MD5) [en]
dc.description.provenance: Made available in DSpace on 2018-09-14T08:26:54Z (GMT). No. of bitstreams: 1 Thesis_Mohammed_Alser_Final.pdf: 10538699 bytes, checksum: 16b0560acee6be5b65a6677e162b0310 (MD5) Previous issue date: 2018-08 [en]
dc.description.statementofresponsibility: by Mohammed H. K. Alser. [en_US]
dc.format.extent: xxii, 138 leaves : charts ; 30 cm. [en_US]
dc.identifier.itemid: B151551
dc.identifier.uri: http://hdl.handle.net/11693/47878
dc.language.iso: English [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Read Mapping [en_US]
dc.subject: Approximate String Matching [en_US]
dc.subject: Read Alignment [en_US]
dc.subject: Levenshtein Distance [en_US]
dc.subject: String Algorithms [en_US]
dc.subject: Edit Distance [en_US]
dc.subject: Fast Pre-Alignment Filter [en_US]
dc.subject: Field-Programmable Gate Arrays (FPGA) [en_US]
dc.title: Accelerating the understanding of life's code through better algorithms and hardware design [en_US]
dc.title.alternative: Yaşamın kodunu anlamayı daha iyi algoritmalar ve donanım tasarımlarıyla hızlandırmak [en_US]
dc.type: Thesis [en_US]
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Bilkent University
thesis.degree.level: Doctoral
thesis.degree.name: Ph.D. (Doctor of Philosophy)

Files

Original bundle

Name: Thesis_Mohammed_Alser_Final.pdf
Size: 10.05 MB
Format: Adobe Portable Document Format
Description: Full printable version

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission