Massively parallel mapping of next generation sequence reads using GPU
Author
Korkmaz, Mustafa
Advisor
Aykanat, Cevdet
Date
2012
Publisher
Bilkent University
Language
English
Type
Thesis
Item Usage Stats
79 views
19 downloads
Abstract
High-throughput sequencing (HTS) methods have already started to fundamentally
revolutionize genome research through low-cost, high-throughput
genome sequencing. However, the sheer size of the data imposes various
computational challenges. For example, each run of the Illumina HiSeq2000 produces
7-8 billion short reads and over 600 Gb of sequence data
in less than 10 days. For most applications, analysis of HTS data starts
with read mapping, i.e. finding the locations of these short sequence reads in a
reference genome assembly.
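As a rough illustration of what read mapping involves, the sketch below builds a k-mer hash index over a reference and looks up candidate locations for a read. It is a toy example under stated assumptions, not the method of the thesis: the function names `build_kmer_index` and `candidate_locations`, the seed length `k = 4`, and the tiny reference string are all hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_locations(read, index, k):
    """Return sorted candidate start positions of the read in the reference.

    Non-overlapping k-mers (seeds) of the read are looked up in the index;
    each hit implies a possible start position for the whole read.
    """
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)


# Hypothetical toy data, chosen only for illustration.
reference = "ACGTACGTTAGCCGATAGCTAGGCTA"
index = build_kmer_index(reference, 4)
print(candidate_locations("TAGCCGAT", index, 4))
```

In this toy case the lookup yields two candidate positions, 8 and 15, although the read occurs exactly only at position 8. Each candidate still has to be verified by an alignment step, which is where most of the computation goes.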
The similarity between two sequences can be determined by computing their
optimal global alignment using a dynamic programming method called the
Needleman-Wunsch algorithm. The Needleman-Wunsch algorithm is widely used
in hash-based DNA read mapping algorithms because of its guaranteed sensitivity.
However, the quadratic time complexity of this algorithm makes it highly time-consuming
and the main bottleneck in the analysis. In addition to this drawback, the
short length of reads (about 100 base pairs) and the large size of mammalian genomes
(3.1 Gbp for human) worsen the situation by requiring several hundreds to tens
of thousands of Needleman-Wunsch calculations per read. The fastest approach
proposed so far avoids Needleman-Wunsch and maps the data described above in
70 CPU days with lower sensitivity. More sensitive mapping approaches are even
slower. We propose that efficient parallel implementations of string comparison
will dramatically improve the running time of this process. With this motivation,
we propose to develop enhanced algorithms to exploit the parallel architecture of
GPUs.
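To make the quadratic cost concrete, here is a minimal sequential sketch of the Needleman-Wunsch global alignment score. It is not the GPU implementation the thesis proposes. The function name `needleman_wunsch_score` and the scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    Fills a (len(a)+1) x (len(b)+1) dynamic programming table row by row,
    keeping only the previous row in memory. The scoring parameters are
    illustrative assumptions, not values taken from the thesis.
    """
    cols = len(b) + 1
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(cols)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against an empty prefix costs i gaps.
        curr = [i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diagonal, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "ACGT"))
print(needleman_wunsch_score("ACGT", "AGT"))
```

For a read and a reference window of about 100 bases each, the table has roughly 100 x 100 = 10,000 cells per alignment, and this work is repeated for every candidate location of every read. Each cell depends only on its left, upper, and upper-left neighbours, so independent alignments, and the cells along one anti-diagonal of a single table, can in principle be computed in parallel.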