Browsing by Author "Mutlu, Onur"

Open Access
Accelerating genome analysis: a primer on an ongoing journey
(IEEE, 2020) Alser, M.; Bingöl, Zülal; Cali, D. S.; Kim, J.; Ghose, S.; Alkan, Can; Mutlu, Onur
Genome analysis fundamentally starts with a process known as read mapping, where sequenced fragments of an organism's genome are compared against a reference genome. Read mapping is currently a major bottleneck in the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are able to sequence a genome much faster than the computational techniques employed to analyze the genome. We describe the ongoing journey in significantly improving the performance of read mapping. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory). We conclude with the challenges of adopting these hardware-accelerated read mappers.
Open Access
AirLift: a fast and comprehensive technique for remapping alignments between reference genomes
(IEEE, 2024-08-19) Kim, Jeremie S.; Firtina, Can; Cavlak, Meryem Banu; Çalı, Damla Şenol; Hajinazar, Nastaran; Alser, Mohammed; Alkan, Can; Mutlu, Onur
AirLift is the first read remapping tool that enables users to quickly and comprehensively map a read set that had previously been mapped to one reference genome to another, similar reference. Users can then quickly run a downstream analysis of read sets for each new reference release. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces the overall execution time to remap read sets between two reference genome versions by up to 27.4×. We validate our remapping results with GATK and find that AirLift provides high accuracy in identifying ground truth SNP/INDEL variants.
Open Access
Apollo: A sequencing-technology-independent, scalable and accurate assembly polishing algorithm
(Oxford University Press, 2020-03) Fırtına, C.; Kim, J. S.; Alser, M.; Şenol Cali, D.; Çiçek, A. Ercüment; Alkan, Can; Mutlu, Onur
Motivation: Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject’s genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms and (ii) use small chunks of a large genome to use all available readsets and polish large genomes, respectively. Results: We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward–Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly.
Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts.
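As a rough illustration of the decoding step named in the abstract above, the following is a minimal sketch of textbook Viterbi decoding for a generic discrete HMM. It is not Apollo's pHMM implementation: the function name `viterbi`, the dictionary-based model layout, and the toy states and probabilities in the usage note are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def viterbi(
    observations: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most likely hidden-state path for the observations.

    Plain probabilities are multiplied here for readability; for long
    sequences a real implementation would work with log-probabilities
    to avoid numerical underflow.
    """
    if not observations:
        return []

    # best[s]: probability of the most likely path that ends in state s.
    best = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    backpointers: List[Dict[str, str]] = []

    for obs in observations[1:]:
        step: Dict[str, str] = {}
        nxt: Dict[str, float] = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability into s.
            prev = max(states, key=lambda p: best[p] * trans_p[p][s])
            step[s] = prev
            nxt[s] = best[prev] * trans_p[prev][s] * emit_p[s][obs]
        backpointers.append(step)
        best = nxt

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    path.reverse()
    return path
```

With a hypothetical two-state model (states `S0` and `S1`, start probabilities 0.6 and 0.4, and the transition and emission tables used in the test data), the observation sequence `A, C, G` decodes to `S0, S0, S1`.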
Open Access
FastRemap: a tool for quickly remapping reads between genome assemblies
(Oxford University Press, 2022-08-17) Kim, J. S.; Firtina, C.; Cavlak, M. B.; Cali, D. S.; Alkan, Can; Mutlu, Onur
Motivation: A genome read dataset can be quickly and efficiently remapped from one reference to another similar reference (e.g., between two reference versions or two similar species) using a variety of tools, e.g., the commonly used CrossMap tool. With the explosion of available genomic datasets and references, high-performance remapping tools will be even more important for keeping up with the computational demands of genome assembly and analysis. Results: We provide FastRemap, a fast and efficient tool for remapping reads between genome assemblies. FastRemap provides up to a 7.82× speedup (6.47×, on average) and uses as low as 61.7% (80.7%, on average) of the peak memory consumption compared to the state-of-the-art remapping tool, CrossMap.
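To make the idea of remapping between two similar references concrete, here is a minimal sketch of coordinate translation through a list of aligned blocks. It assumes a simplified, hypothetical block representation (`(source_start, target_start, length)` tuples on a single strand and a single sequence) and is not the file format or the algorithm used by FastRemap or CrossMap.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical aligned block: (source_start, target_start, length), 0-based.
Block = Tuple[int, int, int]


def remap_position(blocks: List[Block], pos: int) -> Optional[int]:
    """Translate a source coordinate to the target reference.

    `blocks` must be sorted by source_start and must not overlap.
    Returns None when `pos` falls in a gap that has no counterpart
    in the target reference.
    """
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    src, dst, length = blocks[i]
    if pos >= src + length:
        return None  # pos lies between two aligned blocks
    return dst + (pos - src)


def remap_read(blocks: List[Block], start: int, end: int) -> Optional[Tuple[int, int]]:
    """Remap a half-open read interval [start, end) if both ends translate.

    This sketch only accepts reads whose length is preserved, i.e. reads
    that do not span an insertion or deletion between the references.
    """
    if end <= start:
        return None
    new_start = remap_position(blocks, start)
    new_last = remap_position(blocks, end - 1)
    if new_start is None or new_last is None:
        return None
    if new_last - new_start != (end - 1) - start:
        return None
    return new_start, new_last + 1
```

Reads that fail this kind of direct translation are the ones a real tool would have to handle separately, for example by realigning them.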
Open Access
GateKeeper-GPU: fast and accurate pre-alignment filtering in short read mapping
(IEEE, 2021-06-24) Bingöl, Zülal; Alser, Mohammed; Mutlu, Onur; Öztürk, Özcan; Alkan, Can
We introduce GateKeeper-GPU, a fast and accurate pre-alignment filter that efficiently reduces the need for expensive sequence alignment. GateKeeper-GPU improves the filtering accuracy of GateKeeper, and by exploiting the massive parallelism provided by GPU threads it concurrently examines numerous sequence pairs rapidly. GateKeeper-GPU is available at https://github.com/BilkentCompGen/GateKeeper-GPU. Please refer to the preprint at arXiv:2103.14978 for more information.
Open Access
GateKeeper-GPU: fast and accurate pre-alignment filtering in short read mapping
(IEEE, 2024-05) Bingöl, Zülal; Alser, Mohammed; Mutlu, Onur; Öztürk, Özcan; Alkan, Can
At the last step of short read mapping, the candidate locations of the reads on the reference genome are verified to compute their differences from the corresponding reference segments using sequence alignment algorithms. Calculating the similarities and differences between two sequences is still computationally expensive since approximate string matching techniques traditionally rely on dynamic programming algorithms with quadratic time and space complexity. We introduce GateKeeper-GPU, a fast and accurate pre-alignment filter that efficiently reduces the need for expensive sequence alignment. GateKeeper-GPU provides two main contributions: first, improving the filtering accuracy of GateKeeper (a lightweight pre-alignment filter), and second, exploiting the massive parallelism provided by the large number of GPU threads of modern GPUs to examine numerous sequence pairs rapidly and concurrently. By reducing the work, GateKeeper-GPU accelerates sequence alignment by 2.9× and provides up to a 1.4× speedup in the end-to-end execution time of a comprehensive read mapper (mrFAST). GateKeeper-GPU is available at https://github.com/BilkentCompGen/GateKeeper-GPU.
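The relationship between a cheap pre-alignment filter and the quadratic-time alignment it guards can be sketched as follows. This is only an illustration: the filter below is a simple base-count lower bound chosen for brevity, not the GateKeeper or GateKeeper-GPU algorithm, and all function names are hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic program.

    Time is O(len(a) * len(b)). Only two rows are kept because this
    sketch returns the distance alone; recovering the alignment itself
    would need the full table.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def passes_base_count_filter(read: str, ref: str, max_edits: int) -> bool:
    """Cheap linear-time check that never rejects a truly similar pair.

    One edit changes the per-character counts by at most 2 in total,
    so ceil(L1 difference / 2) is a lower bound on the edit distance.
    """
    ca, cb = Counter(read), Counter(ref)
    l1 = sum(abs(ca[c] - cb[c]) for c in set(ca) | set(cb))
    return (l1 + 1) // 2 <= max_edits


def verify(read: str, ref: str, max_edits: int) -> bool:
    """Run the expensive dynamic program only on pairs the filter accepts."""
    return (passes_base_count_filter(read, ref, max_edits)
            and edit_distance(read, ref) <= max_edits)
```

A filter of this kind may still accept dissimilar pairs (for example, `ACGT` and `TGCA` have identical base counts), which is why the exact alignment step remains as the final check.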
Open Access
GenASM: a high-performance, low-power approximate string matching acceleration framework for genome sequence analysis
(IEEE Computer Society, 2020) Şenol-Çalı, D.; Kalsi, G. S.; Bingöl, Zülal; Fırtına, C.; Subramanian, L.; Kim, J. S.; Ausavarungnirun, R.; Alser, M.; Gomez-Luna, J.; Boroumand, A.; Nori, A.; Scibisz, A.; Subramoney, S.; Alkan, Can; Ghose, S.; Mutlu, Onur
Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. To perform genome sequencing, devices extract small random fragments of an organism's DNA sequence (known as reads). The first step of genome sequence analysis is a computational process known as read mapping. In read mapping, each fragment is matched to its potential location in the reference genome with the goal of identifying the original location of each read in the genome. Unfortunately, rapid genome sequencing is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. A major contributor to this bottleneck is approximate string matching (ASM), which is used at multiple points during the mapping process. ASM enables read mapping to account for sequencing errors and genetic variations in the reads. We propose GenASM, the first ASM acceleration framework for genome sequence analysis. GenASM performs bitvector-based ASM, which can efficiently accelerate multiple steps of genome sequence analysis. We modify the underlying ASM algorithm (Bitap) to significantly increase its parallelism and reduce its memory footprint. Using this modified algorithm, we design the first hardware accelerator for Bitap. Our hardware accelerator consists of specialized systolic-array-based compute units and on-chip SRAMs that are designed to match the rate of computation with memory capacity and bandwidth, resulting in an efficient design whose performance scales linearly as we increase the number of compute units working in parallel. We demonstrate that GenASM provides significant performance and power benefits for three different use cases in genome sequence analysis. First, GenASM accelerates read alignment for both long reads and short reads.
For long reads, GenASM outperforms state-of-the-art software and hardware accelerators by 116× and 3.9×, respectively, while reducing power consumption by 37× and 2.7×. For short reads, GenASM outperforms state-of-the-art software and hardware accelerators by 111× and 1.9×. Second, GenASM accelerates pre-alignment filtering for short reads, with 3.7× the performance of a state-of-the-art pre-alignment filter, while reducing power consumption by 1.7× and significantly improving the filtering accuracy. Third, GenASM accelerates edit distance calculation, with 22-12501× and 9.3-400× speedups over the state-of-the-art software library and FPGA-based accelerator, respectively, while reducing power consumption by 548-582× and 67×. We conclude that GenASM is a flexible, high-performance, and low-power framework, and we briefly discuss four other use cases that can benefit from GenASM.
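For readers unfamiliar with bitvector-based ASM, the sketch below shows the baseline, textbook form of Bitap with up to `k` edits (the Wu-Manber formulation, written in shift-and style). It is not GenASM's modified algorithm or its hardware design; the function name `bitap_search` and its return convention (end positions of matches in the text) are assumptions made for illustration.

```python
from typing import Dict, List


def bitap_search(text: str, pattern: str, k: int) -> List[int]:
    """Return text indices where a match of `pattern` with <= k edits ends.

    R[d] is a bitvector: bit i is set when pattern[:i + 1] matches some
    suffix of the text read so far with at most d edits.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")

    # One bitmask per character: bit i is set where pattern[i] == c.
    masks: Dict[str, int] = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)

    full = (1 << m) - 1
    accept = 1 << (m - 1)
    # With d edits, the first d pattern characters can be deleted up front.
    R = [((1 << d) - 1) & full for d in range(k + 1)]

    hits: List[int] = []
    for pos, c in enumerate(text):
        mask = masks.get(c, 0)
        old_prev = R[0]  # value of R[d - 1] before this text character
        R[0] = ((R[0] << 1) | 1) & mask
        for d in range(1, k + 1):
            old = R[d]
            R[d] = ((((old << 1) | 1) & mask)  # exact match of this character
                    | old_prev                 # extra character in the text
                    | (old_prev << 1)          # substitution
                    | (R[d - 1] << 1)          # pattern character skipped
                    | 1) & full
            old_prev = old
        if R[k] & accept:
            hits.append(pos)
    return hits
```

Each text character costs only a handful of shifts, ANDs, and ORs per error level, and this bit-level regularity is what makes the approach attractive to parallelize.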
Open Access
Shouji: a fast and efficient pre-alignment filter for sequence alignment
(Oxford University Press, 2019) Alser, Mohammed; Hassan, H.; Kumar, A.; Mutlu, Onur; Alkan, Can
The ability to generate massive amounts of sequencing data continues to overwhelm the processing capability of existing algorithms and compute infrastructures. In this work, we explore the use of hardware/software co-design and hardware acceleration to significantly reduce the execution time of short sequence alignment, a crucial step in analyzing sequenced genomes. We introduce Shouji, a highly parallel and accurate pre-alignment filter that remarkably reduces the need for computationally-costly dynamic programming algorithms. The first key idea of our proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all common subsequences shared between two given sequences. The second key idea is to design a hardware accelerator that adopts modern field-programmable gate array (FPGA) architectures to further boost the performance of our algorithm. Shouji significantly improves the accuracy of pre-alignment filtering by up to two orders of magnitude compared to the state-of-the-art pre-alignment filters, GateKeeper and SHD. Our FPGA-based accelerator is up to three orders of magnitude faster than the equivalent CPU implementation of Shouji. Using a single FPGA chip, we benchmark the benefits of integrating Shouji with five state-of-the-art sequence aligners, designed for different computing platforms. The addition of Shouji as a pre-alignment step reduces the execution time of the five state-of-the-art sequence aligners by up to 18.8×. Shouji can be adapted for any bioinformatics pipeline that performs sequence alignment for verification. Unlike most existing methods that aim to accelerate sequence alignment, Shouji does not sacrifice any of the aligner capabilities, as it does not modify or replace the alignment step.
Open Access
TargetCall: eliminating the wasted computation in basecalling via pre-basecalling filtering
(Frontiers Research Foundation, 2024-10-28) Cavlak, Meryem Banu; Singh, Gagandeep; Alser, Mohammed; Firtina, Can; Lindegger, Joel; Sadrosadati, Mohammad; Mansouri Ghiasi, Nika; Alkan, Can; Mutlu, Onur
Basecalling is an essential step in nanopore sequencing analysis where the raw signals of nanopore sequencers are converted into nucleotide sequences, that is, reads. State-of-the-art basecallers use complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, most reads do not match the reference genome of interest (i.e., target reference) and thus are discarded in later steps in the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall’s key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads, and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. Our thorough experimental evaluations show that TargetCall (1) improves the end-to-end basecalling runtime performance of the state-of-the-art basecaller by 3.31× while maintaining high (98.88%) recall in keeping on-target reads, (2) maintains high accuracy in downstream analysis, and (3) achieves better runtime performance, throughput, recall, precision, and generality than prior works.
Open Access
Technology dictates algorithms: recent developments in read alignment
(BioMed Central, 2021-08-26) Alser, Mohammed; Rotman, J.; Deshpande, D.; Taraszka, K.; Shi, H.; Baykal, P. I.; Yang, H. T.; Xue, V.; Knyazev, S.; Singer, B. D.; Balliu, B.; Koslicki, D.; Skums, P.; Zelikovsky, A.; Alkan, Can; Mutlu, Onur; Mangul, S.
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today’s diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.