Browsing by Subject "Read mapping"
Now showing 1 - 8 of 8
Item Open Access
Computational pan-genomics: status, promises and challenges (Oxford University Press, 2018-01-01) The Computational Pan-Genomics Consortium; Alkan, Can
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pangenome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science.
With this review, we aim to increase awareness that a joint approach to computational pangenomics can help address many of the problems currently faced in various domains.

Item Open Access
A cryptocurrency incentivized voluntary grid computing platform for DNA read alignment (2019-09) Özercan, Halil İbrahim
The main computational bottleneck of High Throughput Sequencing (HTS) data analysis is to map the reads to a reference genome, for which clusters are typically used. However, building clusters large enough to handle hundreds of petabytes of data is infeasible. Additionally, the reference genome is periodically updated to fix errors and include newly sequenced insertions; therefore, in many large-scale genome projects the reads are realigned to the new reference. Therefore, we need to explore volunteer grid computing technologies to help ameliorate the need for large clusters. However, since the computational demands of HTS read mapping are substantial, and the turnaround of analysis should be fast, we also need a method to motivate volunteers to dedicate their computational resources. For this purpose, we propose to merge distributed read mapping techniques with the popular blockchain technology. Cryptocurrencies such as Bitcoin calculate a value (called nonce) to ensure new block (i.e., "money") creations are limited and difficult in the system; however, this calculation serves no other practical purpose. Our solution (Coinami) introduces a new cryptocurrency called Halocoin, which rewards scientific work with alternative minting. In Coinami, read alignment problems are published and distributed in a decentralized manner while volunteers are rewarded for their work.
Authorities have two main tasks in our system: 1) inject new problem sets (i.e., "alignment problems") into the system, and 2) check for the validity of the results to prevent counterfeit.

Item Open Access
Distributed stream-processing framework for graph-based sequence alignment (2020-01) Gökkaya, Alim Şükrücan
Optimized sequence alignment pipelines are needed to minimize the time required to complete processing the short-read genomic data. Today many sequence alignment tools exist, yet few of them are capable of directly ingesting the streaming base-call data. The sequencing has to be entirely completed before the mainstream aligners can begin mapping the reads to the reference. The sequencing process can take days to complete. The output then needs to be demultiplexed into individual reads and aligned to the reference, which can take several more hours. The overall time of a genomic analysis can be shortened significantly by progressively computing the alignments while the reads are still being generated. It is important to have genomic analysis done as quickly as possible, especially in life-critical situations. Here we introduce a distributed stream processing framework for aligning short reads to a graph representation of the genome. The massively parallel nature of the genomic sequencing data requires a massively parallel computation architecture. Thus we have designed our pipeline, called R2G2Flow, to align many reads to a de Bruijn graph in parallel. Our aligning method is specialized for the sequencing technologies that are based on base-call cycles, such as those produced by Illumina. The results are made available soon after the final bases from the sequencing devices have been emitted.
R2G2Flow is available at https://github.com/BilkentCompGen/r2g2

Item Open Access
Fast and accurate mapping of complete genomics reads (Academic Press, 2015) Lee, D.; Hormozdiari, F.; Xin, H.; Hach, F.; Mutlu, O.; Alkan, C.
Many recent advances in genomics and the expectations of personalized medicine are made possible thanks to the power of high throughput sequencing (HTS) in sequencing large collections of human genomes. There are tens of different sequencing technologies currently available, and each HTS platform has different strengths and biases. This diversity makes it possible to use different technologies to correct for shortcomings, but it also requires developing different algorithms for each platform due to the differences in data types and error models. The first problem to tackle in analyzing HTS data for resequencing applications is the read mapping stage, where many tools have been developed for the most popular HTS methods, but publicly available and open source aligners are still lacking for the Complete Genomics (CG) platform. Unfortunately, Burrows-Wheeler based methods are not practical for CG data due to the gapped nature of the reads generated by this method. Here we provide a sensitive read mapper (sirFAST) for the CG technology based on the seed-and-extend paradigm that can quickly map CG reads to a reference genome. We evaluate the performance and accuracy of sirFAST using both simulated and publicly available real data sets, showing high precision and recall rates.

Item Open Access
GateKeeper-GPU: accelerated pre-alignment filtering in short read mapping (2020-08) Bingöl, Zülal
Recent advances in high throughput sequencing (HTS) facilitate fast production of short DNA fragments (reads) in large quantities. Although the production is becoming less expensive every day, processing the present data for sequence alignment as a whole procedure is still computationally expensive.
As the last step of alignment, the candidate locations of short reads on the reference genome are verified in accordance with their difference from the corresponding reference segment with the least possible error. In this sense, comparison of reads and reference segments requires approximate string matching techniques, which traditionally rely on dynamic programming algorithms. Performing dynamic programming for each read and reference segment pair makes alignment a computationally costly stage of the mapping process. So, accelerating this stage is expected to improve alignment performance in terms of execution time. Here, we propose GateKeeper-GPU, a fast pre-alignment filter to be performed before verification to get rid of the sequence pairs that exceed a predefined error threshold, reducing the computational load of dynamic programming. We choose GateKeeper as the filtration algorithm, and we improve it and implement it on a GPGPU platform with the CUDA framework to benefit from performing compute-intensive work with millions of highly parallel and independent threads. GateKeeper-GPU can accelerate the verification stage by up to 2.9× and provide up to 1.4× speedup for the overall read alignment procedure when integrated with mrFAST, while producing up to 52× fewer false accept pairs than the original GateKeeper work.

Item Open Access
GateKeeper-GPU: fast and accurate pre-alignment filtering in short read mapping (IEEE, 2021-06-24) Bingöl, Zülal; Alser, Mohammed; Mutlu, Onur; Öztürk, Özcan; Alkan, Can
We introduce GateKeeper-GPU, a fast and accurate pre-alignment filter that efficiently reduces the need for expensive sequence alignment. GateKeeper-GPU improves the filtering accuracy of GateKeeper, and by exploiting the massive parallelism provided by GPU threads it concurrently examines numerous sequence pairs rapidly. GateKeeper-GPU is available at https://github.com/BilkentCompGen/GateKeeper-GPU.
Please refer to the preprint at arXiv:2103.14978 for more information.

Item Open Access
MAGNET: understanding and improving the accuracy of genome pre-alignment filtering (I P S I, 2017) Alser, M.; Mutlu, O.; Alkan, C.
In the era of high throughput DNA sequencing (HTS) technologies, calculating the edit distance (i.e., the minimum number of substitutions, insertions, and deletions between a pair of sequences) for billions of genomic sequences is the computational bottleneck in today’s read mappers. The shifted Hamming distance (SHD) algorithm proposes a fast filtering strategy that can rapidly filter out invalid mappings that have more edits than allowed. However, SHD shows high inaccuracy in its filtering by admitting invalid mappings to be marked as correct ones. This wastes the execution time and imposes a large computational burden. In this work, we comprehensively investigate four sources that lead to the filtering inaccuracy. We propose MAGNET, a new filtering strategy that maintains high accuracy across different edit distance thresholds and data sets. It significantly improves the accuracy of pre-alignment filtering by one to two orders of magnitude. The MATLAB implementations of MAGNET and SHD are open source and available at: https://github.com/BilkentCompGen/MAGNET.

Item Open Access
SeGraM: A universal hardware accelerator for genomic sequence-to-graph and sequence-to-sequence mapping (Association for Computing Machinery, 2020-06-11) Cali, D.Ş; Kanellopoulos, K.; Lindegger, J.; Bingöl, Zülal; Kalsi, G.S.; Zuo, Z.; Fırtına, Can; Cavlak, M.B.; Kim, J.; Ghiasi, N.M.; Singh, G.; Gómez-Luna, J.; Almadhoun Alserr, N.; Alser, M.; Subramoney, S.; Alkan, Can; Ghose, S.; Mutlu, O.
A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping).
Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in a population. Mapping reads to the graph-based reference genome (i.e., sequence-to-graph mapping) results in notable quality improvements in genome analysis. Unfortunately, while sequence-to-sequence mapping is well studied with many available tools and accelerators, sequence-to-graph mapping is a more difficult computational problem, with a much smaller number of practical software tools currently available. We analyze two state-of-the-art sequence-to-graph mapping tools and reveal four key issues. We find that there is a pressing need to have a specialized, high-performance, scalable, and low-cost algorithm/hardware co-design that alleviates bottlenecks in both the seeding and alignment steps of sequence-to-graph mapping. Since sequence-to-sequence mapping can be treated as a special case of sequence-to-graph mapping, we aim to design an accelerator that is efficient for both linear and graph-based read mapping. To this end, we propose SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads. To our knowledge, SeGraM is the first algorithm/hardware co-design for accelerating sequence-to-graph mapping. SeGraM consists of two main components: (1) MinSeed, the first minimizer-based seeding accelerator, which finds the candidate locations in a given genome graph; and (2) BitAlign, the first bitvector-based sequence-to-graph alignment accelerator, which performs alignment between a given read and the subgraph identified by MinSeed. We couple SeGraM with high-bandwidth memory to exploit low latency and highly-parallel memory access, which alleviates the memory bottleneck. 
We demonstrate that SeGraM provides significant improvements for multiple steps of the sequence-to-graph (i.e., S2G) and sequence-to-sequence (i.e., S2S) mapping pipelines. First, SeGraM outperforms state-of-the-art S2G mapping tools by 5.9×/3.9× and 106×/742× for long and short reads, respectively, while reducing power consumption by 4.1×/4.4× and 3.0×/3.2×. Second, BitAlign outperforms a state-of-the-art S2G alignment tool by 41×-539× and three S2S alignment accelerators by 1.2×-4.8×. We conclude that SeGraM is a high-performance and low-cost universal genomics mapping accelerator that efficiently supports both sequence-to-graph and sequence-to-sequence mapping pipelines.
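
Several of the items above rely on seeding, in which a read mapper first finds candidate locations and only then runs an expensive alignment. The sketch below illustrates generic minimizer-based seeding on a linear reference. It is not the MinSeed design or any tool's actual code: the function names, the use of lexicographic order instead of a hash, and the parameters `k` (k-mer length) and `w` (window size in k-mers) are all assumptions made for illustration.

```python
def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return sorted (position, k-mer) pairs selected as window minimizers.

    Assumption: the "smallest" k-mer is chosen lexicographically; practical
    tools usually order k-mers by a hash value instead.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Smallest k-mer in the window; ties go to the leftmost position.
        best = min(window, key=lambda i: (kmers[i], i))
        selected.add((best, kmers[best]))
    return sorted(selected)


def build_index(reference: str, k: int, w: int) -> dict[str, list[int]]:
    """Map each minimizer k-mer of the reference to its positions."""
    index: dict[str, list[int]] = {}
    for pos, kmer in minimizers(reference, k, w):
        index.setdefault(kmer, []).append(pos)
    return index


def candidate_locations(read: str, index: dict[str, list[int]],
                        k: int, w: int) -> list[int]:
    """Return candidate start positions of the read on the reference.

    Each shared minimizer votes for the offset (reference position minus
    read position); offsets that would start before the reference are dropped.
    """
    hits = set()
    for read_pos, kmer in minimizers(read, k, w):
        for ref_pos in index.get(kmer, []):
            if ref_pos >= read_pos:
                hits.add(ref_pos - read_pos)
    return sorted(hits)


if __name__ == "__main__":
    reference = "ACGTACGTTGCA"   # toy reference (hypothetical data)
    read = "ACGTTGCA"            # exact substring starting at position 4
    index = build_index(reference, k=3, w=2)
    print(candidate_locations(read, index, k=3, w=2))
```

In this toy example the true location (position 4) appears among the candidates together with a spurious one (position 0). Such spurious candidates are what a pre-alignment filter or the alignment step would then have to reject.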