Browsing by Subject "Long reads"
Now showing 1 - 2 of 2
- Results Per Page
- Sort Options
Item Unknown Assessment and correction of errors in DNA sequencing technologies(2017-12) Fırtına, CanNext Generation Sequencing technologies differ by several parameters where the choice to use whether short or long read sequencing platforms often leads to trade-offs between accuracy and read length. In this thesis, I first demonstrate the problems in reproducibility in analyses using short reads. Our comprehensive analysis on the reproducibility of computational characterization of genomic variants using high throughput sequencing data shows that repeats might be prone to ambiguous mapping. Short reads are more vulnerable to repeats and, thus, may cause reproducibility problems. Next, I introduce a novel algorithm Hercules, the first machine learning-based long read error correction algorithm. Several studies require long and accurate reads including de novo assembly, fusion and structural variation detection. In such cases researchers often combine both technologies and the more erroneous long reads are corrected using the short reads. Current approaches rely on various graph based alignment techniques and do not take the error profile of the underlying technology into account. Memory- and time- efficient machine learning algorithms that address these shortcomings have the potential to achieve better and more accurate integration of these two technologies. Our algorithm models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. The algorithm learns a posterior transition/ emission probability distribution for each long read and uses this to correct errors in these reads. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2), and human brain cerebellum polyA RNA-seq, we show that Hercules-corrected reads have the highest mapping rate among all competing algorithms and highest accuracy when most of the basepairs of a long read are covered with short reads.Item Unknown Large structural variation discovery using long reads with several degrees of error(2020-12) Ebren, EzgiGenomic structural variations (SVs) are briefly defined as large-scale alterations of DNA content, copy, and organization. Although significant progress has been made since the introduction of high throughput sequencing (HTS) in character-izing SVs, accurate detection of complex SVs and balanced rearrangements still remains elusive due to the sequence complexity at the breakpoints. Until very recently, the difficulty of read mapping in such regions when the reads were short and the high error rates of long read platforms kept the problem challenging. However, with the introduction of the Pacific Biosciences’ High Fidelity (HiFi) sequencing methodology, powerful SV detection and breakpoint resolution be-came possible as a result of its capability to produce highly accurate (> 99%) long reads (10 − 20 kbps). Here, we introduce DALEK, a novel algorithm that aims to use long-read tech-nologies to discover large structural variations with high break-point resolution. DALEK uses split read and read depth signatures from long read data to dis-cover large (≥ 10 kbps) deletions, inversions and segmental duplications. We also develop methods to detect large SVs in existing high-error Oxford Nanopore Technologies data.