dc.contributor.advisor | Alkan, Can | |
dc.contributor.author | Fırtına, Can | |
dc.date.accessioned | 2018-01-10T13:45:51Z | |
dc.date.available | 2018-01-10T13:45:51Z | |
dc.date.copyright | 2017-12 | |
dc.date.issued | 2017-12 | |
dc.date.submitted | 2018-01-10 | |
dc.identifier.uri | http://hdl.handle.net/11693/35728 | |
dc.description | Cataloged from PDF version of article. | en_US |
dc.description | Thesis (M.S.): Bilkent University, Department of Computer Engineering, İhsan Doğramacı Bilkent University, 2017. | en_US |
dc.description | Includes bibliographical references (leaves 35-41). | en_US |
dc.description.abstract | Next-generation sequencing technologies differ in several parameters, and the choice between
short-read and long-read sequencing platforms often involves a trade-off between accuracy
and read length. In this thesis, I first demonstrate reproducibility problems in analyses
using short reads. Our comprehensive analysis of the reproducibility of computational
characterization of genomic variants using high-throughput sequencing data shows that repeats
might be prone to ambiguous mapping. Short reads are more vulnerable to repeats
and, thus, may cause reproducibility problems. Next, I introduce Hercules, the first
machine learning-based long-read error correction algorithm. Several types of studies,
including de novo assembly, fusion detection, and structural variation detection, require long
and accurate reads. In such cases, researchers often combine both technologies, and the more erroneous
long reads are corrected using the short reads. Current approaches rely on various graph-based
alignment techniques and do not take the error profile of the underlying technology
into account. Memory- and time-efficient machine learning algorithms that address these
shortcomings have the potential to achieve better and more accurate integration of these two
technologies. Our algorithm models every long read as a profile Hidden Markov Model with
respect to the underlying platform's error profile. The algorithm learns a posterior
transition/emission probability distribution for each long read and uses it to correct errors in
these reads. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2)
and human brain cerebellum polyA RNA-seq, we show that Hercules-corrected reads have
the highest mapping rate among all competing algorithms and the highest accuracy when most
of the base pairs of a long read are covered by short reads. | en_US |
dc.description.statementofresponsibility | by Can Fırtına. | en_US |
dc.format.extent | xiii, 68 leaves : charts (some color) ; 30 cm | en_US |
dc.language.iso | English | en_US |
dc.rights | info:eu-repo/semantics/openAccess | en_US |
dc.subject | DNA | en_US |
dc.subject | Sequencing | en_US |
dc.subject | Repeats | en_US |
dc.subject | Error correction | en_US |
dc.subject | Long reads | en_US |
dc.subject | Machine learning | en_US |
dc.title | Assessment and correction of errors in DNA sequencing technologies | en_US |
dc.title.alternative | DNA dizilim teknolojilerindeki hatalar üzerine değerlendirme ve hataların düzeltilmesi | en_US |
dc.type | Thesis | en_US |
dc.department | Department of Computer Engineering | en_US |
dc.publisher | Bilkent University | en_US |
dc.description.degree | M.S. | en_US |
dc.identifier.itemid | B157364 | |
dc.embargo.release | 2020-01-09 | |