Characterization of short tandem repeats using local assembly

Demir, Gülfem

Files

GDemir20170320.pdf (970.57 KB)

Date

2017-03

Authors

Demir, Gülfem

Advisor

Alkan, Can

Abstract

Tandem repeats are stretches of DNA in which a pattern occurs as multiple consecutive, adjacent copies. If the repeat unit (pattern) consists of 2 to 6 nucleotides, the repeat is referred to as a short tandem repeat (STR) or microsatellite. Many genetic diseases (such as Huntington's disease and Fragile X syndrome) are linked to STR expansions, and tandem repeats make up 3% of the sequenced human genome, so STR detection is a significant research problem. STR variations have always been a challenge for genome assembly and sequence alignment because of their repetitive nature, sequencing errors, short read lengths, and the high incidence of polymerase slippage at STR regions. Although the information they carry is very valuable, STR variations have not received enough attention to become a permanent step in genome sequence analysis pipelines. After the 1000 Genomes Project, which aimed to establish the most detailed catalogue of human genetic variation, the consortium released only two STR prediction sets, produced by two STR caller tools, lobSTR and RepeatSeq. Many other large research efforts have failed to shed light on STR variations. The main aim of this study is to apply sequence assembly methods to regions that are known, from the reference genome, to contain an STR, and to release a complete pipeline from a sample's reads to STR genotypes. The assembly problem addressed in this thesis can be considered local assembly, that is, the assembly of reads that map to a small part of the genome. We focus on two general assembly approaches that make use of graph data structures: de Bruijn graph (DBG) based methods, which rely on a variant of the k-mer graph, and overlap-layout-consensus (OLC) methods, which are based on an overlap graph. Even though sequence assembly is a well-studied problem, no existing work uses assembly algorithms to characterize STRs.
We demonstrate that using sequence assembly on STR regions increases the true positive rate compared to state-of-the-art tools. We evaluated the performance of three different local assembly methods in three experimental settings, focusing on (i) genotype-based performance, (ii) the impact of coverage, and (iii) pre-processing and the inclusion of flanking regions. All of these experiments supported our initial expectations about using assembly. In addition, we show that OLC-based assembly methods bring much higher sensitivity to STR variant calling than the DBG-based approach. We conclude that, according to our experiments, assembly with OLC is a better way to genotype STRs.
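
The concepts named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis pipeline: the function names, the toy sequence, and the parameter values (k = 4, a minimum overlap of 3) are assumptions chosen only for illustration. It shows counting consecutive copies of a repeat unit, building the edges of a k-mer (de Bruijn style) graph, and computing a suffix-prefix overlap of the kind an overlap graph is built from.

```python
from collections import defaultdict


def longest_tandem_run(seq, unit):
    """Return (start, copies) for the longest run of adjacent copies of `unit`."""
    n = len(unit)
    best_start, best_copies = -1, 0
    i = 0
    while i <= len(seq) - n:
        if seq.startswith(unit, i):
            j = i
            while seq.startswith(unit, j):
                j += n  # advance one whole repeat unit at a time
            copies = (j - i) // n
            if copies > best_copies:
                best_start, best_copies = i, copies
            i = j  # continue scanning after this run
        else:
            i += 1
    return best_start, best_copies


def de_bruijn_edges(reads, k):
    """Count edges (prefix (k-1)-mer -> suffix (k-1)-mer) over all k-mers in the reads."""
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return dict(edges)


def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


# Hypothetical toy sequence: three adjacent copies of the unit CAG with short flanks.
toy = "GGCAGCAGCAGTT"
run = longest_tandem_run(toy, "CAG")
edges = de_bruijn_edges([toy], k=4)
ov = overlap("GGCAGCAG", "CAGCAGTT")
print(run, len(edges), ov)
```

In this toy case the repeat tract collapses into a cycle in the k-mer graph (GCA -> CAG -> AGC -> GCA), so the copy number is only visible through edge counts, whereas the suffix-prefix overlap keeps whole reads intact. This is only an intuition for why the two graph approaches can behave differently on STR regions; the thesis results rest on its own experiments, not on this sketch.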

Keywords

Short tandem repeat, Sequence assembly, Next generation sequencing

Degree Discipline

Computer Engineering

Degree Level

Master's

Degree Name

MS (Master of Science)

Permalink

http://hdl.handle.net/11693/32937

Collections

Graduate School of Engineering and Science

Language

English

Type

Thesis
