Characterization of short tandem repeats using local assembly
Please cite this item using this persistent URLhttp://hdl.handle.net/11693/32937
Tandem repeats are pieces of DNA where a pattern has multiple consecutive copies adjacent to itself. If the repeat unit (pattern) consists of 2 to 6 nucleotides, it can be referred to as a short tandem repeat or a microsatellite. There are many genetic diseases (such as huntington disease and Fragile-X syndrome) linked with STR expansions and because tandem repeats make up 3% of the sequenced human genome, STR detection research is significant. STR variations have always been a challenge for genome assembly and sequence alignment due to their repetitive nature, sequencing errors, short read lengths, and high incidence of polymerase slippage at STR regions. Despite the information they carry being very valuable, STR variations have not gained enough attention to be a permanent step in genome sequence analysis pipelines. After the 1000 Genomes Project, which aimed to establish the most detailed genetic variation catalogue for humans, the consortium released only two STR prediction sets which are identified by two STR caller tools, lobSTR and RepeatSeq. Many other large research efforts have failed to shed light on STR variations. The main aim of this study is to use sequence assembly methods for regions where we know that there is an STR, based on reference genome, and release a complete pipeline from sample's reads to STR genotype. The assembly problem we are dealing with in the scope of this thesis can be considered as local assembly, which is the assembly procedure of reads that maps to a small part of the genome. We will be focusing on two general assembly approaches that make use of graph data structures: de Bruijn graph (DBG) based methods that rely on a variant of kmer graph, overlap-layout-consensus (OLC) methods that are based on an overlap graph. Even though sequence assembly is a well studied problem, there is not any work that uses assembly algorithms to characterize STRs. We demonstrate that using sequence assembly on STR regions increases the true positive rate compared to state-of-art tools. We evaluated the performance of three different local assembly methods on three different experimental settings: focusing on (i) genotype based performance, (ii) coverage impact, and (iii) evaluating pre-processing and including anking regions. All these experiments supported our initial expectations on using assembly. Besides, we show that OLC based assembly methods bring much higher sensitivity to STR variant calling when compared to DBG based approach. This concludes that assembly with OLC is a better way for genotyping STRs according to our experiments.