Browsing by Subject "Genomics."

Item Open Access
Bias correction in finding copy number variation with using read depth-based methods in exome sequencing data (Bilkent University, 2014) Balcı, Fatma
Since its early years, medical research has striven to identify the causes of disorders, with the ultimate goal of establishing therapeutic treatments and finding cures. This aim is now becoming a reality thanks to recent developments in whole genome sequencing (WGS) and whole exome sequencing (WES). Despite the decrease in the cost of sequencing, WGS is still a very costly approach because of the need to evaluate a large number of populations for more concise results. Therefore, sequencing only the protein coding regions (WES) is a more cost-effective alternative. With the WES approach, most of the functionally important variants can be detected. Additionally, single nucleotide polymorphisms (SNPs) that are located within coding regions are the most common causes of Mendelian diseases (i.e. diseases caused by a single mutation). Moreover, WES approaches require less analysis effort than whole genome sequencing approaches, since only 1% of the whole genome is sequenced. Besides these advantages, there are also some shortcomings that need to be addressed, such as biases in GC-content and probe efficiency. Although there are some previous studies on correcting GC-content related issues, there are no studies on correcting the probe efficiency effect. In this thesis, we provide a formal study on the effects of both GC-content and probe efficiency on the distribution of read depth in exome sequencing data. The correction of probe efficiency will make it possible to develop new CNV discovery methods using exome sequencing data.

Item Open Access
Early postzygotic mutations contribute to de novo variation in a healthy monozygotic twin pair (Bilkent University, 2014) Dal, Gülşah Merve
Characterizing the patterns and rate of de novo mutations is crucial for our perception of evolution and the genetic basis of human disease. 
Direct observation of the de novo single nucleotide variation (SNV) rate in healthy individuals revealed a rate in the range of 0.82 to 1.70 × 10^-8 per base pair per generation. However, the developmental timing of the de novo mutations is unknown, and thus the contribution of early post-zygotic mutations to the human de novo SNV rate has remained unknown. In an attempt to estimate the rate of de novo mutations with regard to the developmental timing of mutagenesis, we sequenced the whole genomes of a healthy monozygotic twin pair and their parents with a total of 170-fold coverage. We identified the de novo SNVs by examining the genotypes of each individual for each of the variants in a synchronous manner. Following Sanger sequencing based validation, we conservatively characterized a total of 32 de novo SNVs. Of these, 23 were shared by the twin pair, 8 were specific to twin I, and 1 was specific to twin II. We estimated an overall de novo SNV rate of 1.31 × 10^-8 for twin I and 1.01 × 10^-8 for twin II. The rate of early post-zygotic de novo SNVs was calculated to be 0.34 × 10^-8 and 0.04 × 10^-8 for twin I and twin II, respectively. These data indicate the growing importance of genome mosaicism, which might result from de novo mutations of early post-zygotic origin, in disease pathogenesis.

Item Open Access
Genome scaffolding using pooled clone sequencing (Bilkent University, 2014) Dal, Elif
DNA sequencing technologies hold great promise in generating information that will guide scientists to learn more about how the genome affects human health, organismal evolution, and genetic relationships between individuals of the same species. The process of generating raw genome sequence data has become cheaper and faster, but more error prone. Assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ in their performance and in their final output. 
More importantly, it remains largely unclear how best to assess the quality of assembled genome sequences. In this thesis, we evaluated the accuracies of several genome scaffolding algorithms using two different types of data generated from the genome of the same human individual: i) whole genome shotgun sequencing (WGS), and ii) pooled clone sequencing (PCS). We observed that it is possible to obtain fewer scaffolds with a longer total assembly length if PCS data are used, compared to using only WGS data. However, the current scaffolding algorithms are developed only for WGS, and PCS-aware scaffolding algorithms remain an open problem.

Item Open Access
Integrating biological pathways and genomic profiles with ChiBE 2 (Bilkent University, 2013) Çakır, Merve
Biological pathways store information about the spatial and temporal organization of interactions taking place in an organism. They hold valuable information that can assist the scientific community in understanding the details of a particular mechanism or deciphering the reasons for disruption when the system goes wrong. However, extracting knowledge from these pathways is not trivial, as they can be huge and complicated. Additionally, simple visualization of pathways reveals only limited knowledge, whereas their integration with experimental results can identify distinct and intriguing relationships. Therefore, it is critical to have tools that are specialized in analyzing and understanding biological pathways. ChiBE is one such tool that can visualize, manipulate and analyze pathway data stored in BioPAX format. In preparing the second version of the tool, improvements were made to pathway searches, high-throughput data integration, and database connections. The visual notation has also been updated to follow the visualization standards defined by the SBGN community. Previously defined pathway query algorithms have been adapted to be compatible with the BioPAX model. 
New query types have also been designed to offer a wider range of options. With these queries, ChiBE now offers a variety of ways to decompose pathways and thoroughly analyze complex pathway views. There have also been improvements in the integration of high-throughput experimental results. To offer easy access to expression microarrays, a gateway to the GEO database has been added. The cBio Cancer Genomics Portal is also now reachable within ChiBE in order to obtain information about the genomic status of various cancer cells. After simply asking for the identifier of a particular experiment, ChiBE retrieves the results from the databases and then integrates them with the available pathway view through color codes. Furthermore, a connection to the DAVID database is available, in case users want to annotate a list of genes with respect to the biological terms associated with them. With these new features and improvements, ChiBE 2 has become a comprehensive tool that offers a wide range of analysis options with a genomics-oriented workflow to deepen our understanding of biological pathways.

Item Open Access
Massively parallel mapping of next generation sequence reads using GPU (Bilkent University, 2012) Korkmaz, Mustafa
High-throughput sequencing (HTS) methods have already started to fundamentally revolutionize the area of genome research through low-cost and high-throughput genome sequencing. However, the sheer size of the data imposes various computational challenges. For example, in the Illumina HiSeq2000, each run produces over 7-8 billion short reads and over 600 Gb of sequence data in less than 10 days. For most applications, analysis of HTS data starts with read mapping, i.e. finding the locations of these short sequence reads in a reference genome assembly. The similarity between two sequences can be determined by computing their optimal global alignment using a dynamic programming method called the Needleman-Wunsch algorithm. 
The Needleman-Wunsch algorithm is widely used in hash-based DNA read mapping algorithms because of its guaranteed sensitivity. However, the quadratic time complexity of this algorithm makes it highly time-consuming and the main bottleneck in analysis. In addition to this drawback, the short length of reads (about 100 base pairs) and the large size of mammalian genomes (3.1 Gbp for human) worsen the situation by requiring several hundreds to tens of thousands of Needleman-Wunsch calculations per read. The fastest approach proposed so far avoids Needleman-Wunsch and maps the data described above in 70 CPU days with lower sensitivity. More sensitive mapping approaches are even slower. We propose that efficient parallel implementations of string comparison will dramatically improve the running time of this process. With this motivation, we propose to develop enhanced algorithms to exploit the parallel architecture of GPUs.
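
For readers unfamiliar with the Needleman-Wunsch algorithm mentioned in the last abstract, the following is a minimal CPU-only sketch in Python of the textbook dynamic programming recurrence. It is not the thesis's GPU implementation. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration. The nested loops over both sequences are the source of the quadratic time cost the abstract refers to.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b (illustrative scoring, not from the thesis).

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    # Fill the table: O(n * m) time and space.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "ACT"))
```

Each table cell depends only on its upper, left, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another. That independence is one commonly used route to parallelizing this kind of computation, although the abstract does not state which strategy the thesis adopts.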