Browsing by Subject "Genome analysis"

Open Access
BISER: fast characterization of segmental duplication structure in multiple genome assemblies
(Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021-07-22) Išerić, Hamza; Alkan, Can; Hach, Faraz; Numanagić, Ibrahim; Carbone, Alessandra; El-Kebir, Mohammed
The increasing availability of high-quality genome assemblies raised interest in the characterization of genomic architecture. Major architectural parts, such as common repeats and segmental duplications (SDs), increase genome plasticity that stimulates further evolution by changing the genomic structure. However, optimal computation of SDs through standard local alignment algorithms is impractical due to the size of most genomes. A cross-genome evolutionary analysis of SDs is even harder, as one needs to characterize SDs in multiple genomes and find relations between those SDs and unique segments in other genomes. Thus there is a need for fast and accurate algorithms to characterize SD structure in multiple genome assemblies to better understand the evolutionary forces that shaped the genomes of today. Here we introduce a new tool, BISER, to quickly detect SDs in multiple genomes and identify elementary SDs and core duplicons that drive the formation of such SDs. BISER improves earlier tools by (i) scaling the detection of SDs with low homology (75%) to multiple genomes while introducing further 8-24x speed-ups over the existing tools, and by (ii) characterizing elementary SDs and detecting core duplicons to help trace the evolutionary history of duplications to as far as 90 million years.
Open Access
Differential privacy under dependent tuples—the case of genomic privacy
(Oxford University Press, 2020-03) Almadhoun, Nour; Ayday, Erman; Ulusoy, Özgür
Motivation: The rapid progress in genome sequencing has led to high availability of genomic data. Studying these data can greatly help answer the key questions about disease associations and our evolution. However, due to growing privacy concerns about the sensitive information of participants, accessing key results and data of genomic studies (such as genome-wide association studies) is restricted to only trusted individuals. On the other hand, paving the way to biomedical breakthroughs and discoveries requires granting open access to genomic datasets. Privacy-preserving mechanisms can be a solution for granting wider access to such data while protecting their owners. In particular, there has been growing interest in applying the concept of differential privacy (DP) while sharing summary statistics about genomic data. DP provides a mathematically rigorous approach to prevent the risk of membership inference while sharing statistical information about a dataset. However, DP does not consider the dependence between tuples in the dataset, which may degrade the privacy guarantees offered by the DP. Results: In this work, focusing on genomic datasets, we show this drawback of the DP and we propose techniques to mitigate it. First, using a real-world genomic dataset, we demonstrate the feasibility of an inference attack on differentially private query results by utilizing the correlations between the entries in the dataset. The results show the scale of vulnerability when we have dependent tuples in the dataset. We show that the adversary can infer sensitive genomic data about a user from the differentially private results of a query by exploiting the correlations between the genomes of family members. Second, we propose a mechanism for privacy-preserving sharing of statistics from genomic datasets to attain privacy guarantees while taking into consideration the dependence between tuples. 
By evaluating our mechanism on different genomic datasets, we empirically demonstrate that our proposed mechanism can achieve up to 50% better privacy than traditional DP-based solutions. Availability and implementation: https://github.com/nourmadhoun/Differential-privacy-genomic-inference-attack.
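For orientation, the "traditional DP-based solutions" that the abstract compares against typically release a statistic with Laplace noise. The sketch below is the textbook Laplace mechanism for a count query, not the dependence-aware mechanism the authors propose; the function names, the default sensitivity of 1, and the example count are assumptions made for illustration. It treats every tuple as independent, which is exactly the assumption the abstract says fails for family members.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count under epsilon-DP with the standard Laplace mechanism.

    Assumes tuples are independent, so one individual changes the count
    by at most `sensitivity`. Correlated tuples (e.g. relatives) break
    this assumption, which is the weakness the abstract describes.
    """
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical query: number of participants carrying a given allele.
    print(private_count(true_count=100, epsilon=0.5, rng=random.Random(0)))
```

A smaller `epsilon` gives a larger noise scale and therefore stronger privacy at the cost of accuracy.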
Open Access
Early postzygotic mutations contribute to de novo variation in a healthy monozygotic twin pair
(BMJ Group, 2014) Dal, G. M.; Ergüner, B.; Saǧıroǧlu, M. S.; Yüksel, B.; Onat, O. E.; Alkan, C.; Özçelik, T.
Background: Human de novo single-nucleotide variation (SNV) rate is estimated to range between 0.82–1.70×10⁻⁸ mutations per base per generation. However, contribution of early postzygotic mutations to the overall human de novo SNV rate is unknown. Methods: We performed deep whole-genome sequencing (more than 30-fold coverage per individual) of the whole-blood-derived DNA samples of a healthy monozygotic twin pair and their parents. We examined the genotypes of each individual simultaneously for each of the SNVs and discovered de novo SNVs regarding the timing of mutagenesis. Putative de novo SNVs were validated using Sanger-based capillary sequencing. Results: We conservatively characterised 23 de novo SNVs shared by the twin pair, 8 de novo SNVs specific to twin I and 1 de novo SNV specific to twin II. Based on the number of de novo SNVs validated by Sanger sequencing and the number of callable bases of each twin, we calculated the overall de novo SNV rate of 1.31×10⁻⁸ and 1.01×10⁻⁸ for twin I and twin II, respectively. Of these, rates of the early postzygotic de novo SNVs were estimated to be 0.34×10⁻⁸ for twin I and 0.04×10⁻⁸ for twin II. Conclusions: Early postzygotic mutations constitute a substantial proportion of de novo mutations in humans. Therefore, genome mosaicism resulting from early mitotic events during embryogenesis is common and could substantially contribute to the development of diseases.
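The rates quoted in this abstract can be cross-checked against the reported SNV counts. The sketch below assumes, as an interpretation rather than something the abstract states, that the twin-specific SNVs are the early postzygotic ones and that the postzygotic rate scales the overall rate by the fraction of twin-specific SNVs among all de novo SNVs in that twin. The function name is hypothetical.

```python
def postzygotic_rate(shared, specific, overall_rate):
    """Scale the overall de novo rate by the twin-specific fraction.

    Assumption: rate is proportional to SNV count over the same callable
    bases, so postzygotic_rate = overall_rate * specific / (shared + specific).
    """
    total = shared + specific
    return overall_rate * specific / total


if __name__ == "__main__":
    shared = 23  # de novo SNVs shared by both twins (from the abstract)
    twins = {
        "twin I": {"specific": 8, "overall_rate": 1.31e-8},
        "twin II": {"specific": 1, "overall_rate": 1.01e-8},
    }
    for name, t in twins.items():
        rate = postzygotic_rate(shared, t["specific"], t["overall_rate"])
        print(f"{name}: {rate * 1e8:.2f} x 10^-8 per base per generation")
```

Under that assumption the computed values round to the 0.34×10⁻⁸ and 0.04×10⁻⁸ figures reported for twin I and twin II.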
Open Access
Fast characterization of segmental duplication structure in multiple genome assemblies
(BioMed Central Ltd, 2022-12) Išerić, Hamza; Alkan, Can; Hach, Faraz; Numanagić, Ibrahim
Motivation: The increasing availability of high-quality genome assemblies raised interest in the characterization of genomic architecture. Major architectural elements, such as common repeats and segmental duplications (SDs), increase genome plasticity that stimulates further evolution by changing the genomic structure and inventing new genes. Optimal computation of SDs within a genome requires quadratic-time local alignment algorithms that are impractical due to the size of most genomes. Additionally, to perform evolutionary analysis, one needs to characterize SDs in multiple genomes and find relations between those SDs and unique (non-duplicated) segments in other genomes. A naïve approach consisting of multiple sequence alignment would make the optimal solution to this problem even more impractical. Thus there is a need for fast and accurate algorithms to characterize SD structure in multiple genome assemblies to better understand the evolutionary forces that shaped the genomes of today. Results: Here we introduce a new approach, BISER, to quickly detect SDs in multiple genomes and identify elementary SDs and core duplicons that drive the formation of such SDs. BISER improves earlier tools by (i) scaling the detection of SDs with low homology to multiple genomes while introducing further 7–33× speed-ups over the existing tools, and by (ii) characterizing elementary SDs and detecting core duplicons to help trace the evolutionary history of duplications to as far as 300 million years. Availability and implementation: BISER is implemented in Seq programming language and is publicly available at https://github.com/0xTCG/biser. © 2022, The Author(s).
Open Access
Genetics and epigenetics of liver cancer
(Elsevier, 2013) Özen, Çiğdem; Yıldız, Gökhan; Dağcan, Alper Tunga; Çevik, Dilek; Örs, Ayşegül; Keleş, Umut; Topel, Hande; Öztürk, Mehmet
Hepatocellular carcinoma (HCC) represents a major form of primary liver cancer in adults. Chronic infections with hepatitis B (HBV) and C (HCV) viruses and alcohol abuse are the major factors leading to HCC. This deadly cancer affects more than 500,000 people worldwide and it is quite resistant to conventional chemo- and radiotherapy. Genetic and epigenetic studies on HCC may help to understand better its mechanisms and provide new tools for early diagnosis and therapy. Recent literature on whole genome analysis of HCC indicated a high number of mutated genes in addition to well-known genes such as TP53, CTNNB1, AXIN1 and CDKN2A, but their frequencies are much lower. Apart from CTNNB1 mutations, most of the other mutations appear to result in loss-of-function. Thus, HCC-associated mutations cannot be easily targeted for therapy. Epigenetic aberrations that appear to occur quite frequently may serve as new targets. Global DNA hypomethylation, promoter methylation, aberrant expression of non-coding RNAs and dysregulated expression of other epigenetic regulatory genes such as EZH2 are the best-known epigenetic abnormalities. Future research in this direction may help to identify novel biomarkers and therapeutic targets for HCC.
Open Access
Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions
(Oxford University Press, 2018-04) Cali, D. S.; Kim, J. S.; Ghose, S.; Alkan, Can; Mutlu, O.
Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) the choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. 
We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of bottlenecks we have found, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of the nanopore sequencing technology.
Open Access
An ontology for collaborative construction and analysis of cellular pathways
(Oxford University Press, 2004-02-12) Demir, Emek; Babur, Özgün; Doğrusöz, Uğur; Gürsoy, Atilla; Ayaz, Aslı; Güleşır, Gürcan; Nişancı, Gürkan; Çetin Atalay, Rengül
Motivation: As the scientific curiosity in genome studies shifts toward identification of functions of the genomes on a large scale, data produced about cellular processes at the molecular level has been accumulating at an accelerating rate. In this regard, it is essential to be able to store, integrate, access and analyze this data effectively with the help of software tools. Clearly this requires a strong ontology that is intuitive, comprehensive and uncomplicated. Results: We define an ontology for an intuitive, comprehensive and uncomplicated representation of cellular events. The ontology presented here enables integration of fragmented or incomplete pathway information via collaboration, and supports manipulation of the stored data. In addition, it facilitates concurrent modifications to the data while maintaining its validity and consistency. Furthermore, novel structures for representation of multiple levels of abstraction for pathways and homologies are provided. Lastly, our ontology supports efficient querying of large amounts of data. We have also developed a software tool named pathway analysis tool for integration and knowledge acquisition (PATIKA) providing an integrated, multi-user environment for visualizing and manipulating network of cellular events. PATIKA implements the basics of our ontology. © Oxford University Press 2004; All rights reserved.
Open Access
SeGraM: A universal hardware accelerator for genomic sequence-to-graph and sequence-to-sequence mapping
(Association for Computing Machinery, 2020-06-11) Cali, D.Ş.; Kanellopoulos, K.; Lindegger, J.; Bingöl, Zülal; Kalsi, G.S.; Zuo, Z.; Fırtına, Can; Cavlak, M.B.; Kim, J.; Ghiasi, N.M.; Singh, G.; Gómez-Luna, J.; Almadhoun Alserr, N.; Alser, M.; Subramoney, S.; Alkan, Can; Ghose, S.; Mutlu, O.
A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in a population. Mapping reads to the graph-based reference genome (i.e., sequence-to-graph mapping) results in notable quality improvements in genome analysis. Unfortunately, while sequence-to-sequence mapping is well studied with many available tools and accelerators, sequence-to-graph mapping is a more difficult computational problem, with a much smaller number of practical software tools currently available. We analyze two state-of-the-art sequence-to-graph mapping tools and reveal four key issues. We find that there is a pressing need to have a specialized, high-performance, scalable, and low-cost algorithm/hardware co-design that alleviates bottlenecks in both the seeding and alignment steps of sequence-to-graph mapping. Since sequence-to-sequence mapping can be treated as a special case of sequence-to-graph mapping, we aim to design an accelerator that is efficient for both linear and graph-based read mapping. To this end, we propose SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads. To our knowledge, SeGraM is the first algorithm/hardware co-design for accelerating sequence-to-graph mapping. 
SeGraM consists of two main components: (1) MinSeed, the first minimizer-based seeding accelerator, which finds the candidate locations in a given genome graph; and (2) BitAlign, the first bitvector-based sequence-to-graph alignment accelerator, which performs alignment between a given read and the subgraph identified by MinSeed. We couple SeGraM with high-bandwidth memory to exploit low latency and highly-parallel memory access, which alleviates the memory bottleneck. We demonstrate that SeGraM provides significant improvements for multiple steps of the sequence-to-graph (i.e., S2G) and sequence-to-sequence (i.e., S2S) mapping pipelines. First, SeGraM outperforms state-of-the-art S2G mapping tools by 5.9×/3.9× and 106×/742× for long and short reads, respectively, while reducing power consumption by 4.1×/4.4× and 3.0×/3.2×. Second, BitAlign outperforms a state-of-the-art S2G alignment tool by 41×-539× and three S2S alignment accelerators by 1.2×-4.8×. We conclude that SeGraM is a high-performance and low-cost universal genomics mapping accelerator that efficiently supports both sequence-to-graph and sequence-to-sequence mapping pipelines.
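To make the term "minimizer-based seeding" concrete, here is a minimal software sketch of the general technique on a linear reference. It is not SeGraM's MinSeed hardware design and does not handle genome graphs; the function names, the lexicographic ordering of k-mers, and the toy sequences are assumptions chosen for illustration.

```python
from collections import defaultdict


def minimizers(seq, k, w):
    """Return (position, k-mer) pairs: the smallest k-mer in each
    window of w consecutive k-mers, ties broken by leftmost position."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])
        hit = (start + offset, window[offset])
        # Adjacent windows often share a minimizer; record it once.
        if not picked or picked[-1] != hit:
            picked.append(hit)
    return picked


def build_index(reference, k, w):
    """Map each minimizer k-mer to its positions in the reference."""
    index = defaultdict(list)
    for pos, kmer in minimizers(reference, k, w):
        index[kmer].append(pos)
    return index


def candidate_locations(read, index, k, w):
    """Return sorted candidate start positions of the read in the reference."""
    hits = set()
    for read_pos, kmer in minimizers(read, k, w):
        for ref_pos in index.get(kmer, []):
            hits.add(ref_pos - read_pos)
    return sorted(hits)


if __name__ == "__main__":
    ref_index = build_index("TTACGTACGTGG", k=3, w=2)
    print(candidate_locations("ACGTAC", ref_index, k=3, w=2))
```

Seeding only proposes candidate locations; a separate alignment step (BitAlign in the abstract) is still needed to confirm or reject each one.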