dc.contributor.author: Hach, F. [en_US]
dc.contributor.author: Numanagić, I. [en_US]
dc.contributor.author: Alkan, C. [en_US]
dc.contributor.author: Sahinalp, S.C. [en_US]
dc.date.accessioned: 2016-02-08T09:43:11Z
dc.date.available: 2016-02-08T09:43:11Z
dc.date.issued: 2012 [en_US]
dc.identifier.issn: 13674803 [en_US]
dc.identifier.uri: http://hdl.handle.net/11693/21215
dc.description.abstract: Motivation: High-throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for the computational infrastructure. Data management, storage and analysis have become major logistical obstacles for those adopting the new platforms. The requirement for large investment for this purpose almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information (NCBI), which holds most of the sequence data generated worldwide. Currently, most HTS data are compressed through general-purpose algorithms such as gzip. These algorithms are not designed for compressing data generated by the HTS platforms; for example, they do not take advantage of the specific nature of genomic sequence data, that is, limited alphabet size and high similarity among reads. Fast and efficient compression algorithms designed specifically for HTS data should be able to address some of the issues in data management, storage and communication. Such algorithms would also help with analysis, provided they offer additional capabilities such as random access to any read and indexing for efficient sequence similarity search. Here we present SCALCE, a 'boosting' scheme based on the Locally Consistent Parsing technique, which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome. Results: Our tests indicate that SCALCE can improve the compression rate achieved through gzip by a factor of 4.19 when the goal is to compress the reads alone. In fact, on SCALCE-reordered reads, gzip running time can improve by a factor of 15.06 on a standard PC with a single core and 6 GB memory. Interestingly, even the running time of SCALCE + gzip improves on that of gzip alone by a factor of 2.09. When compared with the recently published BEETL, which aims to sort the (inverted) reads in lexicographic order for improving bzip2, SCALCE + gzip provides up to 2.01 times better compression while improving the running time by a factor of 5.17. SCALCE also provides the option to compress the quality scores as well as the read names in addition to the reads themselves. This is achieved by compressing the quality scores through order-3 Arithmetic Coding (AC) and the read names through gzip, both following the reordering SCALCE provides on the reads. This way, in comparison with gzip compression of the unordered FASTQ files (including reads, read names and quality scores), SCALCE (together with gzip and arithmetic encoding) can improve the compression rate by a factor of up to 3.34 and the running time by a factor of up to 1.26. © The Author 2012. Published by Oxford University Press. All rights reserved. [en_US]
dc.language.iso: English [en_US]
dc.source.title: Bioinformatics [en_US]
dc.relation.isversionof: http://dx.doi.org/10.1093/bioinformatics/bts593 [en_US]
dc.subject: algorithm [en_US]
dc.subject: article [en_US]
dc.subject: biology [en_US]
dc.subject: computer program [en_US]
dc.subject: genetics [en_US]
dc.subject: genome [en_US]
dc.subject: genomics [en_US]
dc.subject: high throughput sequencing [en_US]
dc.subject: human [en_US]
dc.subject: information processing [en_US]
dc.subject: methodology [en_US]
dc.subject: Pseudomonas aeruginosa [en_US]
dc.subject: sequence alignment [en_US]
dc.subject: Algorithms [en_US]
dc.subject: Computational Biology [en_US]
dc.subject: Data Compression [en_US]
dc.subject: Genome [en_US]
dc.subject: Genomics [en_US]
dc.subject: High-Throughput Nucleotide Sequencing [en_US]
dc.subject: Humans [en_US]
dc.subject: Pseudomonas aeruginosa [en_US]
dc.subject: Sequence Alignment [en_US]
dc.subject: Software [en_US]
dc.title: SCALCE: Boosting sequence compression algorithms using locally consistent encoding [en_US]
dc.type: Article [en_US]
dc.department: Department of Computer Engineering
dc.citation.spage: 3051 [en_US]
dc.citation.epage: 3057 [en_US]
dc.citation.volumeNumber: 28 [en_US]
dc.citation.issueNumber: 23 [en_US]
dc.identifier.doi: 10.1093/bioinformatics/bts593 [en_US]
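
The central idea in the abstract, reordering reads so that similar ones sit next to each other before a general-purpose compressor runs, can be illustrated with a small sketch. This is not SCALCE's method: Locally Consistent Parsing is replaced here by a much simpler stand-in (bucketing each read by its lexicographically smallest 12-mer), and the reads are simulated from a random genome rather than taken from a sequencer. The function names `smallest_kmer`, `reorder_reads`, `gzip_size` and `simulate_reads`, and all parameter values, are assumptions made for illustration only.

```python
import gzip
import random


def smallest_kmer(read: str, k: int = 12) -> str:
    """Return the lexicographically smallest k-mer of a read.

    Assumption: this is a simplified stand-in for the 'core' substrings
    that SCALCE derives via Locally Consistent Parsing.
    """
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def reorder_reads(reads, k: int = 12):
    """Place reads that share the same smallest k-mer next to each other."""
    return sorted(reads, key=lambda r: (smallest_kmer(r, k), r))


def gzip_size(reads) -> int:
    """Size in bytes of the gzip-compressed, newline-joined reads."""
    return len(gzip.compress("\n".join(reads).encode("ascii")))


def simulate_reads(genome_len=100_000, read_len=100, n_reads=10_000, seed=0):
    """Sample error-free reads uniformly from a random genome (toy data)."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    starts = (rng.randrange(genome_len - read_len + 1) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]


reads = simulate_reads()
unordered_bytes = gzip_size(reads)
reordered_bytes = gzip_size(reorder_reads(reads))
print(f"gzip on unordered reads: {unordered_bytes} bytes")
print(f"gzip on reordered reads: {reordered_bytes} bytes")
print(f"improvement factor: {unordered_bytes / reordered_bytes:.2f}")
```

Reordering helps because gzip only finds repeats within a 32 KB sliding window; grouping overlapping reads brings their shared substrings inside that window. The reordering changes the order of the reads but not their content, so the original order is lost unless it is stored separately. The factor printed on this toy data says nothing about the figures reported in the abstract.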

