| Journal of computational biology: A journal of computational molecular cell biology | |
| A Two-Level Scheme for Quality Score Compression | |
| Jan Voges^1,2; Muhammed Oğuzhan Külekci^4,5; Jörn Ostermann^2; Ali Fotouhi^3 | |
| [1] Address correspondence to: Jan Voges, Leibniz Universität Hannover, Institut für Informationsverarbeitung, Appelstr. 9A, Hannover 30167, Germany^1; Assoc. Prof. Muhammed Oğuzhan Külekci, Informatics Institute, Istanbul Technical University, Istanbul 34469, Turkey^4; Electronics and Communication Engineering Department, Istanbul Technical University, Istanbul, Turkey^3; Informatics Institute, Istanbul Technical University, Istanbul, Turkey^5; Institut für Informationsverarbeitung, Leibniz Universität Hannover, Hannover, Germany^2 | |
| Keywords: quality score compression; variant calling; genomic data management; lossless data compression; lossy data compression; high-throughput sequencing | |
| DOI : 10.1089/cmb.2018.0065 | |
| Subject classification: Biological sciences (general) | |
| Source: Mary Ann Liebert, Inc. Publishers | |
【 Abstract 】
Previous studies on quality score compression can be classified into two main lines: lossy schemes and lossless schemes. Lossy schemes enable a better management of computational resources. Thus, in practice, and for preliminary analyses, bioinformaticians may prefer to work with a lossy quality score representation. However, the original quality scores might be required for a deeper analysis of the data. Hence, it might be necessary to keep them; in addition to lossy compression this requires lossless compression as well. We developed a space-efficient hierarchical representation of quality scores, QScomp, which allows the users to work with lossy quality scores in routine analysis, without sacrificing the capability of reaching the original quality scores when further investigations are required. Each quality score is represented by a tuple through a novel decomposition. The first and second dimensions of these tuples are separately compressed such that the first-level compression is a lossy scheme. The compressed information of the second dimension allows the users to extract the original quality scores. Experiments on real data reveal that the downstream analysis with the lossy part—spending only 0.49 bits per quality score on average—shows a competitive performance, and that the total space usage with the inclusion of the compressed second dimension is comparable to the performance of competing lossless schemes.
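The two-level idea described above can be illustrated with a minimal sketch. The abstract does not specify how QScomp decomposes a quality score into a tuple, so the fixed-width binning below, the constant `BIN_WIDTH`, and all function names are assumptions chosen only for illustration; they are not the authors' decomposition. The first component is a coarse bin index that alone yields a lossy score, and the second component is the offset that restores the original value.

```python
# Illustrative two-level split of Phred quality scores.
# ASSUMPTION: fixed-width binning stands in for QScomp's actual
# decomposition, which the abstract does not describe.

BIN_WIDTH = 8  # hypothetical bin width, not taken from the paper


def decompose(q: int) -> tuple[int, int]:
    """Split a score into (coarse bin index, offset inside the bin)."""
    return divmod(q, BIN_WIDTH)


def lossy_value(level1: int) -> int:
    """Approximate score from the first level only (bin midpoint)."""
    return level1 * BIN_WIDTH + BIN_WIDTH // 2


def restore(level1: int, level2: int) -> int:
    """Exact score from both levels."""
    return level1 * BIN_WIDTH + level2


def split_read(qualities):
    """Separate a read's scores into two streams, one per level."""
    pairs = [decompose(q) for q in qualities]
    return [a for a, _ in pairs], [b for _, b in pairs]


scores = [2, 37, 40, 40, 12, 25]
level1, level2 = split_read(scores)

# Routine analysis would read only the first stream.
approx = [lossy_value(a) for a in level1]

# Deeper analysis combines both streams to recover the originals.
exact = [restore(a, b) for a, b in zip(level1, level2)]
```

In such a layout the two streams could be handed to separate entropy coders, so a user who needs only the lossy scores never has to decode the second stream.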
【 License 】
Unknown
【 Preview 】
| Files | Size | Format | View |
|---|---|---|---|
| RO201910259503688ZK.pdf | 693KB | PDF | |