BMC Genomics
Predicting the functional repertoire of an organism from unassembled RNA-seq data
Peter Meinicke1, Manuel Landesfeind1
[1] Department of Bioinformatics, Institute for Microbiology and Genetics, Georg-August-University, Goldschmidtstraße 1, 37077 Göttingen, Germany
Keywords: Computational biology; Bioinformatics; Reconstruction of metabolic pathways; Metabolism; RNA-seq; Transcriptomics
DOI: 10.1186/1471-2164-15-1003
Received 2014-09-17, accepted 2014-10-30, published 2014
【 Abstract 】
Background
The annotation of biomolecular functions is an essential step in the analysis of newly sequenced organisms. Usually, the functions are inferred from genes predicted on the genome using homology search techniques. A high-quality genomic sequence is an important prerequisite, but one that is difficult to achieve for certain organisms, such as hybrids or organisms with a large genome. A de novo transcriptome assembly can also be used for functional analysis, but its computational requirements can be demanding. So far, it has been unclear how much of the functional repertoire of an organism can be reliably predicted from unassembled RNA-seq short reads alone.
Results
We conducted a study to investigate to what degree the functional profile of an organism can be reconstructed from unassembled transcriptome data. We simulated the de novo prediction of biomolecular functions for Arabidopsis thaliana using a comprehensive RNA-seq data set. We evaluated the prediction performance of several homology search methods in combination with different evidence measures. To decide on the presence or absence of a particular function under noisy conditions, we propose a statistical mixture model that enables unsupervised estimation of a detection threshold. Our results indicate that biomolecular functions from the KEGG database can be predicted with a high sensitivity of up to 94 percent. In this setting, applying the mixture model for automatic threshold calibration reduced the falsely predicted functions to 4 percent. Furthermore, we found that our statistical approach even outperforms prediction from a de novo transcriptome assembly.
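The idea of calibrating a detection threshold with an unsupervised mixture model can be illustrated with a minimal sketch. The abstract does not state the form of the mixture components, so the two Gaussian components, the log-scaled evidence scores, and the function names `fit_mixture` and `detection_threshold` below are all assumptions made for illustration and should not be read as the authors' model. The sketch fits a "noise" and a "signal" component with the EM algorithm and places the threshold where both components are equally likely.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_mixture(scores, n_iter=200, min_sd=1e-6):
    """Fit a two-component 1-D Gaussian mixture by EM (illustrative assumption).

    Returns (weights, means, sds); index 0 is the low-score ("noise")
    component and index 1 the high-score ("signal") component.
    """
    n = len(scores)
    ordered = sorted(scores)
    half = n // 2
    # Initialise the means from the lower and upper halves of the scores.
    means = [sum(ordered[:half]) / half, sum(ordered[half:]) / (n - half)]
    grand_mean = sum(scores) / n
    grand_sd = math.sqrt(sum((x - grand_mean) ** 2 for x in scores) / n)
    sds = [max(grand_sd, min_sd)] * 2
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability that each score belongs to "signal".
        resp = []
        for x in scores:
            p0 = weights[0] * normal_pdf(x, means[0], sds[0])
            p1 = weights[1] * normal_pdf(x, means[1], sds[1])
            resp.append(p1 / (p0 + p1) if p0 + p1 > 0.0 else 0.5)
        # M-step: re-estimate weights, means and standard deviations.
        for k in (0, 1):
            r = [q if k == 1 else 1.0 - q for q in resp]
            total = max(sum(r), 1e-12)
            weights[k] = total / n
            means[k] = sum(ri * x for ri, x in zip(r, scores)) / total
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, scores)) / total
            sds[k] = max(math.sqrt(var), min_sd)
    return weights, means, sds


def detection_threshold(weights, means, sds, tol=1e-9):
    """Score between the two means where both components are equally likely."""
    def diff(x):
        return (weights[1] * normal_pdf(x, means[1], sds[1])
                - weights[0] * normal_pdf(x, means[0], sds[0]))

    lo, hi = means[0], means[1]
    if diff(lo) >= 0.0 or diff(hi) <= 0.0:
        # No sign change between the means; fall back to the midpoint.
        return 0.5 * (lo + hi)
    # Bisection on the weighted density difference.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if diff(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical log-scaled evidence scores: 300 absent, 700 present functions.
    scores = [random.gauss(0.0, 1.0) for _ in range(300)]
    scores += [random.gauss(6.0, 1.0) for _ in range(700)]
    w, m, s = fit_mixture(scores)
    t = detection_threshold(w, m, s)
    print(f"threshold = {t:.2f}")
    print("functions called present:", sum(x >= t for x in scores))
```

In this sketch a function would be called present when its evidence score reaches the threshold. Because the threshold is estimated from the score distribution itself, no labelled reference annotation is needed for calibration, which is the property the abstract describes as unsupervised.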
Conclusion
The analysis of an organism’s transcriptome can provide a solid basis for the prediction of biomolecular functions. Using RNA-seq short reads directly, the functional profile of an organism can be reconstructed in a computationally efficient way to provide a draft annotation in cases where the classical genome-based approaches cannot be applied.
【 License 】
2014 Landesfeind and Meinicke; licensee BioMed Central Ltd.