Journal article details
BMC Genomics
Improving transcriptome construction in non-model organisms: integrating manual and automated gene definition in Emiliania huxleyi
Shifra Ben-Dor [3], Assaf Vardi [1], Shilo Rosenwasser [1], Ester Feldmesser [2]
[1] Department of Plant Sciences, Weizmann Institute of Science, Rehovot 76100, Israel
[2] Nancy and Stephen Grand Israel National Center for Personalized Medicine, Weizmann Institute of Science, Rehovot 76100, Israel
[3] Department of Biological Services, Weizmann Institute of Science, Rehovot 76100, Israel
Keywords: Emiliania huxleyi; Manual curation; Transcriptome assembly; Non-model organism; RNAseq
DOI  :  10.1186/1471-2164-15-148
Received 2013-08-29, accepted 2014-02-17, published 2014
【 Abstract 】

Background

The advent of Next Generation Sequencing technologies and corresponding bioinformatics tools allows the definition of transcriptomes in non-model organisms. Non-model organisms are of great ecological and biotechnological significance, and consequently the understanding of their unique metabolic pathways is essential. Several methods that integrate de novo assembly with genome-based assembly have been proposed. Yet many open challenges remain in defining genes, particularly where genomes are unavailable or incomplete. Despite the large number of transcriptome assemblies that have been performed, quality control of the transcript building process, particularly at the protein level, is rarely, if ever, performed. To test and improve the quality of automated transcriptome reconstruction, we used manually defined and curated genes, several of them experimentally validated.

Results

Several approaches to transcript construction were used, based on the available data: a draft genome, high-quality RNAseq reads, and ESTs. To maximize the contribution of each data type, we integrated methods including de novo and genome-based assembly, as well as EST clustering. After each step, a set of manually curated genes was used to assess the quality of the transcripts. The interplay between the automated pipeline and the quality control indicated which additional processes were required to improve the transcriptome reconstruction. We discovered that E. huxleyi has a very high percentage of non-canonical splice junctions and relatively high rates of intron retention, which caused unique issues with the currently available tools. While individual tools missed genes and artificially joined overlapping transcripts, combining the results of several tools improved completeness and quality considerably. The final collection, created by integrating several rounds of quality control and improvement, was compared to the manually defined set at both the DNA and protein levels, and showed an improvement of 20% over any of the read-based approaches alone.
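
The share of non-canonical splice junctions mentioned above can be illustrated with a small tally. The following Python is a minimal sketch only, assuming introns are supplied as half-open, forward-strand coordinates on a genomic sequence and that only GT-AG counts as canonical (GC-AG and AT-AC are counted as non-canonical here). The function names and the toy sequence are hypothetical and are not taken from the paper's pipeline.

```python
from collections import Counter

# Assumption: only the GT-AG donor/acceptor pair is treated as canonical.
CANONICAL = {("GT", "AG")}


def classify_intron(genome_seq, start, end):
    """Label the intron at [start, end) on the forward strand."""
    intron = genome_seq[start:end].upper()
    if len(intron) < 4:
        raise ValueError("intron too short to hold donor and acceptor sites")
    donor, acceptor = intron[:2], intron[-2:]
    return "canonical" if (donor, acceptor) in CANONICAL else "non-canonical"


def non_canonical_fraction(genome_seq, introns):
    """Fraction of introns whose donor/acceptor pair is not canonical."""
    counts = Counter(classify_intron(genome_seq, s, e) for s, e in introns)
    total = sum(counts.values())
    return counts["non-canonical"] / total if total else 0.0


# Toy sequence (hypothetical): one GT...AG intron and one GC...AG intron.
toy_genome = "ACGT" + "GTAAAAAG" + "CCCC" + "GCTTTTAG" + "TT"
toy_introns = [(4, 12), (16, 24)]
fraction = non_canonical_fraction(toy_genome, toy_introns)
```

Introns on the reverse strand would need to be reverse-complemented before classification; that step is left out of this sketch.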

Conclusions

To the best of our knowledge, this is the first time that an automated transcript definition has been subjected to quality control using manually defined and curated genes, with the process then improved on the basis of that assessment. We recommend using a set of manually curated genes to troubleshoot transcriptome reconstruction.
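
As a rough illustration of how a curated gene set can serve as a benchmark, the sketch below scores each assembly by the fraction of curated exonic positions it covers and then checks whether pooling two assemblies recovers more genes. It is a simplified, DNA-level sketch under stated assumptions: exons are half-open intervals on a shared coordinate system, transcripts are already assigned to curated gene names, and the 0.9 threshold, gene names, and tool names are all hypothetical. It does not reproduce the paper's procedure or its protein-level comparison.

```python
def covered_fraction(curated_exons, transcript_exons):
    """Fraction of curated exonic positions covered by assembled exons."""
    curated = set()
    for start, end in curated_exons:
        curated.update(range(start, end))
    if not curated:
        return 0.0
    assembled = set()
    for start, end in transcript_exons:
        assembled.update(range(start, end))
    return len(curated & assembled) / len(curated)


def recovered_genes(curated, assembly, threshold=0.9):
    """Names of curated genes covered at or above the threshold."""
    return {
        name
        for name, exons in curated.items()
        if covered_fraction(exons, assembly.get(name, [])) >= threshold
    }


# Hypothetical curated set and two hypothetical assemblies.
curated = {"geneA": [(0, 100), (200, 300)], "geneB": [(0, 50)]}
tool1 = {"geneA": [(0, 100), (200, 300)]}
tool2 = {"geneA": [(0, 100)], "geneB": [(0, 50)]}

# Pool the exons reported by both tools for each curated gene.
combined = {g: tool1.get(g, []) + tool2.get(g, []) for g in curated}

only_tool1 = recovered_genes(curated, tool1)
only_tool2 = recovered_genes(curated, tool2)
both_tools = recovered_genes(curated, combined)
```

In this toy case each tool recovers one gene on its own, while the pooled set recovers both, which mirrors the observation that combining tools improved completeness.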

【 License 】

   
2014 Feldmesser et al.; licensee BioMed Central Ltd.
