BMC Bioinformatics
PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme
Aimin Li [1], Junying Zhang [2], Zhongyin Zhou [3]
[1] School of Computer Science and Engineering, Xi’an University of Technology, Xi’an, PR China
[2] School of Computer Science and Technology, Xidian University, Xi’an, PR China
[3] State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, PR China
Keywords: de novo assembly; de novo sequencing; Prediction; k-mer; lncRNA; RNA-seq
Others: 1085906; DOI: 10.1186/1471-2105-15-311
Received 2013-11-18, accepted 2014-09-01, published 2014
【 Abstract 】
Background
High-throughput transcriptome sequencing (RNA-seq) technology promises to uncover novel protein-coding and non-coding transcripts, in particular long non-coding RNAs (lncRNAs) identified from de novo sequencing data. This requires tools that do not depend on prior gene annotations, genomic sequences, or high-quality sequencing.
Results
We present an alignment-free tool called PLEK (predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme). It uses a computational pipeline based on an improved k-mer scheme and a support vector machine (SVM) algorithm to distinguish lncRNAs from messenger RNAs (mRNAs) in the absence of genomic sequences or annotations. The performance of PLEK was evaluated on well-annotated mRNA and lncRNA transcripts. Ten-fold cross-validation tests on human RefSeq mRNAs and GENCODE lncRNAs indicated that the tool could achieve an accuracy of up to 95.6%. We demonstrated the utility of PLEK on transcripts from other vertebrates using the model built from human datasets; PLEK attained >90% accuracy on most of these datasets. PLEK also performed well on a simulated dataset and on two real de novo assembled transcriptome datasets (sequenced on the PacBio and 454 platforms) with relatively high indel sequencing error rates. In addition, when run single-threaded, PLEK is approximately eightfold faster than a newly developed alignment-free tool, Coding-Non-Coding Index (CNCI), and 244 times faster than the most popular alignment-based tool, Coding Potential Calculator (CPC).
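To make the idea of alignment-free k-mer features concrete, the following is a minimal Python sketch of sliding-window k-mer frequency extraction. It is an illustration only: the abstract does not specify the range of k or the weighting that makes PLEK's scheme "improved", so the function names, the default `k_max=3`, and the plain per-k normalization used here are assumptions, not the authors' implementation.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_vocabulary(k_max):
    """All k-mers over A, C, G, T for k = 1..k_max, in a fixed order."""
    return ["".join(p)
            for k in range(1, k_max + 1)
            for p in product(ALPHABET, repeat=k)]


def kmer_frequencies(seq, k_max=3):
    """Sliding-window k-mer frequencies for k = 1..k_max (hypothetical helper).

    Each occurrence adds 1 / (number of windows of that length), so the
    frequencies for a given k sum to 1 when the sequence contains only
    A, C, G and T. Windows with other symbols (e.g. N) are skipped.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA notation
    freqs = {kmer: 0.0 for kmer in kmer_vocabulary(k_max)}
    for k in range(1, k_max + 1):
        windows = len(seq) - k + 1
        if windows <= 0:
            continue  # sequence shorter than k: leave these k-mers at 0
        for i in range(windows):
            kmer = seq[i:i + k]
            if kmer in freqs:
                freqs[kmer] += 1.0 / windows
    return freqs


def feature_vector(seq, k_max=3):
    """Fixed-length numeric vector, one entry per k-mer in the vocabulary."""
    freqs = kmer_frequencies(seq, k_max)
    return [freqs[kmer] for kmer in kmer_vocabulary(k_max)]


if __name__ == "__main__":
    vec = feature_vector("AUGGCCAUUGUAAUGGGCCGC", k_max=3)
    print(len(vec))  # 4 + 16 + 64 entries
```

Because every transcript maps to a vector of the same length regardless of its own length, vectors like these can be passed to a classifier such as an SVM without aligning the transcript to a genome or a protein database.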
Conclusions
PLEK is an efficient alignment-free computational tool to distinguish lncRNAs from mRNAs in RNA-seq transcriptomes of species lacking reference genomes. PLEK is especially suitable for PacBio or 454 sequencing data and large-scale transcriptome data. Its open-source software can be freely downloaded from https://sourceforge.net/projects/plek/files/.
【 License 】
2014 Li et al.; licensee BioMed Central Ltd.