Journal article details
BMC Bioinformatics
Seq-ing improved gene expression estimates from microarrays using machine learning
Paul K. Korir [3], Paul Geeleher [2], Cathal Seoighe [1]
[1] Institute of Infectious Disease and Molecular Medicine, Anzio Road, Cape Town 7925, South Africa
[2] Section of Hematology/Oncology, Department of Medicine, University of Chicago, Chicago IL-60637, USA
[3] School of Biochemistry and Cell Biology, University College Cork, Western Road, Cork, Ireland
Keywords: Statistical learning; Machine learning; Microarray; RNA-Seq
DOI: 10.1186/s12859-015-0712-z
Received 2015-05-05, accepted 2015-08-19, published 2015
【 Abstract 】

Background

Quantifying gene expression by RNA-Seq has several advantages over microarrays, including a greater dynamic range and gene expression estimates on an absolute rather than a relative scale. Nevertheless, microarrays remain in widespread use, as demonstrated by the ever-growing number of samples deposited in public repositories.

Results

We propose a novel approach to microarray analysis that attains many of the advantages of RNA-Seq. This method, called Machine Learning of Transcript Expression (MaLTE), leverages samples for which both microarray and RNA-Seq data are available, using a Random Forest to learn the relationship between the fluorescence intensity of sets of microarray probes and RNA-Seq transcript expression estimates. We trained MaLTE on data from the Genotype-Tissue Expression (GTEx) project, consisting of Affymetrix gene arrays and RNA-Seq from over 700 samples across a broad range of human tissues.
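The idea of learning a mapping from probe intensities to RNA-Seq estimates can be illustrated with a toy sketch. This is not the MaLTE implementation: it is a small pure-Python random forest regressor trained on simulated data, and every name in it (`simulate_sample`, `grow_tree`, `fit_forest`, the probe response shapes, the noise levels, the 80/20 split) is an assumption made for illustration only.

```python
import random

def avg(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean (regression node impurity)."""
    m = avg(values)
    return sum((v - m) ** 2 for v in values)

def grow_tree(X, y, n_try, rng, depth=0, max_depth=6, min_leaf=3):
    """Grow one regression tree; each split considers n_try random probes."""
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return avg(y)  # leaf: mean expression of the samples that reach it
    best = None  # (impurity, probe index, threshold)
    for j in rng.sample(range(len(X[0])), n_try):
        for thr in sorted({row[j] for row in X})[1:]:
            left = [t for row, t in zip(X, y) if row[j] < thr]
            right = [t for row, t in zip(X, y) if row[j] >= thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, thr)
    if best is None:
        return avg(y)
    _, j, thr = best
    lo = [i for i, row in enumerate(X) if row[j] < thr]
    hi = [i for i, row in enumerate(X) if row[j] >= thr]
    return (j, thr,
            grow_tree([X[i] for i in lo], [y[i] for i in lo],
                      n_try, rng, depth + 1, max_depth, min_leaf),
            grow_tree([X[i] for i in hi], [y[i] for i in hi],
                      n_try, rng, depth + 1, max_depth, min_leaf))

def predict_tree(node, row):
    while isinstance(node, tuple):  # internal nodes are tuples, leaves are floats
        j, thr, left, right = node
        node = left if row[j] < thr else right
    return node

def fit_forest(X, y, n_trees=25, n_try=2, seed=0):
    """Train each tree on a bootstrap resample of the paired samples."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.choices(range(len(y)), k=len(y))
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], n_try, rng))
    return forest

def predict_forest(forest, row):
    return avg([predict_tree(tree, row) for tree in forest])

def simulate_sample(rng):
    """Hypothetical sample: one 'RNA-Seq' value and four probe intensities."""
    expr = rng.uniform(0.0, 10.0)  # stand-in for a log-scale RNA-Seq estimate
    probes = [
        0.8 * expr + rng.gauss(0, 0.3),        # responsive probe
        0.5 * expr + 2.0 + rng.gauss(0, 0.3),  # weaker response, higher background
        min(expr, 6.0) + rng.gauss(0, 0.3),    # probe that saturates
        rng.gauss(5.0, 1.0),                   # uninformative probe
    ]
    return probes, expr

rng = random.Random(1)
samples = [simulate_sample(rng) for _ in range(100)]
train, test = samples[:80], samples[80:]
train_X, train_y = [p for p, _ in train], [e for _, e in train]
test_X, test_y = [p for p, _ in test], [e for _, e in test]

forest = fit_forest(train_X, train_y)
predictions = [predict_forest(forest, row) for row in test_X]

forest_mse = avg([(p - t) ** 2 for p, t in zip(predictions, test_y)])
baseline = avg(train_y)  # naive predictor: always the training mean
baseline_mse = avg([(baseline - t) ** 2 for t in test_y])
print(f"forest MSE: {forest_mse:.3f}  baseline MSE: {baseline_mse:.3f}")
```

In this sketch one forest is fitted for a single gene, with the paired samples playing the role of the training set; held-out samples then receive a predicted expression value from their probe intensities alone.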

Conclusion

This approach can be used to accurately estimate absolute expression levels from microarray data, at both gene and transcript level, which has not previously been possible. This methodology will facilitate re-analysis of archived microarray data and broaden the utility of the vast quantities of data still being generated.

【 License 】

   
2015 Korir et al.
