Human Genomics
A survey of computational tools for downstream analysis of proteomic and other omic datasets
Lawrence E. Hunter [1], L. Elaine Epperson [2], Anis Karimpour-Fard [1]
[1] Department of Pharmacology, University of Colorado School of Medicine, Aurora 80045, CO, USA; Integrated Center for Genes, Environment, and Health, National Jewish Health, Denver 80206, CO, USA
Keywords: Proteomics repository; SVM; PCA; PLS; Random forests; Machine learning; Proteomics
DOI: 10.1186/s40246-015-0050-2
Received 2015-07-28, accepted 2015-10-06, published 2015
【 Abstract 】

Proteomics is an expanding area of research into biological systems, with significance for biomedical and therapeutic applications ranging from understanding the molecular basis of diseases to testing new treatments, studying the toxicity of drugs, and making biotechnological improvements in agriculture. Progress in proteomic technologies and growing interest have resulted in a rapid accumulation of proteomic data, and consequently a great number of tools have become available. In this paper, we review the well-known and ready-to-use tools for classification, clustering and validation, interpretation, and generation of biological information from experimental data. We suggest some rules of thumb for choosing the most suitable learning method for a particular dataset, conclude with pathway and functional analysis, and then provide information about submitting final results to a repository.

【 License 】

   
2015 Karimpour-Fard et al.
