BMC Bioinformatics
Comparison of classification methods that combine clinical data and high-dimensional mass spectrometry data
Caroline Truntzer [2], Elise Mostacci [2], Aline Jeannin [2], Jean-Michel Petit [1], Patrick Ducoroy [2], Hervé Cardot [2]
[1] Service Endocrinologie, Centre Hospitalier Universitaire, Dijon 21000, France
[2] University of Burgundy, Dijon 21000, France
Keywords: Clinical data; Classification methods; Biomarkers; Predictive value; High-dimension
DOI: 10.1186/s12859-014-0385-z
Received 2013-06-10, accepted 2014-11-12, published 2014
【 Abstract 】
Background
The identification of new diagnostic or prognostic biomarkers is one of the main aims of clinical cancer research. Technologies such as mass spectrometry are commonly used in proteomic research. Mass spectrometry signals show the proteomic profiles of the individuals under study at a given time. These profiles record a large number of proteins, far more than the number of individuals. These variables supplement or complete classical clinical variables. The objective of this study is to evaluate and compare the predictive ability of new and existing models combining mass spectrometry data and classical clinical variables. This study was conducted in the context of binary prediction.
Results
Simulated data and a real dataset dedicated to the selection of proteomic markers of steatosis were used to evaluate the methods. The proposed methods address high dimensionality and the selection of predictive markers through penalization methods (Ridge, Lasso), dimension reduction techniques (PLS), and a combination of both strategies through sparse PLS, all in the context of binary class prediction. The methods were compared in terms of mean classification rate and their ability to select the true predictive values. These comparisons were carried out on clinical-only models, mass-spectrometry-only models and combined models.
Conclusions
Models that combine both types of data can be more efficient than models that use only clinical or mass spectrometry data, provided the sample size of the dataset is large enough.
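The three-way comparison described in the Results (clinical-only, mass-spectrometry-only and combined models) can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only a ridge-penalized logistic regression fitted by plain gradient descent, on simulated Gaussian data, and every name and setting in it (`fit_ridge_logistic`, `accuracy`, the sample sizes, the true coefficients, the penalty strength `lam`) is an assumption chosen for illustration. In the combined model the clinical coefficients are left unpenalized and only the mass spectrometry features are shrunk, which is one common way to combine the two data types, not necessarily the one used in the paper.

```python
import numpy as np

def fit_ridge_logistic(X, y, penalty, n_iter=500, lr=0.1):
    """Logistic regression by gradient descent with per-coefficient ridge penalties.

    penalty[j] = 0 leaves coefficient j unpenalized (used here for clinical variables).
    """
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        z = np.clip(X @ w + b, -30, 30)        # guard against overflow in exp
        prob = 1.0 / (1.0 + np.exp(-z))
        w -= lr * (X.T @ (prob - y) / n + penalty * w)
        b -= lr * np.mean(prob - y)            # intercept is never penalized
    return w, b

def accuracy(X, y, w, b):
    """Fraction of observations whose predicted class matches y."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

# Simulated data (assumption): 3 clinical variables, 300 MS features, 200 individuals.
rng = np.random.default_rng(0)
n, n_clin, n_ms = 200, 3, 300
clin = rng.normal(size=(n, n_clin))
ms = rng.normal(size=(n, n_ms))
# Assumed truth: all clinical variables and the first 5 MS features are predictive.
logit = clin @ np.array([1.0, -1.0, 0.5]) + ms[:, :5] @ np.full(5, 0.8)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

# Simple train/test split; the training set has fewer individuals than MS features.
train, test = np.arange(n) < 140, np.arange(n) >= 140
lam = 0.5  # hypothetical ridge strength; in practice tuned by cross-validation

designs = {
    "clinical": (clin, np.zeros(n_clin)),
    "ms": (ms, np.full(n_ms, lam)),
    "combined": (np.hstack([clin, ms]),
                 np.r_[np.zeros(n_clin), np.full(n_ms, lam)]),
}
results = {}
for name, (X, pen) in designs.items():
    w, b = fit_ridge_logistic(X[train], y[train], pen)
    results[name] = accuracy(X[test], y[test], w, b)
print(results)  # test-set classification rate for each of the three models
```

A single split like this gives a noisy estimate; averaging the classification rate over repeated splits or simulated datasets, and varying `n`, would be closer in spirit to the mean classification rate reported in the study.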
【 License 】
2014 Truntzer et al.; licensee BioMed Central Ltd.