With advancing technology comes the need to extract information from increasingly high-dimensional data, while the number of samples is often limited or the samples are drawn from imbalanced populations. This thesis develops strategies for classification and prediction in high-dimensional but poorly sampled problems arising in computational biology and medicine. These strategies are presented in six chapters. In Chapter II, Support Vector Machine (SVM) classifiers are applied to localizing ventricular tachycardia from electrocardiographic data. In Chapters III, IV, V and VII, optimization-driven structured sparsity algorithms are developed. In Chapter VI, a class of uneven margin SVMs is proposed for learning binary classifiers with imbalanced training populations. The major part of this thesis focuses on statistical learning under group structured sparsity constraints for sample-limited high-dimensional problems. Variable selection reduces the dimension to a few important variables that contain most of the information necessary for discriminating between classes or for predicting continuous responses. This can help avoid overfitting, improve the generalizability of the predictors, and make the resulting models easier to interpret. Novel algorithms based on the augmented Lagrangian method and the alternating direction method of multipliers (ADMM) are developed for several statistical learning problems with a group structured sparsity penalty: binary SVMs, applied to 3D cell microscopy data to discover important shape information for characterizing highly deformable cells; multi-class SVMs, applied to gene expression analysis to improve the disease prediction rate and control for irrelevant patient variations; and partial least squares (PLS) regression, with applications in chemometrics, medicine, and agriculture. These applications demonstrate the benefit of sparsity-constrained optimization approaches for high-dimensional problems with limited data.
Statistical Learning for Sample-Limited High-Dimensional Problems with Application to Biomedical Data.