With advances in genomic technologies, it is common to have two high-dimensional datasets, each measuring the same underlying biological phenomenon with a different technique. We consider predicting a continuous outcome Y using X, a set of p markers that best measure the underlying biological process. The same process is also measured by W, which comes from an earlier technology and is correlated with X. We observe (Y, X, W) on a moderately sized sample and (Y, W) on a larger sample, and we use the data on W to improve prediction of Y from X when p is large. Our work is motivated by a dataset containing gene-expression measurements from both quantitative real-time polymerase chain reaction and microarray technologies.

First, we propose a class of targeted ridge (TR) estimators that shrink the regression coefficients of Y on X toward targets derived from the larger dataset, and we present two specific TR estimators. A hybrid estimator combines multiple TR estimators, data-adaptively balancing efficiency and robustness.

Next, we view the problem from a Bayesian perspective. Hyperparameters control the shrinkage of the model parameters, giving flexibility over what to shrink and to what extent. All unknown quantities – the missing X’s in the larger sample, the model parameters, and the shrinkage parameters – are iteratively sampled. Alternatively, we show how Empirical Bayes methods that maximize marginal likelihoods can estimate the shrinkage parameters.

Finally, we consider estimating the tuning parameter of a ridge regression, particularly when the sample size is small relative to the number of predictors. We propose a corrected generalized cross-validation criterion that is not subject to overfitting yet remains asymptotically optimal. We also define a hyperpenalty that shrinks the tuning parameter itself, protecting against over- or underfitting; maximizing the hyperpenalized likelihood can yield smaller prediction error than many common alternatives. Embedding the hyperpenalty into the penalized EM algorithm yields a hyperpenalized EM algorithm, which can be applied to the original missing-data prediction problem. All of the approaches are compared via simulation studies and applied to the motivating gene-expression dataset. This dissertation therefore contributes to the literature on missing-data and measurement-error methods as they relate to prediction in high-dimensional models.
Shrinkage Methods Utilizing Auxiliary Information to Improve High-Dimensional Prediction Models.
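As a concrete illustration of the targeted-ridge idea summarized in the abstract, the sketch below implements a generic target-shrunk ridge estimator. The function and variable names (targeted_ridge, beta_target, lam) and the way the target is formed from the auxiliary W measurements are illustrative assumptions, not the dissertation's specific TR estimators; the sketch only shows the shared closed form, in which ordinary ridge regression corresponds to a zero target.

```python
import numpy as np

def targeted_ridge(X, y, beta_target, lam):
    """Ridge estimate of the regression of y on X, shrunk toward beta_target.

    Minimizes ||y - X @ beta||^2 + lam * ||beta - beta_target||^2, whose
    closed-form solution is (X'X + lam*I)^{-1} (X'y + lam*beta_target).
    Ordinary ridge regression is the special case beta_target = 0.
    """
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)
    b = X.T @ y + lam * beta_target
    return np.linalg.solve(A, b)


# Toy use: (y, X, W) observed on a small sample; a target for the X-coefficients
# is built from the auxiliary measurements W (here simply a ridge fit of y on W,
# standing in for whatever estimate the larger (Y, W) sample would provide).
rng = np.random.default_rng(0)
n, p = 60, 200
X = rng.standard_normal((n, p))
W = X + 0.5 * rng.standard_normal((n, p))   # noisier, prior-technology proxy for X
beta = np.concatenate([rng.standard_normal(10), np.zeros(p - 10)])
y = X @ beta + rng.standard_normal(n)

beta_target = targeted_ridge(W, y, np.zeros(p), lam=50.0)  # ordinary ridge on W
beta_hat = targeted_ridge(X, y, beta_target, lam=10.0)     # shrink toward target
```

In practice, the tuning parameter lam would itself be chosen by a criterion such as the corrected generalized cross-validation or the hyperpenalized likelihood discussed in the abstract; fixed values are used here only to keep the sketch self-contained.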