With advances in genomic technologies, it is common to have two high-dimensional datasets, each measuring the same underlying biological phenomenon with a different technique. We consider predicting a continuous outcome Y using X, a set of p markers that best measure the underlying biological process. The same process is also measured by W, which comes from an earlier technology and is correlated with X. We observe (Y, X, W) on a moderately sized sample and (Y, W) on a larger sample, and we use the data on W to improve prediction of Y from X when p is large. Our work is motivated by a dataset containing gene-expression measurements from both quantitative real-time polymerase chain reaction and microarray technologies.

First, we propose a class of targeted ridge (TR) estimators that shrink the regression coefficients of Y on X toward targets derived from the larger dataset, and we present two specific TR estimators. A hybrid estimator combines multiple TR estimators, data-adaptively balancing efficiency and robustness.

Next, we view the problem from a Bayesian perspective. Hyperparameters control the shrinkage of the model parameters, giving flexibility over what to shrink and to what extent. All unknown quantities – the missing X’s in the larger sample, the model parameters, and the shrinkage parameters – are iteratively sampled. Alternatively, we show how Empirical Bayes methods that maximize marginal likelihoods can estimate the shrinkage parameters.

Finally, we consider estimating the tuning parameter of a ridge regression, particularly when the sample size is small relative to the number of predictors. We propose a corrected generalized cross-validation criterion that is not subject to overfitting yet remains asymptotically optimal. We also define a hyperpenalty that shrinks the tuning parameter itself, protecting against over- or underfitting; maximizing the hyperpenalized likelihood can yield smaller prediction error than many common alternatives. Embedding the hyperpenalty into the penalized EM algorithm yields a hyperpenalized EM algorithm, which can be applied to the original missing-data prediction problem. All of the approaches are compared via simulation studies and applied to the motivating gene-expression dataset. This dissertation therefore contributes to the literature on missing-data and measurement-error methods as they relate to prediction in high-dimensional models.
Shrinkage Methods Utilizing Auxiliary Information to Improve High-Dimensional Prediction Models.
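As a concrete illustration of the targeted-ridge idea summarized in the abstract, the sketch below implements a generic target-shrunk ridge estimator. The function and variable names (targeted_ridge, beta_target, lam) and the way the target is formed from the auxiliary W measurements are illustrative assumptions, not the dissertation's specific TR estimators; the sketch only shows the shared closed form, in which ordinary ridge regression corresponds to a zero target.

```python
import numpy as np

def targeted_ridge(X, y, beta_target, lam):
    """Ridge estimate of the regression of y on X, shrunk toward beta_target.

    Minimizes ||y - X @ beta||^2 + lam * ||beta - beta_target||^2, whose
    closed-form solution is (X'X + lam*I)^{-1} (X'y + lam*beta_target).
    Ordinary ridge regression is the special case beta_target = 0.
    """
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)
    b = X.T @ y + lam * beta_target
    return np.linalg.solve(A, b)


# Toy use: (y, X, W) observed on a small sample; a target for the X-coefficients
# is built from the auxiliary measurements W (here simply a ridge fit of y on W,
# standing in for whatever estimate the larger (Y, W) sample would provide).
rng = np.random.default_rng(0)
n, p = 60, 200
X = rng.standard_normal((n, p))
W = X + 0.5 * rng.standard_normal((n, p))   # noisier, prior-technology proxy for X
beta = np.concatenate([rng.standard_normal(10), np.zeros(p - 10)])
y = X @ beta + rng.standard_normal(n)

beta_target = targeted_ridge(W, y, np.zeros(p), lam=50.0)  # ordinary ridge on W
beta_hat = targeted_ridge(X, y, beta_target, lam=10.0)     # shrink toward target
```

In practice, the tuning parameter lam would itself be chosen by a criterion such as the corrected generalized cross-validation or the hyperpenalized likelihood discussed in the abstract; fixed values are used here only to keep the sketch self-contained.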