When predictors are numerous and observations are few, building a regression model to explain the behavior of a response variable, such as a patient's medical condition, is very challenging. This is the "p ≫ n" variable selection problem, encountered often in modern applied statistics and data mining. Chapter one of this thesis proposes a rigorous procedure that groups predictors into clusters of "highly-correlated" variables, selects a representative from each cluster, and uses a subset of the representatives for regression modeling. The proposed Penalized method based on Representatives (PR) extends the Lasso to p ≫ n data with highly correlated variables, building a sparse, practically interpretable model while maintaining prediction quality. Moreover, we provide the PR-Sequential Grouped Regression (PR-SGR) to make computation of the PR procedure efficient. Simulation studies show that the proposed method outperforms existing methods such as the Lasso/LARS. A real-life example from mental health diagnosis illustrates the applicability of the PR-SGR. In the second part of the thesis, we study the analysis of time-to-event data, called gap data, in which missing time intervals (gaps) may occur before the first observed event time. If a gap occurs before the first observed event, then the first observed event may or may not be the first true event. This incomplete knowledge distinguishes gap data from the well-studied regular interval-censored data. We propose a Non-Parametric Estimate for the Gap data (NPEG) to estimate the survival function of the first true event time, derive its analytic properties, and demonstrate its performance in simulations. We also extend the Imputed Empirical Estimating method (IEE), an existing nonparametric method for gap data with at most one gap, to handle gap data with multiple gaps.
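To make the first part's idea concrete, the following is a minimal sketch of the general "group, pick a representative, then penalize" pattern. It is not the thesis's PR or PR-SGR algorithm, whose details the abstract does not give. The greedy correlation grouping, the 0.9 threshold, the "most central member" rule for choosing a representative, the penalty value, and all function names are assumptions made for illustration, and the Lasso is fitted with a plain coordinate-descent loop on simulated data.

```python
import numpy as np

def group_by_correlation(X, threshold=0.9):
    """Greedy grouping (assumed rule): each unassigned column seeds a cluster
    and absorbs every unassigned column with |correlation| >= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unassigned = list(range(X.shape[1]))
    clusters = []
    while unassigned:
        seed = unassigned[0]
        members = [j for j in unassigned if corr[seed, j] >= threshold]
        clusters.append(members)
        unassigned = [j for j in unassigned if j not in members]
    return clusters

def pick_representatives(X, clusters):
    """Assumed rule: the member with the highest mean |correlation|
    to its own cluster represents that cluster."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    reps = []
    for members in clusters:
        sub = corr[np.ix_(members, members)]
        reps.append(members[int(np.argmax(sub.mean(axis=1)))])
    return reps

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent.
    X and y are assumed to be centered."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]           # remove column j's contribution
            rho = X[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]           # add the updated contribution back
    return beta

# Simulated p > n data: 5 latent factors, each observed through 8 noisy copies.
rng = np.random.default_rng(0)
n, k, copies = 30, 5, 8
Z = rng.standard_normal((n, k))
X = np.repeat(Z, copies, axis=1) + 0.05 * rng.standard_normal((n, k * copies))
y = 3.0 * Z[:, 0] - 2.0 * Z[:, 2] + 0.1 * rng.standard_normal(n)
X = X - X.mean(axis=0)
y = y - y.mean()

clusters = group_by_correlation(X, threshold=0.9)
reps = pick_representatives(X, clusters)
beta = lasso_cd(X[:, reps], y, lam=0.1)
selected = [reps[i] for i in np.flatnonzero(beta)]
print("clusters:", len(clusters), "representatives:", reps, "selected:", selected)
```

Fitting the penalized regression on one column per cluster reduces the 40 simulated predictors to a handful of candidates before any shrinkage is applied, which is what keeps the resulting model small when many predictors carry nearly the same information.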
Penalized method based on representatives and nonparametric analysis of gap data