Advances in data collection and social media have led to more and more network data appearing in diverse areas, such as social sciences, internet, transportation and biology. This thesis develops new principled statistical tools for network analysis, with emphasis on both appealing statistical properties and computational efficiency.

Our first project focuses on building prediction models for network-linked data. Prediction algorithms typically assume the training data are independent samples, but in many modern applications samples come from individuals connected by a network. For example, in adolescent health studies of risk-taking behaviors, information on the subjects' social network is often available and plays an important role through network cohesion, the empirically observed phenomenon of friends behaving similarly. Taking cohesion into account in prediction models should allow us to improve their performance. We propose a network-based penalty on individual node effects to encourage similarity between predictions for linked nodes, and show that incorporating it into prediction leads to improvement over traditional models both theoretically and empirically when network cohesion is present. The penalty can be used with many loss-based prediction methods, such as regression, generalized linear models, and Cox's proportional hazards model. Applications to predicting levels of recreational activity and marijuana usage among teenagers from the AddHealth study, based on both demographic covariates and friendship networks, are discussed in detail. Our approach to taking friendships into account can significantly improve predictions of behavior while providing interpretable estimates of covariate effects.

Resampling, data splitting, and cross-validation are powerful general strategies in statistical inference, but resampling from a network remains a challenging problem. Many statistical models and methods for networks need model selection and tuning parameters, which could be done by cross-validation if we had a good method for splitting network data; however, splitting network nodes into groups requires deleting edges and destroys some of the structure. Here we propose a new network cross-validation strategy based on splitting edges rather than nodes, which avoids losing information and is applicable to a wide range of network models. We provide a theoretical justification for our method in a general setting and demonstrate how it can be used in a number of specific model selection and parameter tuning tasks, with extensive numerical results on simulated networks. We also apply the method to the analysis of a citation network of statisticians and obtain meaningful research communities.

Finally, we consider the problem of community detection on partially observed networks. In practice, network data are often collected through sampling mechanisms, such as survey questionnaires, instead of direct observation. The noise and bias introduced by such sampling mechanisms can obscure the community structure and invalidate the assumptions of standard community detection methods. We propose a model that incorporates neighborhood sampling, in a form reflective of survey designs, into community detection for directed networks, since friendship networks obtained from surveys are naturally directed. We model the edge sampling probabilities as a function of both individual preferences and community parameters, and fit the model by a combination of spectral clustering and the method of moments. The algorithm is computationally efficient and comes with a theoretical guarantee of consistency. We evaluate the proposed model in extensive simulation studies and apply it to a faculty hiring dataset, discovering a meaningful hierarchy of communities among US business schools.
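To make the first project's idea concrete, the sketch below shows one way a network-based penalty on individual node effects could be set up for linear regression. It assumes the penalty takes a graph-Laplacian form, so that the objective is ||y - alpha - X beta||^2 + lam * alpha' L alpha, where alpha holds one effect per node and L is the Laplacian of the observed network. That form, the function names `graph_laplacian` and `fit_cohesion_regression`, the tuning parameter `lam`, and the small `ridge` term added for numerical stability are all assumptions made for illustration; the abstract itself does not specify them.

```python
import numpy as np


def graph_laplacian(adj):
    """Unnormalized Laplacian L = D - A of an undirected network."""
    adj = np.asarray(adj, dtype=float)
    return np.diag(adj.sum(axis=1)) - adj


def fit_cohesion_regression(X, y, adj, lam=1.0, ridge=1e-8):
    """Least squares with per-node effects alpha and covariate effects beta.

    Minimizes ||y - alpha - X beta||^2 + lam * alpha' L alpha
    (plus a tiny ridge on alpha, an assumption added for stability).
    The Laplacian penalty shrinks alpha_i toward alpha_j whenever
    nodes i and j are linked.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    L = graph_laplacian(adj)

    # Stack node effects and covariates into one design matrix [I, X].
    Z = np.hstack([np.eye(n), X])

    # Penalize only the node-effect block; covariate effects are unpenalized.
    penalty = np.zeros((n + p, n + p))
    penalty[:n, :n] = lam * L + ridge * np.eye(n)

    # Solve the penalized normal equations.
    theta = np.linalg.solve(Z.T @ Z + penalty, Z.T @ y)
    return theta[:n], theta[n:]


# Toy example: two triangles of friends, one covariate, no intercept column.
adj = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                adj[i, j] = 1.0

X = np.array([[0.5], [-1.0], [1.5], [2.0], [-0.5], [1.0]])
alpha_true = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
y = alpha_true + 2.0 * X[:, 0]

alpha_hat, beta_hat = fit_cohesion_regression(X, y, adj, lam=10.0)
```

Larger values of `lam` pull the estimated effects of linked nodes closer together, while `beta_hat` keeps its usual reading as a covariate effect. The same penalty term could in principle be added to other losses, such as a generalized linear model likelihood, in place of the squared error used here.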
Statistical Tools for Network Data: Prediction and Resampling