Large-scale data, or big data, is an enormously popular term in the data science and statistics communities. Such datasets are often collected over long periods of time, at hourly and weekly rates, with the help of technological advancements in physical and cloud-based storage. The information stored is useful, especially in biomedicine, insurance, and retail, where patients and customers are crucial to business survival. In this thesis, we develop new statistical methodologies for handling two types of datasets: continuous data and binary data.

Time-varying associations among store products provide important information for capturing changes in consumer shopping behavior. In the first part of this thesis, we propose a longitudinal principal component analysis (LPCA) using a random-effects eigen-decomposition, which utilizes longitudinal information to model the time-varying eigenvalues and eigenvectors of the corresponding covariance matrices. Our method can effectively analyze large marketing data containing sales information for selected consumer products from hundreds of stores over an 11-year period. Finite-sample simulations illustrate that the proposed method leads to more accurate estimation and interpretation than comparable approaches. We show our method's capabilities and provide an interpretation of the eigenvector estimates in an application to IRI marketing data.

In the second part of this thesis, we formulate the LPCA problem for binary data. We propose capturing the associations among the products or variables through odds ratios, where a two-by-two contingency table contains the probabilities representing the joint distribution of two binary products. The eigen-decomposition utilizes longitudinal information to model the time-varying eigenvalues and eigenvectors of the corresponding odds ratio matrices. These odds ratio matrices measure the pairwise associations among the binary products and are more appropriate to use than the Pearson correlation coefficient. Simulation studies and an application to IRI panel data of individual customer purchases illustrate an improvement in visualization and interpretation.
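As a rough illustration of the pairwise odds ratios mentioned above, the following Python sketch builds an odds ratio matrix from a binary data matrix by forming the two-by-two contingency table for every pair of variables. It is not the estimator developed in the thesis: the function name `pairwise_odds_ratio_matrix`, the additive continuity `correction` (0.5 by default, to avoid division by zero in sparse tables), and the choice to leave the diagonal as NaN are all assumptions made for this example.

```python
import numpy as np


def pairwise_odds_ratio_matrix(X, correction=0.5):
    """Return the matrix of sample odds ratios for all pairs of binary columns.

    X          : array of shape (n_samples, n_vars) with entries 0 or 1.
    correction : value added to every cell count (an assumption of this
                 sketch, used to keep the ratio finite when a cell is empty).
    """
    X = np.asarray(X, dtype=float)
    Xc = 1.0 - X  # complement indicators

    # Cell counts of the 2x2 table for every pair (j, k) at once.
    n11 = X.T @ X    # both 1
    n10 = X.T @ Xc   # j = 1, k = 0
    n01 = Xc.T @ X   # j = 0, k = 1
    n00 = Xc.T @ Xc  # both 0

    odds_ratio = ((n11 + correction) * (n00 + correction)) / (
        (n10 + correction) * (n01 + correction)
    )

    # The odds ratio of a variable with itself is not meaningful here,
    # so the diagonal is marked as missing (a choice made for this sketch).
    np.fill_diagonal(odds_ratio, np.nan)
    return odds_ratio


if __name__ == "__main__":
    # Hypothetical toy data: 6 observations of 2 binary products.
    toy = np.array([[1, 1], [1, 1], [0, 0], [0, 0], [1, 0], [0, 1]])
    print(pairwise_odds_ratio_matrix(toy, correction=0.0))
```

In a longitudinal setting, one such matrix could be computed per time period; how those matrices are then decomposed and linked over time is the subject of the thesis and is not reproduced here.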
Longitudinal principal components analysis for binary and continuous data