With the rapid development of advanced sensing technology, rich and complex real-time high-dimensional streaming data are available in many systems, such as manufacturing, wireless communication, biosurveillance, and social systems. As multiple sensors accumulate information over time at a fast rate, it is highly desirable to develop efficient methodologies that can (1) extract informative features, (2) learn the process status and quickly detect possible changes or faults, (3) be implemented and computed quickly online, and (4) remain robust to outliers or model misspecification. Efficient, robust, and scalable schemes and algorithms that enable real-time monitoring of high-dimensional data streams are therefore in high demand. This thesis focuses on statistical modeling to extract informative and robust features, to interpret the characteristics of the system, and to develop efficient and robust monitoring schemes that can be implemented recursively and in parallel to reduce unnecessary transmission costs in data fusion systems. The methodologies developed in the thesis are generic and can be applied to a variety of fields, ranging from manufacturing processes (e.g., forging, stamping, and semiconductor processes), where functional profile data are observed sequentially, to video monitoring (e.g., solar flare detection), where image data are collected for sequential decision making. The thesis starts with theoretical research on change-point detection and robust M-estimation. In Chapter 1, we propose a scalable robust monitoring scheme that can detect a small but systematic change in the system efficiently and in real time in the presence of random transient outliers. We construct a new robust local detection statistic, called the $L_{\alpha}$-CUSUM statistic, that reduces the effect of outliers by using the Box-Cox transformation of the likelihood function. 
Moreover, we propose a new concept, called the false-alarm breakdown point, to measure the robustness of online monitoring schemes, and we characterize the breakdown point of our proposed schemes. In Chapter 2, we develop several families of communication-efficient schemes for monitoring large-scale data streams. We apply shrinkage transformations such as soft-thresholding, hard-thresholding, and order-thresholding to the local monitoring statistics in order to filter out unaffected data streams and save communication costs in data fusion networks. We also analyze the detection delay of the proposed schemes in both the classical low-dimensional regime and the modern high-dimensional regime, and show that, under certain conditions, the schemes are asymptotically optimal while receiving only a small proportion of the data, which reduces transmission costs. In Chapter 3, we investigate two important properties of M-estimators, namely robustness and tractability, in the linear regression setting when the observations are contaminated by arbitrary outliers. By studying the landscape of the empirical risk, we show that, under mild conditions and when the percentage of outliers is small, many M-estimators enjoy both robustness, meaning that the estimator is close to the true underlying parameter, and tractability, meaning that the estimator can be computed efficiently, even if the loss function is non-convex. In Chapter 4, we turn to applied research on nonlinear profile monitoring based on the discrete wavelet transform. We propose a recursive CUSUM procedure that learns the out-of-control parameters adaptively and detects unknown changes efficiently. In Chapter 5, we develop a functional Poisson regression model for the cumulative citation data of papers. The model fits individual papers’ citation characteristics well. 
The proposed model is also used to cluster citation patterns, which has implications for bibliometric studies and research evaluation. Finally, we summarize our original contributions and future research plans in Chapter 6.
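The shrinkage transformations named in the Chapter 2 summary can be illustrated with a minimal sketch. It assumes that each sensor holds a nonnegative local monitoring statistic and that the fusion center sums the transformed values. The function name `fuse`, the threshold `b`, the count `r`, and the sample values are hypothetical and are not taken from the thesis, whose exact definitions may differ.

```python
def fuse(local_stats, rule, b=0.0, r=1):
    """Combine local statistics into one global statistic (illustrative only).

    soft  : each statistic is shrunk by b and floored at zero
    hard  : a statistic is kept unchanged only if it is at least b
    order : only the r largest statistics are kept
    """
    if rule == "soft":
        return sum(max(w - b, 0.0) for w in local_stats)
    if rule == "hard":
        return sum(w for w in local_stats if w >= b)
    if rule == "order":
        return sum(sorted(local_stats, reverse=True)[:r])
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical local statistics from five data streams at one time step.
stats = [0.25, 3.5, 0.0, 1.5, 6.0]
b = 1.0

soft_global = fuse(stats, "soft", b=b)
hard_global = fuse(stats, "hard", b=b)
order_global = fuse(stats, "order", r=2)

# Under the soft and hard rules, a stream below the threshold contributes
# zero, so in this sketch it would not need to send anything.
transmitting = [k for k, w in enumerate(stats) if w >= b]
```

In this toy setting only three of the five streams exceed the threshold, which is the sense in which thresholding can save communication: streams that appear unaffected contribute nothing to the global statistic.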
Robust sparse learning and monitoring of high-dimensional data