Dissertation

【Abstract】

This thesis studies three topics. First, in text classification, one may use Principal Components Analysis (PCA) as a dimension reduction technique or, when there are few topics, even as an unsupervised classification method. We investigate how useful it is for real-life problems. The difficulty is that the spectrum of the covariance matrix is often wrongly estimated, because the ratio of the number of samples to the number of features is not large enough. We show how to reconstruct the spectrum of the ground truth covariance matrix from the spectrum of the estimated covariance matrix for multivariate normal vectors. We then present an algorithm for reconstructing the spectrum in the case of sparse matrices related to text classification. In the second part, we concentrate on schemes of PCA estimators. Consider the problem of finding the least eigenvalue and the corresponding eigenvector of the ground truth covariance matrix; a famous classical estimator for this problem is due to Krasulina. We state Krasulina's convergence proof for the least eigenvalue and the corresponding eigenvector, and then find their convergence rate. In the last part, we consider the application problem, text classification, from the supervised point of view with the traditional Naive-Bayes method. We derive an updated Naive-Bayes method with a new loss function, which gives up the unbiasedness of the traditional Naive-Bayes method but obtains a smaller variance of the estimator.
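The spectrum misestimation mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis: it assumes standard normal data with identity ground truth covariance (so every true eigenvalue equals 1), and the function name `sample_spectrum` and the sizes used are hypothetical choices for illustration only.

```python
import numpy as np

def sample_spectrum(p, n, seed=0):
    """Eigenvalues of the sample covariance of n draws from N(0, I_p).

    Assumption for illustration: the true covariance is the identity,
    so every true eigenvalue is exactly 1.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))   # n samples, p features
    S = X.T @ X / n                   # sample covariance (mean known to be 0)
    return np.linalg.eigvalsh(S)      # eigenvalues in ascending order

p = 200
for n in (400, 20000):
    ev = sample_spectrum(p, n)
    print(f"n/p = {n // p:3d}: min = {ev[0]:.3f}, max = {ev[-1]:.3f}")
```

With a small ratio of samples to features, the estimated eigenvalues spread far away from 1 in both directions, while with a large ratio they concentrate near the true value. Recovering the true spectrum from the spread-out one is the kind of reconstruction problem the first part of the thesis addresses.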

【Preview】

Attachments
Files	Size	Format	View
Text-classification methods and the mathematical theory of Principal Components	1068KB	PDF	download


Text-classification methods and the mathematical theory of Principal Components
Chen, Jiangning ; Matzinger, Heinrich ; Lounici, Karim ; Popescu, Ionel ; Huo, Xiaoming ; Bonetto, Federico
University: Georgia Institute of Technology
Department: Mathematics
Keywords: Text-classification; NLP; PCA; Online PCA; Incremental scheme; Naive Bayes; Partial labeling; KL divergence
Others : https://smartech.gatech.edu/bitstream/1853/61686/1/CHEN-DISSERTATION-2019.pdf
United States | English
Source: SMARTech Repository


	Document metrics
	Downloads: 31	Views: 32
