Thesis details
Text-classification methods and the mathematical theory of Principal Components
Keywords: Text-classification; NLP; PCA; Online PCA; Incremental scheme; Naive Bayes; Partial labeling; KL divergence
Chen, Jiangning; Matzinger, Heinrich; Lounici, Karim; Popescu, Ionel; Huo, Xiaoming; Bonetto, Federico; Matzinger, Heinrich
University: Georgia Institute of Technology
Department: Mathematics
Others: https://smartech.gatech.edu/bitstream/1853/61686/1/CHEN-DISSERTATION-2019.pdf
United States | English
Source: SMARTech Repository
【 Abstract 】

This thesis studies three topics. First, in text classification, Principal Component Analysis (PCA) can be used as a dimension-reduction technique or, when there are only a few topics, even as an unsupervised classification method. We investigate how useful it is for real-life problems. The difficulty is that the spectrum of the covariance matrix is often estimated incorrectly because the ratio of the number of samples to the feature-space dimension is not large enough. We show how to reconstruct the spectrum of the ground-truth covariance matrix from the spectrum of the estimated covariance matrix for multivariate normal vectors. We then present an algorithm for reconstructing the spectrum in the case of sparse matrices arising in text classification. In the second part, we concentrate on schemes for PCA estimators. For the problem of finding the least eigenvalue and the corresponding eigenvector of the ground-truth covariance matrix, a famous classical estimator is due to Krasulina. We state Krasulina's convergence proof for the least eigenvalue and the corresponding eigenvector, and then derive their convergence rate. In the last part, we consider the applied problem of text classification from the supervised point of view, using the traditional Naive Bayes method. We propose an updated Naive Bayes method with a new loss function, which gives up the unbiasedness of the traditional Naive Bayes estimator but achieves a smaller variance.
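
As an illustration of the second topic, here is a minimal sketch of a Krasulina-type stochastic-approximation update for the least eigenpair of a covariance matrix. It is not taken from the thesis: the function names (`krasulina_least`, `demo`), the step size `c / (n + n0)`, the random initialisation, and the diagonal test covariance are all assumptions made for the example, and the minus sign in the update is the convention assumed here for targeting the least (rather than the largest) eigenvalue.

```python
import math
import random


def krasulina_least(samples, dim, c=2.0, n0=20.0, seed=0):
    """Streaming estimate of a unit eigenvector for the least eigenvalue of E[x x^T].

    `c`, `n0` and `seed` are illustrative choices, not values from the thesis.
    """
    rng = random.Random(seed)
    # Random starting vector (assumed initialisation).
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for n, x in enumerate(samples, start=1):
        gamma = c / (n + n0)                          # decaying step size
        xw = sum(xi * wi for xi, wi in zip(x, w))     # inner product x . w
        ww = sum(wi * wi for wi in w)                 # squared norm |w|^2
        # One-sample estimate of A w - (w^T A w / |w|^2) w, with A replaced by x x^T.
        direction = [xi * xw - (xw * xw / ww) * wi for xi, wi in zip(x, w)]
        # Minus sign: step towards a smaller Rayleigh quotient.
        w = [wi - gamma * di for wi, di in zip(w, direction)]
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


def demo(num_samples=20000, seed=1):
    """Run the sketch on Gaussian samples with covariance diag(0.1, 1.0, 2.0)."""
    eigenvalues = [0.1, 1.0, 2.0]
    rng = random.Random(seed)
    stream = (
        [math.sqrt(lam) * rng.gauss(0.0, 1.0) for lam in eigenvalues]
        for _ in range(num_samples)
    )
    v = krasulina_least(stream, dim=len(eigenvalues))
    # Rayleigh quotient of the estimate under the true (diagonal) covariance.
    rayleigh = sum(lam * vi * vi for lam, vi in zip(eigenvalues, v))
    return v, rayleigh
```

Calling `demo()` returns the estimated unit vector together with its Rayleigh quotient under the true covariance. In this toy setting the estimate should align with the first coordinate axis and the Rayleigh quotient should approach the least eigenvalue 0.1. Each update touches one sample only, so the full covariance matrix is never formed.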

【 Preview 】
Attachments
Files | Size | Format | View
Text-classification methods and the mathematical theory of Principal Components | 1068KB | PDF | download
Document metrics
Downloads: 19; Views: 31