Thesis

【Abstract】

In this thesis, we present a new clustering algorithm we call Significance Feature Clustering (SFC), which is designed to cluster text documents. Its central premise is the mapping of raw frequency count vectors to discrete-valued significance vectors which contain values of -1, 0, or 1. These values represent whether a word is significantly negative, neutral, or significantly positive, respectively. Initially, standard tf-idf vectors are computed from raw frequency vectors, then these tf-idf vectors are transformed to significance vectors using a parameter alpha, where alpha controls the mapping to -1, 0, or 1 for each vector entry. SFC clusters agglomeratively, with each document's significance vector representing a cluster of size one containing just the document, and iteratively merges the two clusters that exhibit the most similar average using cosine similarity. We show that by using a good alpha value, the significance vectors produced by SFC provide an accurate indication of which words are significant to which documents, as well as the type of significance, and therefore correspondingly yield a good clustering in terms of a well-known definition of clustering quality. We further demonstrate that a user need not manually select an alpha, as we develop a new definition of clustering quality that is highly correlated with text clustering quality. Our metric extends the family of metrics known as internal similarity, so that it can be applied to a tree of clusters rather than a set, but it also factors in an aspect of recall that was absent from previous internal similarity metrics. Using this new definition of internal similarity, which we call maximum tree internal similarity, we show that a close-to-optimal text clustering may be picked from any number of clusterings created by different alpha values. The automatically selected clusterings have qualities close to those of a well-known and powerful hierarchical clustering algorithm.
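The pipeline outlined in the abstract (raw counts, then tf-idf, then significance vectors, then agglomerative merging by cosine similarity of cluster averages) can be illustrated with a minimal sketch. The abstract does not state how alpha maps a tf-idf entry to -1, 0, or 1, so the thresholding rule used here (an entry is significant when it lies more than alpha standard deviations from that term's mean across documents) is purely an assumption for illustration, not the thesis's actual rule. All function names and the toy documents are likewise hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute plain tf-idf vectors (raw count * log(N / df)) for tokenized docs."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append([counts[w] * math.log(n / df[w]) for w in vocab])
    return vocab, vectors


def to_significance(vectors, alpha):
    """Map tf-idf entries to -1, 0, or 1.

    ASSUMPTION: an entry is +1 (or -1) when it lies more than alpha standard
    deviations above (or below) the term's mean over all documents. The
    abstract only says that alpha controls the mapping.
    """
    n, dims = len(vectors), len(vectors[0])
    sig = [[0] * dims for _ in range(n)]
    for j in range(dims):
        col = [v[j] for v in vectors]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        if std == 0:
            continue  # term carries no contrast; leave every entry neutral
        for i, x in enumerate(col):
            if x > mean + alpha * std:
                sig[i][j] = 1
            elif x < mean - alpha * std:
                sig[i][j] = -1
    return sig


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors, members):
    """Average the vectors of the documents in one cluster."""
    dims = len(vectors[0])
    return [sum(vectors[i][j] for i in members) / len(members) for j in range(dims)]


def agglomerate(sig):
    """Start with one cluster per document and repeatedly merge the pair of
    clusters whose average significance vectors are most cosine-similar.
    Returns the merges in order, which together describe a cluster tree."""
    clusters = [[i] for i in range(len(sig))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cosine(mean_vector(sig, clusters[a]), mean_vector(sig, clusters[b]))
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b])))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Toy corpus (hypothetical): two documents about pets, two about finance.
docs = [
    ["cat", "dog", "cat"],
    ["cat", "dog", "dog"],
    ["stock", "bond", "stock"],
    ["bond", "stock", "bond"],
]
vocab, vectors = tf_idf(docs)
sig = to_significance(vectors, alpha=0.5)
merges = agglomerate(sig)
```

Running the sketch with several alpha values produces several merge trees. Choosing among them is the role the abstract assigns to maximum tree internal similarity, which is not sketched here because the abstract does not define it.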

【Preview】

Attachments
Files	Size	Format	View
Significant Feature Clustering	902KB	PDF	download


Significant Feature Clustering
Whissell, John
University of Waterloo
Keywords: Computer Science; clustering; tf-idf vectors; data representations
Others : https://uwspace.uwaterloo.ca/bitstream/10012/2926/1/jswhisse2006.pdf
Switzerland | English
Source: UWSPACE Waterloo Institutional Repository


	Document metrics
	Downloads: 5	Views: 28
