Journal Article

【Abstract】

Data can be represented in many different ways within a particular document or set of documents. Hence, attempts to automatically process the relationships between documents or determine the relevance of certain document objects can be problematic. In this study, we have developed software to automatically catalog objects contained in HTML files for patents granted by the United States Patent and Trademark Office (USPTO). Once these objects are recognized, the software creates metadata that assigns a data type to each document object. Such metadata can be easily processed and analyzed for subsequent text mining tasks. Specifically, document similarity and clustering techniques were applied to a subset of the USPTO document collection. Although our preliminary results demonstrate that tables and numerical data do not provide quantifiable value to a document’s content, the stage for future work in measuring the importance of document objects within a large corpus has been set.
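The pipeline outlined in the abstract (recognize document objects, tag them with a data type, then compare documents with and without those objects) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' software: the names `ObjectCataloger`, `catalog`, and `cosine` are hypothetical, the only object types handled are tables, numbers, and words, and plain bag-of-words cosine similarity stands in for whatever similarity measure the study actually used.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class ObjectCataloger(HTMLParser):
    """Hypothetical helper: counts top-level tables and keeps text found outside them."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0   # > 0 while the parser is inside a <table>
        self.tables = 0        # number of top-level tables seen
        self.text_parts = []   # text outside any table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
            if self.table_depth == 1:
                self.tables += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        if self.table_depth == 0:
            self.text_parts.append(data)


def catalog(html):
    """Return simple metadata: table count plus typed tokens (assumed type set)."""
    parser = ObjectCataloger()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts).lower()
    tokens = re.findall(r"[a-z]+|\d+(?:\.\d+)?", text)
    return {
        "tables": parser.tables,
        "numbers": [t for t in tokens if t[0].isdigit()],
        "words": [t for t in tokens if t[0].isalpha()],
    }


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0
```

Comparing `cosine(a["words"], b["words"])` against `cosine(a["words"] + a["numbers"], b["words"] + b["numbers"])` for two cataloged documents `a` and `b` gives a crude view of how much the numeric objects shift a similarity score, which is the kind of question the study poses for tables and numerical data.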


Algorithms
The Effects of Tabular-Based Content Extraction on Patent Document Clustering

Denise R. Koessler¹ Benjamin W. Martin¹ Bruce E. Kiefer²
[1] EECS Department, Min H. Kao Building Suite 401, University of Tennessee, 1520 Middle Drive, Knoxville, TN 37996, USA
[2] Catalyst Repository Systems, 1860 Blake Street, 7th Floor, Denver, CO 80202, USA
Keywords: text mining; patent documents; table data
DOI : 10.3390/a5040490
Source: MDPI

