Technical report details
Text-mining based journal splitting
Lin, Xiaofan
HP Development Company
Keywords: table of contents; OCR; journal splitting; text mining; text chunking; document understanding
RP-ID: HPL-2001-137R1
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
【 Abstract 】

This paper introduces a novel journal splitting algorithm. It combines several kinds of information, such as text matches, layout, and page numbers. The core procedure is a highly efficient text-mining algorithm that detects matched phrases between the contents pages and the title pages of individual articles. Experiments show that the algorithm is robust and able to split a wide range of journals, magazines, and books. (5 pages)
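To make the phrase-matching idea in the abstract concrete, the following is a minimal sketch, not the report's actual algorithm. It assumes the table-of-contents entries and the OCR text of each candidate page are already available as strings, and it scores each page by the longest run of consecutive words it shares with an entry. The names `tokenize`, `longest_shared_phrase`, `split_journal`, and the `min_words` threshold are hypothetical, and the layout and page-number cues mentioned in the abstract are left out.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def longest_shared_phrase(a, b):
    """Return the length, in words, of the longest contiguous word run
    that appears in both token lists (longest common substring over words)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for word_a in a:
        cur = [0] * (len(b) + 1)
        for j, word_b in enumerate(b, start=1):
            if word_a == word_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def split_journal(toc_entries, pages, min_words=2):
    """Map each contents entry to the index of the page it matches best.

    Entries whose best shared phrase is shorter than `min_words`
    (an assumed threshold) are left unassigned.
    """
    if not pages:
        return {}
    page_tokens = [tokenize(page) for page in pages]
    starts = {}
    for entry in toc_entries:
        entry_tokens = tokenize(entry)
        scores = [longest_shared_phrase(entry_tokens, pt) for pt in page_tokens]
        best_page = max(range(len(pages)), key=scores.__getitem__)
        if scores[best_page] >= min_words:
            starts[entry] = best_page
    return starts


# Hypothetical OCR output for three pages (the contents page itself is excluded).
pages = [
    "Text Mining for OCR\nA. Author\nThis article studies recognition.",
    "Continued discussion of text and results.",
    "Layout Analysis Revisited\nB. Writer",
]
toc = ["Text mining for OCR", "Layout analysis revisited"]
article_starts = split_journal(toc, pages)
# article_starts == {"Text mining for OCR": 0, "Layout analysis revisited": 2}
```

Each matched page index marks where an article begins, so consecutive start indices delimit the page range of one article. The quadratic word comparison here is chosen for readability only; the abstract describes the report's own matching procedure as highly efficient, which this sketch does not attempt to reproduce.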
