Technical Report Details
Text-mining based journal splitting
Lin, Xiaofan
HP Development Company
Keywords: table of contents; OCR; journal splitting; text mining; text chunking; document understanding
RP-ID: HPL-2001-137R1
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
[Abstract]
This paper introduces a novel journal splitting algorithm. It takes full advantage of several kinds of information, such as text matching, layout, and page numbers. The core procedure is a highly efficient text-mining algorithm that detects matched phrases between the contents pages and the title pages of individual articles. Experiments show that the algorithm is robust and able to split a wide range of journals, magazines, and books. Length: 5 pages.
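The abstract's core idea, detecting phrases shared between the contents pages and each article's title page, can be illustrated with a minimal sketch. This is not the report's algorithm, which this record does not describe in detail: it assumes OCR text is already available per page, uses plain word n-gram overlap as a stand-in for the phrase matching, and ignores the layout and page-number cues the abstract mentions. The names `word_ngrams` and `find_article_starts` are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple


def word_ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_article_starts(
    toc_entries: List[str], pages: List[str], n: int = 3
) -> Dict[str, int]:
    """Map each contents entry to the index of its best-matching page.

    Assumption (illustrative only): the page sharing the most word
    n-grams with a contents entry is that article's title page.
    `pages` should exclude the contents pages themselves. Entries
    with no shared n-gram on any page are omitted from the result.
    """
    page_grams = [word_ngrams(page, n) for page in pages]
    starts: Dict[str, int] = {}
    for entry in toc_entries:
        entry_grams = word_ngrams(entry, n)
        scores = [len(entry_grams & grams) for grams in page_grams]
        # On ties, max() keeps the earliest page index.
        best = max(range(len(pages)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] > 0:
            starts[entry] = best
    return starts
```

Given the matched start pages, a journal could then be split by treating each start index as the first page of an article and the page before the next start as its last. A more faithful version would presumably also weigh layout and printed page numbers, as the abstract indicates.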
[Preview]
Files | Size | Format | View |
---|---|---|---|
RO201804100002184LZ | 306KB | | download |