Technical Report Details
| Text-mining based journal splitting | |
| Lin, Xiaofan | |
| HP Development Company | |
| Keywords: table of contents; OCR; journal splitting; text mining; text chunking; document understanding | |
| RP-ID : HPL-2001-137R1 | |
| Subject classification: Computer Science (General) | |
| United States | English | |
| Source: HP Labs | |
[Abstract]
This paper introduces a novel journal splitting algorithm. It takes full advantage of various kinds of information such as text match, layout and page numbers. The core procedure is a highly efficient text-mining algorithm, which detects the matched phrases between the content pages and the title pages of individual articles. Experiments show that this algorithm is robust and able to split a wide range of journals, magazines and books.

Length: 5 pages
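
The abstract does not spell out the matching procedure, so the following is only a minimal sketch of the general idea of detecting phrases shared between a contents page and candidate title pages. It assumes plain OCR text per page and uses a brute-force longest-common-word-run comparison, not the efficient text-mining algorithm of the report. The names `tokenize`, `longest_shared_phrase`, `find_article_starts` and the `min_words` threshold are hypothetical, and layout and page-number cues are ignored.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def longest_shared_phrase(a_tokens, b_tokens):
    """Return the length, in words, of the longest contiguous word run
    that appears in both token lists (dynamic programming, O(len(a) * len(b)))."""
    best = 0
    prev = [0] * (len(b_tokens) + 1)
    for a in a_tokens:
        curr = [0] * (len(b_tokens) + 1)
        for j, b in enumerate(b_tokens, start=1):
            if a == b:
                # Extend the run that ended at the previous word of both lists.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def find_article_starts(contents_text, pages, min_words=3):
    """Return indices of pages that share a phrase of at least `min_words`
    words with the contents page (hypothetical threshold)."""
    toc_tokens = tokenize(contents_text)
    starts = []
    for index, page_text in enumerate(pages):
        if longest_shared_phrase(toc_tokens, tokenize(page_text)) >= min_words:
            starts.append(index)
    return starts
```

In this sketch, a page is treated as a candidate article title page when it repeats a run of several words from the contents page, such as an article title. A real splitter would also need to tolerate OCR errors and to combine the text match with layout and page numbers, as the abstract indicates.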
[Preview]
| Files | Size | Format | View |
|---|---|---|---|
| RO201804100002184LZ | 306KB | PDF | |