Technical Report Details
| Header and Footer Extraction by Page-Association | |
| Lin, Xiaofan | |
| HP Development Company | |
| Keywords: document structure analysis; optical character recognition; header/footer extraction; digital content re-mastering | |
| RP-ID : HPL-2002-129 | |
| Subject category: Computer Science (General) | |
| United States|English | |
| Source: HP Labs | |
【 Abstract 】
This paper introduces a robust algorithm for extracting headers and footers from a variety of electronic documents, such as image files, Adobe PDF files, and files generated by Optical Character Recognition (OCR). Unlike conventional methods, which rely on page-level layout and format, the proposed strategy considers each page in the context of its neighboring pages. Through this page-association, headers and footers can be detected automatically across a wide range of documents without human intervention. In addition, fuzzy string matching makes the method robust to OCR errors.
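The abstract does not give the matching procedure, so the following is only a minimal sketch of the general idea, not the report's actual algorithm. It assumes each page is already available as a list of text lines, compares the top and bottom lines of a page with the same positions on neighboring pages, and uses Python's `difflib.SequenceMatcher` as a stand-in for the fuzzy string match. The names `detect_headers_footers`, `window`, `depth`, and `threshold`, along with the digit-masking step, are hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(line):
    # Assumption: mask digit runs so that page numbers on adjacent
    # pages ("Page 1" vs "Page 2") compare as identical.
    return re.sub(r"\d+", "#", line.strip().lower())


def similar(a, b, threshold):
    # Fuzzy comparison that tolerates a few OCR character errors.
    na, nb = normalize(a), normalize(b)
    if not na or not nb:
        return False  # blank lines carry no evidence
    return SequenceMatcher(None, na, nb).ratio() >= threshold


def detect_headers_footers(pages, window=1, depth=1, threshold=0.8):
    """Return one (header_lines, footer_lines) pair per page.

    pages: list of pages, each a list of text lines in reading order.
    window: how many pages on each side count as neighbors.
    depth: how many lines from the top and bottom are candidates.
    """
    results = []
    for i, page in enumerate(pages):
        lo, hi = max(0, i - window), min(len(pages), i + window + 1)
        neighbors = [pages[j] for j in range(lo, hi) if j != i]
        limit = min(depth, len(page))
        # A top line is a header if a neighbor has a similar line
        # at the same offset from the top.
        header = [
            page[k] for k in range(limit)
            if any(k < len(n) and similar(page[k], n[k], threshold)
                   for n in neighbors)
        ]
        # A bottom line is a footer if a neighbor has a similar line
        # at the same offset from the bottom.
        footer = [
            page[-k] for k in range(1, limit + 1)
            if any(k <= len(n) and similar(page[-k], n[-k], threshold)
                   for n in neighbors)
        ]
        results.append((header, footer))
    return results


# Hypothetical sample: the second page carries simulated OCR errors.
sample_pages = [
    ["HP Labs Technical Report", "Introduction text on page one.", "Page 1"],
    ["HP Labs Technica1 Rep0rt", "Different body content here.", "Page 2"],
    ["HP Labs Technical Report", "Yet another paragraph.", "Page 3"],
]
detected = detect_headers_footers(sample_pages)
```

Because the decision for each page depends on its neighbors rather than on fixed layout rules, a sketch like this needs no per-document template. A real system would likely also use position and font information where available.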
【 Preview 】
| Files | Size | Format | View |
|---|---|---|---|
| RO201804100001846LZ | 414KB | PDF | |