Biodiversity Information Science and Standards | |
Flora Prepper: Preparing floras for morphological parsing and integration | |
article | |
Jocelyn Pender1  | |
[1] Agriculture and Agri-Food Canada | |
关键词: Flora; text mining; natural language processing; Python; annotation; | |
DOI : 10.3897/biss.3.37743 | |
来源: Pensoft | |
【 摘 要 】
The increased availability of digital floras and the application of optical character recognition (OCR) to digitized texts has resulted in exciting opportunities for flora data mining. For example, the software package CharaParser has been developed for the semantic annotation of morphological descriptions from taxonomic treatments (Cui 2012). However, after digitization and OCR processing and before parsing of morphological treatments can begin, content types must be annotated (i.e., text represents names, morphology, discussion or distribution). In addition to enabling morphological parsing, content type annotation also facilitates content search and data linkage. For example, by annotating pieces of a floral treatment, assertions from various floras of the same type can be combined into a single document (i.e., a "mash-up" floral treatment). Several products and pipelines have been developed for the semantic annotation, or mark-up, of taxonomic documents (e.g., GoldenGATE, FlorML; Sautter et al. 2012, Hamann et al. 2014). However, these products lack a combination of both ease of implementation (e.g., the ability to run as a script in a programmatic workflow) and the use of modern parsing methods, such as text mining and Natural Language Processing (NLP) approaches.Here I present a pilot project implementing text mining and NLP approaches to marking-up floras implemented in Python. I will describe the success of the project, and summarize lessons learned, especially in relation to previous flora markup projects. Annotation of existing flora documents is an essential step towards building next-generation floras (i.e., mash-ups and enhanced floras as platforms) and enables automated trait extraction. Building an easy-to-use access point to modern text mining and NLP techniques for botanical literature will allow for more flexible and responsive flora annotation, and is an important step towards realizing botanical data integration goals.
【 授权许可】
Unknown
【 预 览 】
Files | Size | Format | View |
---|---|---|---|
RO202307130001944ZK.pdf | 55KB | download |