Entity Recognition (ER) can be used to extract information about socio-technical systems from unstructured natural-language text. In many current ER solutions, this process is limited by the set of entity classes considered. In this thesis, we report on the development of an ER classifier that supports a wide range of entity classes relevant for analyzing multi-modal, socio-technical systems. Another limitation of current entity extractors is that they mainly support the detection of named entities, typically in the form of proper nouns. The presented solution also detects entities that are not referred to by a name, such as general references to places (e.g. forest) or natural resources (e.g. timber). We use supervised machine learning for this project. To overcome the data sparseness that results from considering a large number of entity classes, we built two separate classifiers, one predicting entity boundary labels and one predicting entity class labels. We investigate rules for merging the two labels while minimizing the loss of accuracy caused by this step. Our largest model, with 94 classes, achieves an accuracy of 75.9%. We compare the performance of our solution to other standard systems on several datasets and find that, with the same number of classes, the accuracy of our classifier is comparable to that of other state-of-the-art ER packages.
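To make the idea of merging boundary and class labels concrete, the following is a minimal sketch and not the merging rules investigated in the thesis. It assumes a B/I/O boundary scheme, a per-token class label with "O" for no class, and a simple majority vote within each span; the function name `merge_labels` and the class names `RESOURCE` and `LOCATION` are hypothetical.

```python
from collections import Counter


def merge_labels(boundaries, classes):
    """Merge per-token boundary labels (B/I/O) with per-token class labels.

    Assumed rule set (illustrative only):
      * a token with boundary "O" is always labelled "O";
      * a span starts at "B" (or at a stray "I") and covers the following "I" tokens;
      * the span takes the most frequent non-"O" class among its tokens;
      * a span whose tokens all have class "O" is dropped.
    """
    if len(boundaries) != len(classes):
        raise ValueError("label sequences must have equal length")

    merged = ["O"] * len(boundaries)
    i = 0
    while i < len(boundaries):
        if boundaries[i] == "O":
            i += 1
            continue
        # Find the end of the span that starts at position i.
        j = i + 1
        while j < len(boundaries) and boundaries[j] == "I":
            j += 1
        # Majority vote over the class predictions inside the span.
        votes = Counter(c for c in classes[i:j] if c != "O")
        if votes:
            span_class = votes.most_common(1)[0][0]
            merged[i] = "B-" + span_class
            for k in range(i + 1, j):
                merged[k] = "I-" + span_class
        i = j
    return merged


# Tokens: "timber from the Black Forest"
boundaries = ["B", "O", "O", "B", "I"]
classes = ["RESOURCE", "O", "O", "LOCATION", "LOCATION"]
print(merge_labels(boundaries, classes))
```

Any such rule set trades off the two classifiers' errors: here the boundary prediction decides where an entity is, and the class prediction only decides what it is, which is one of several possible design choices.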
Entity recognition for multi-modal socio-technical systems