Journal of Biomedical Semantics
Detecting concept mentions in biomedical text using hidden Markov model: multiple concept types at once or one at a time?
Manabu Torii [1]; Kavishwar Wagholikar [2]; Hongfang Liu [2]
[1] Department of Radiology, Georgetown University Medical Center, Washington, DC, USA; [2] Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN, USA
Keywords: Electronic health records; Data mining; Information storage and retrieval; Natural language processing
DOI: 10.1186/2041-1480-5-3
Received: 2012-09-05; Accepted: 2013-11-26; Published: 2014
Abstract
Background
Identifying phrases that refer to particular concept types is a critical step in extracting information from documents. Provided with annotated documents as training data, supervised machine learning can automate this process. A machine learning model for this task may be built to detect all concept types simultaneously (all-types-at-once) or to detect only one or a few selected types at a time (one-type- or a-few-types-at-a-time). It is of interest to investigate which strategy yields better detection performance.
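As an illustration only (not code from the paper), the sketch below shows how one annotated sentence could be encoded as BIO tag sequences under the two strategies; the `to_bio` helper, the token sequence, and the concept types ("test", "problem") are hypothetical examples chosen for clarity.

```python
# Illustrative sketch: encoding one annotated sentence as BIO tags for the
# two training strategies. Tokens and concept types are made-up examples.

def to_bio(tokens, spans, keep_types=None):
    """Return one BIO tag per token.

    tokens     -- list of token strings
    spans      -- list of (start_index, end_index_exclusive, concept_type)
    keep_types -- if given, only these types are tagged and all others become
                  'O' (one-type- or a-few-types-at-a-time); if None, every
                  type is tagged in a single label set (all-types-at-once).
    """
    tags = ["O"] * len(tokens)
    for start, end, ctype in spans:
        if keep_types is not None and ctype not in keep_types:
            continue
        tags[start] = "B-" + ctype
        for i in range(start + 1, end):
            tags[i] = "I-" + ctype
    return tags


tokens = ["Chest", "x-ray", "showed", "bilateral", "pneumonia", "."]
spans = [(0, 2, "test"), (3, 5, "problem")]

# All-types-at-once: one model is trained with every concept type in its tag set.
print(to_bio(tokens, spans))
# ['B-test', 'I-test', 'O', 'B-problem', 'I-problem', 'O']

# One-type-at-a-time: a separate model per type; mentions of other types are 'O'.
print(to_bio(tokens, spans, keep_types={"problem"}))
# ['O', 'O', 'O', 'B-problem', 'I-problem', 'O']
```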
Results
Hidden Markov models using the different strategies were evaluated on a clinical corpus annotated with three concept types (i2b2/VA corpus) and a biology literature corpus annotated with five concept types (JNLPBA corpus). Ten-fold cross-validation tests were conducted and the experimental results showed that models trained for multiple concept types consistently yielded better performance than those trained for a single concept type. F-scores observed for the former strategies were higher than those observed for the latter by 0.9 to 2.6% on the i2b2/VA corpus and 1.4 to 10.1% on the JNLPBA corpus, depending on the target concept types. Improved boundary detection and reduced type confusion were observed for the all-types-at-once strategy.
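F-scores of this kind are typically computed over exact phrase matches, where a predicted mention counts as correct only if its boundaries and concept type both match the gold annotation. The sketch below is a minimal illustration of phrase-level precision, recall, and F1 under that assumption; the `f_score` helper and the gold/predicted spans are made up and do not correspond to any result reported in the paper.

```python
# Illustrative sketch of phrase-level (exact-match) scoring for mention
# detection; spans are (start_token, end_token_exclusive, concept_type).

def f_score(gold, predicted):
    """Return precision, recall, and F1 over sets of exactly matching spans."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


gold = [(0, 2, "test"), (3, 5, "problem"), (8, 9, "treatment")]
pred = [(0, 2, "test"), (3, 5, "test"), (8, 9, "treatment")]  # one type confusion

print(f_score(gold, pred))  # (0.667, 0.667, 0.667), rounded
```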
Conclusions
The current results suggest that detection of concept phrases could be improved by tackling multiple concept types simultaneously. They also suggest that multiple concept types should be annotated when developing a new corpus for machine learning models. Further investigation is needed to gain insight into the mechanism underlying the improved performance observed when multiple concept types are considered together.
License
© 2014 Torii et al.; licensee BioMed Central Ltd.