Journal article

【Abstract】

The database of Genotypes and Phenotypes (dbGaP) allows researchers to understand phenotypic contribution to genetic conditions, generate new hypotheses, confirm previous study results, and identify control populations. However, effective use of the database is hindered by suboptimal study retrieval. Our objective is to evaluate text classification techniques to improve study retrieval in the context of the dbGaP database. We utilized standard machine learning algorithms (naive Bayes, support vector machines, and the C4.5 decision tree) trained on dbGaP study text and incorporated n-gram features and study metadata to identify heart, lung, and blood studies. We used the χ² feature selection algorithm to identify features that contributed most to classification performance and experimented with dbGaP-associated PubMed papers as a proxy for topicality. Classifier performance was favorable in comparison to keyword-based search results. We conclude that text categorization is a useful complement to document retrieval techniques in dbGaP.
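As an illustration of the χ² feature selection step mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It scores word n-grams by the χ² statistic of a 2×2 table (n-gram present/absent versus study in/out of the target category). The function names `ngrams` and `chi2_scores`, the whitespace tokenization, and the four toy study titles are all assumptions made for this example.

```python
from collections import Counter


def ngrams(text, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) in a lowercased text."""
    tokens = text.lower().split()
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def chi2_scores(docs, labels, n_max=2):
    """Score each n-gram by the chi-squared statistic of its 2x2
    presence/absence-by-class contingency table."""
    n_docs = len(docs)
    n_pos = sum(labels)
    n_neg = n_docs - n_pos
    pos_df = Counter()  # document frequency among positive studies
    neg_df = Counter()  # document frequency among negative studies
    for text, label in zip(docs, labels):
        target = pos_df if label else neg_df
        target.update(ngrams(text, n_max))
    scores = {}
    for gram in set(pos_df) | set(neg_df):
        a = pos_df[gram]   # positive studies containing the n-gram
        b = neg_df[gram]   # negative studies containing the n-gram
        c = n_pos - a      # positive studies without it
        d = n_neg - b      # negative studies without it
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # An n-gram present in every document carries no information.
        scores[gram] = n_docs * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Hypothetical toy data: 1 = heart/lung/blood study, 0 = other.
docs = [
    "heart failure cohort",
    "lung function cohort",
    "autism family study",
    "glaucoma family study",
]
labels = [1, 1, 0, 0]
scores = chi2_scores(docs, labels)
top_features = sorted(scores, key=scores.get, reverse=True)[:5]
```

In a pipeline of the kind the abstract describes, the highest-scoring n-grams would be kept as classifier features and the rest discarded; how many to keep is a tuning decision this sketch does not address.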

【License】

CC BY

Biomedical Informatics Insights
Text Categorization of Heart, Lung, and Blood Studies in the Database of Genotypes and Phenotypes (dbGaP) Utilizing n-grams and Metadata Features

Mindy K. Ross
Keywords: text classification; text categorization; database; genome-wide association studies; GWAS; natural language processing
DOI: 10.4137/BII.S11987
Subject classification: Medicine (general)
Source: Sage Journals