Journal Article Details
BMC Bioinformatics
A novel procedure on next generation sequencing data analysis using text mining algorithm
Research Article
Yuping Wang [1], James J. Chen [1], Zhichao Liu [1], Huixiao Hong [1], Roger Perkins [1], Weida Tong [1], Wen Zou [1], Weizhong Zhao [2]
[1] Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, HFT-20, 72079, Jefferson, AR, USA
[2] College of Information Engineering, Xiangtan University, Xiangtan, Hunan Province, China
Keywords: Data mining; Topic modeling; Next-generation sequencing (NGS); Genetic diversity; Biomarker
DOI: 10.1186/s12859-016-1075-9
Received 2015-10-20, accepted 2016-05-07, published 2016
Source: Springer
【 Abstract 】

Background: Next-generation sequencing (NGS) technologies have opened up vast possibilities in many areas of biological and biomedical research. Efficient data mining strategies are in high demand for large-scale comparative and evolutionary studies on the large volumes of data that NGS projects produce. Topic modeling is an active research field in machine learning and has mainly been used as an analytical tool to structure large textual corpora for data mining.

Methods: We report a novel procedure to analyse NGS data using topic modeling. It consists of four major steps: NGS data retrieval, preprocessing, topic modeling, and data mining using Latent Dirichlet Allocation (LDA) topic outputs. The NGS data set of Salmonella enterica strains was used as a case study to show the workflow of this procedure. The perplexity for different numbers of topics and the convergence efficiency of Gibbs sampling were calculated and discussed with a view to achieving the best result from the proposed procedure.

Results: The topics output by the LDA algorithm could be treated as features of Salmonella strains that accurately describe the genetic diversity of the fliC gene in various serotypes. Two-way hierarchical clustering and data matrix analysis on the LDA-derived matrices successfully classified Salmonella serotypes based on the NGS data.

Conclusion: The implementation of topic modeling in NGS data analysis provides a new way to elucidate genetic information from NGS data and to identify gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data.
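The topic modeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how sequences are turned into "words", so the overlapping k-mer tokenization, the toy sequences, the strain names, and all parameter values below are assumptions made for illustration only. The sketch runs collapsed Gibbs sampling for LDA in plain Python and returns a strain-by-topic matrix of the kind that could then be passed to hierarchical clustering.

```python
import random
from collections import Counter

def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers (assumed 'words')."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    Returns one row of topic proportions per document.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # counts per document
    topic_word = [Counter() for _ in range(n_topics)]   # counts per topic
    topic_total = [0] * n_topics
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic proportions per document (each row sums to 1).
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]

# Hypothetical toy sequences, not real fliC data.
sequences = {
    "strain_A": "ATGGCACAAGTCATTAATACAAACAGCCTG",
    "strain_B": "ATGGCACAAGTCATTAATACCAACAGCCTG",
    "strain_C": "GGTTCCGGAACCTTGGCCAAGGTTCCAAGG",
    "strain_D": "GGTTCCGGAACCTTGGCCAAGGTTCCTAGG",
}
docs = [kmers(seq) for seq in sequences.values()]
theta = lda_gibbs(docs, n_topics=2)
for name, row in zip(sequences, theta):
    print(name, [round(p, 3) for p in row])
```

Each row of `theta` plays the role of a feature vector for one strain. In a fuller pipeline the number of topics would be chosen by comparing perplexity across candidate values, and the number of Gibbs iterations by checking convergence, as the abstract indicates.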

【 License 】

CC BY   
© Zhao et al. 2016
