Journal Article Details
BMC Genomics
A comparison of machine learning and Bayesian modelling for molecular serotyping
Research Article
Lorenz Wernisch [1], Richard Newton [1]
[1] MRC Biostatistics Unit, Robinson Way, CB2 0SR, Cambridge, UK;
Keywords: Streptococcus pneumoniae; Serotyping; Bayesian; Machine learning; Gradient Boosting Machine; Random Forest
DOI: 10.1186/s12864-017-3998-6
Received 2017-03-10; accepted 2017-08-01; published 2017
Source: Springer
【 Abstract 】

Background: Streptococcus pneumoniae is a human pathogen that is a major cause of infant mortality. Identifying the pneumococcal serotype is an important step in monitoring the impact of vaccines used to protect against disease. Genomic microarrays provide an effective method for molecular serotyping. Previously we developed an empirical Bayesian model for the classification of serotypes from a molecular serotyping array. With only a few samples available, a model-driven approach was the only option. In the meantime, several thousand samples have been made available to us, providing an opportunity to investigate serotype classification by machine learning methods, which could complement the Bayesian model.

Results: We compare the performance of the original Bayesian model with two machine learning algorithms: Gradient Boosting Machines and Random Forests. We present our results as an example of a generic strategy whereby a preliminary probabilistic model is complemented or replaced by a machine learning classifier once enough data are available. Despite the availability of thousands of serotyping arrays, a problem encountered when applying machine learning methods is the lack of training data containing mixtures of serotypes, owing to the large number of possible combinations. Most of the available training data comprise samples with only a single serotype. To overcome the lack of training data we implemented an iterative analysis, creating artificial training data of serotype mixtures by combining raw data from single-serotype arrays.

Conclusions: With the enhanced training set the machine learning algorithms outperform the original Bayesian model. However, for serotypes currently lacking sufficient training data, the best performing implementation was a combination of the results of the Bayesian model and the Gradient Boosting Machine. As well as being an effective method for classifying biological data, machine learning can also be used as an efficient method for revealing subtle biological insights, which we illustrate with an example.
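The idea of building artificial mixture training data from single-serotype arrays can be illustrated with a minimal sketch. The abstract does not state how the raw data were combined, so the weighted average used here is an assumption, and the iterative part of the analysis is not reproduced. The function name `make_mixtures`, the toy probe intensities, and the mixing range 0.1 to 0.9 are all hypothetical choices made for illustration.

```python
import random


def make_mixtures(arrays, labels, n_mixtures, rng=None):
    """Build artificial two-serotype mixtures from single-serotype arrays.

    arrays: list of probe-intensity lists, one per single-serotype sample
    labels: serotype name for each array (same order as arrays)
    Returns (mixed_arrays, mixed_labels); each label is a frozenset of
    the two serotypes that were combined.
    """
    if len(set(labels)) < 2:
        raise ValueError("need at least two distinct serotypes to mix")
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    mixed_arrays, mixed_labels = [], []
    while len(mixed_arrays) < n_mixtures:
        i, j = rng.sample(range(len(arrays)), 2)
        if labels[i] == labels[j]:
            continue  # skip pairs that would not form a true mixture
        # Assumption: a mixture is a weighted average of raw intensities.
        w = rng.uniform(0.1, 0.9)
        mixed = [w * a + (1.0 - w) * b for a, b in zip(arrays[i], arrays[j])]
        mixed_arrays.append(mixed)
        mixed_labels.append(frozenset((labels[i], labels[j])))
    return mixed_arrays, mixed_labels


# Toy data: three single-serotype arrays with three probes each.
arrays = [[9.0, 1.0, 1.0], [1.0, 9.0, 1.0], [1.0, 1.0, 9.0]]
labels = ["19F", "6B", "23F"]
mixed_X, mixed_y = make_mixtures(arrays, labels, n_mixtures=5)
```

The resulting `mixed_X` and `mixed_y` could be appended to the single-serotype data before training a multi-label classifier such as a Gradient Boosting Machine or Random Forest.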

【 License 】

CC BY   
© The Author(s) 2017
