Journal article details
BMC Bioinformatics
Model averaging strategies for structure learning in Bayesian networks with limited data
Research
Bradley M Broom [1], Kim-Anh Do [2], Devika Subramanian [3]
[1] Department of Bioinformatics and Computational Biology, UT MD Anderson Cancer Center, 77030, Houston, Texas, USA
[2] Department of Biostatistics, UT MD Anderson Cancer Center, 77030, Houston, Texas, USA
[3] Department of Computer Science, Rice University, 77005, Houston, Texas, USA
Keywords: Posterior Probability; Bayesian Network; Feature Probability; Structure Learning; Bootstrap Resamples
DOI: 10.1186/1471-2105-13-S13-S10
Source: Springer
【 Abstract 】

Background: Considerable progress has been made on algorithms for learning the structure of Bayesian networks from data. Model averaging over bootstrap replicates, with feature selection by thresholding, is a widely used approach for learning features with high confidence. Yet, in the context of limited data, many questions remain unanswered. Which scoring functions are most effective for model averaging? Does the bias arising from the discreteness of the bootstrap significantly affect learning performance? Is it better to pick the single best network or to average multiple networks learnt from each bootstrap resample? How should thresholds for learning statistically significant features be selected?

Results: The best scoring functions are the Dirichlet Prior Scoring Metric (DPSM) with small λ and the Bayesian Dirichlet metric. Correcting the bias arising from the discreteness of the bootstrap worsens learning performance. It is better to pick the single best network learnt from each bootstrap resample. We describe a permutation-based method for determining significance thresholds for feature selection in bagged models. We show that in contexts with limited data, Bayesian bagging using the DPSM is the most effective learning strategy, and that modifying the scoring function to penalize complex networks hampers model averaging. We establish these results through a systematic study of two well-known benchmarks, ALARM and INSURANCE. We also apply our network construction method to gene expression data from the Cancer Genome Atlas Glioblastoma multiforme dataset and show that survival is related to the clinical covariates age and gender and to clusters of interferon-induced genes and growth-inhibition genes.

Conclusions: For small data sets, our approach performs significantly better than previously published methods.
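The bagging-and-thresholding idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_learner` is a hypothetical stand-in that reports strongly correlated variable pairs as "edge" features rather than learning a Bayesian network with DPSM or any other scoring metric, and the fixed threshold of 0.9 is an illustrative assumption, not the permutation-based threshold the paper describes. Only the general pattern is shown: learn one model per bootstrap resample, estimate each feature's probability as the fraction of resamples in which it appears, and keep the features above a threshold.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def toy_learner(rows, min_abs_corr=0.5):
    """Hypothetical stand-in for a structure learner.

    Returns a set of undirected pair features (i, j) whose absolute
    correlation is at least min_abs_corr. A real study would search
    for a high-scoring Bayesian network here instead.
    """
    n_vars = len(rows[0])
    cols = [[r[j] for r in rows] for j in range(n_vars)]
    return {
        (i, j)
        for i, j in combinations(range(n_vars), 2)
        if abs(pearson(cols[i], cols[j])) >= min_abs_corr
    }


def bagged_feature_probabilities(rows, learner, n_resamples=200, seed=0):
    """Estimate feature probabilities as frequencies over bootstrap resamples."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_resamples):
        # Ordinary bootstrap: draw len(rows) rows with replacement.
        resample = [rng.choice(rows) for _ in rows]
        for feature in learner(resample):
            counts[feature] = counts.get(feature, 0) + 1
    return {f: c / n_resamples for f, c in counts.items()}


def select_features(probabilities, threshold):
    """Keep features whose estimated probability reaches the threshold."""
    return {f for f, p in probabilities.items() if p >= threshold}


# Small synthetic data set: variable 1 closely tracks variable 0,
# variable 2 is generated independently.
data_rng = random.Random(1)
rows = []
for _ in range(30):
    a = data_rng.gauss(0, 1)
    rows.append((a, a + data_rng.gauss(0, 0.1), data_rng.gauss(0, 1)))

probs = bagged_feature_probabilities(rows, toy_learner)
selected = select_features(probs, threshold=0.9)  # 0.9 is an assumed cutoff
```

In this sketch the learner is passed in as a function, so a real score-based structure search could be substituted without changing the resampling or thresholding steps.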

【 License 】

CC BY   
© Broom et al; licensee BioMed Central Ltd. 2012
