Journal article

【Abstract】

Background: Data from biomedical domains often have an inherent hierarchical structure. As this structure is usually implicit, its existence can be overlooked by practitioners interested in constructing and evaluating predictive models from such data. Ignoring these constructs leads to a potentially problematic and routinely unrecognized bias in the models and results. In this work, we discuss this bias in detail and propose a simple, sampling-based solution for it. Next, we explore its sources and extent on synthetic data. Finally, we demonstrate how the state-of-the-art variant prioritization framework, eXtasy, benefits from using the described approach in its Random forest-based core classification model.

Results and conclusions: The conducted simulations clearly indicate that the heterogeneous granularity of feature domains poses significant problems for both the standard Random forest classifier and a modification that relies on stratified bootstrapping. Conversely, using the proposed sampling scheme when training the classifier mitigates the described bias. Furthermore, when applied to the eXtasy data under a realistic class distribution scenario, a Random forest learned using the proposed sampling scheme displays much better precision than its standard version, without degrading recall. Moreover, the largest performance gains are achieved in the most important part of the operating range: the top of the prioritized gene list.

【License】

CC BY
© Popovic et al.; licensee BioMed Central Ltd. 2015

BMC Bioinformatics
Problems with the nested granularity of feature domains in bioinformatics: the eXtasy case
Research
Jesse Davis¹ Bart De Moor² Yves Moreau² Dusan Popovic² Alejandro Sifrim³
[1] KU Leuven, Department of Computer Science, Celestijnenlaan 200A, B-3001, Leuven, Belgium; KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, box 2446, Kasteelpark Arenberg 10, B-3001, Leuven, Belgium; iMinds Medical IT, box 2446, Kasteelpark Arenberg 10, B-3001, Leuven, Belgium; Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton Cambridge, UK
Keywords: data granularity; validation bias; learning bias; hierarchical sampling; bootstrapping; eXtasy; Random forest; ensemble classifiers
DOI: 10.1186/1471-2105-16-S4-S2
Source: Springer