Journal article

【Abstract】

Background: Accurate detection of homologous relationships among biological sequences (DNA or amino acid) across organisms is an important and often difficult task. It is essential to a range of evolutionary studies, from building phylogenies to predicting functional gene annotations. Many heuristic tools exist, most commonly based on bidirectional BLAST searches, that identify homologous genes and assign them to two fundamentally distinct classes: orthologs and paralogs. Because these methods rely only on heuristic filtering by significance score cutoffs and have no cluster post-processing tools available, they can often produce multiple clusters consisting of unrelated (non-homologous) sequences. Sequence data extracted from incomplete genome or transcriptome assemblies, whether derived from low-coverage sequencing or produced de novo without a reference genome, are therefore susceptible to high false positive rates in homology detection.

Results: In this paper we develop biologically informative features that can be extracted from multiple sequence alignments of putative homologous genes (orthologs and paralogs) and further utilized in the context of guided experimentation to verify false positive outcomes. We demonstrate that our machine learning method, trained on both known homology clusters obtained from OrthoDB and randomly generated sequence alignments (non-homologs), successfully identifies apparent false positives inferred by heuristic algorithms, especially among proteomes recovered from low-coverage RNA-seq data. Approximately 42% and 25% of the putative homologies predicted by InParanoid and HaMStR, respectively, were classified as false positives on the experimental data set.

Conclusions: Our process increases the quality of output from other clustering algorithms by providing a novel post-processing method that is both fast and efficient at removing low-quality clusters of putative homologous genes recovered by heuristic-based approaches.

【License】

CC BY
© Fujimoto et al. 2016

BMC Bioinformatics
Detecting false positive sequence homology: a machine learning approach
Methodology Article
Mark J. Clement¹, M. Stanley Fujimoto¹, Anton Suvorov², Nicholas O. Jensen², Seth M. Bybee²
[1] Computer Science Department, Brigham Young University, 84602, Provo, Utah, USA
[2] Department of Biology, Brigham Young University, 84602, Provo, Utah, USA
Keywords: Homology; Orthology; Paralogy; Machine learning; Evolution; RNA-seq
DOI: 10.1186/s12859-016-0955-3
Received 2015-05-11, accepted 2016-02-19, published 2016
Source: Springer