Journal of Computational Biology: A Journal of Computational Molecular Cell Biology
Quartet Based Gene Tree Imputation Using Deep Learning Improves Phylogenomic Analyses Despite Missing Data
article
Sazan Mahbub [1], Shashata Sawmya [1], Arpita Saha [1], Rezwana Reaz [1], M. Sohel Rahman [1], Shamsuzzoha Bayzid [1]
[1] Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology; Department of Computer Science, University of Maryland, College Park
Keywords: gene tree; gene tree discordance; incomplete lineage sorting; quartet consistency; quartet distribution; species tree; missing data; gene tree imputation
DOI: 10.1089/cmb.2022.0212
Subject classification: Biological sciences (general)
Source: Mary Ann Liebert, Inc. Publishers
【 Abstract 】
Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, for a combination of reasons (ranging from sampling biases to more biological causes, as in gene birth and loss), gene trees are often incomplete, meaning that not all species of interest have a common set of genes. Incomplete gene trees can potentially impact the accuracy of phylogenomic inference. We, for the first time, introduce the problem of imputing the quartet distribution induced by a set of incomplete gene trees, which involves adding the missing quartets back to the quartet distribution. We present Quartet based Gene tree Imputation using Deep Learning (QT-GILD), an automated and specially tailored unsupervised deep learning technique, accompanied by cues from natural language processing, which learns the quartet distribution in a given set of incomplete gene trees and generates a complete set of quartets accordingly. QT-GILD is a general-purpose technique needing no explicit modeling of the subject system or reasons for missing data or gene tree heterogeneity. Experimental studies on a collection of simulated and empirical datasets suggest that QT-GILD can effectively impute the quartet distribution, which results in a dramatic improvement in the species tree accuracy. Remarkably, QT-GILD not only imputes the missing quartets but can also account for gene tree estimation error. Therefore, QT-GILD advances the state-of-the-art in species tree estimation from gene trees in the face of missing data.
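To make the notion of a quartet distribution induced by incomplete gene trees concrete, the following is a minimal Python sketch. It is not QT-GILD and performs no imputation; it only counts the quartet topologies that a set of gene trees induces and shows how a taxon missing from some trees lowers the number of trees that can vote on the quartets containing it. The nested-tuple tree encoding, the function names (`leaf_sets`, `quartet_topology`, `quartet_distribution`, `coverage`), and the toy trees are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def leaf_sets(tree):
    """Return (leaves, clusters) for a tree given as nested tuples of taxon names.

    A cluster is the leaf set below an internal node; each cluster defines one
    bipartition (cluster vs. the rest) of the corresponding unrooted tree.
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, clusters = frozenset(), []
    for child in tree:
        child_leaves, child_clusters = leaf_sets(child)
        leaves |= child_leaves
        clusters.extend(child_clusters)
    clusters.append(leaves)
    return leaves, clusters


def quartet_topology(clusters, four):
    """Return the topology ab|cd induced on four taxa, or None if unresolved."""
    four = frozenset(four)
    for cluster in clusters:
        side = cluster & four
        if len(side) == 2:  # this bipartition splits the four taxa 2 vs. 2
            return frozenset([side, four - side])
    return None


def quartet_distribution(gene_trees):
    """Count how many gene trees induce each resolved quartet topology."""
    counts = Counter()
    for tree in gene_trees:
        leaves, clusters = leaf_sets(tree)
        for four in combinations(sorted(leaves), 4):
            topology = quartet_topology(clusters, four)
            if topology is not None:
                counts[topology] += 1
    return counts


def coverage(gene_trees):
    """Count how many gene trees contain all four taxa of each four-taxon set."""
    cover = Counter()
    for tree in gene_trees:
        leaves, _ = leaf_sets(tree)
        for four in combinations(sorted(leaves), 4):
            cover[frozenset(four)] += 1
    return cover


# Toy input (hypothetical): one complete gene tree and two that lack taxon E.
gene_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]
counts = quartet_distribution(gene_trees)
cover = coverage(gene_trees)
```

In this toy input, the four-taxon set {A, B, C, D} is covered by all three gene trees, while every set containing E is covered by only one. Those under-covered sets are the entries that an imputation method would aim to fill in before species tree estimation.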
【 License 】
Unknown