BMC Bioinformatics, 2023
Jörg Fliege, Michael J. Casey, Rubén J. Sánchez-García, Ben D. MacArthur
License type: CC BY
Background: Single-cell sequencing (sc-Seq) experiments are producing increasingly large data sets. However, large data sets do not necessarily contain large amounts of information.
Results: Here, we formally quantify the information obtained from a sc-Seq experiment and show that it corresponds to an intuitive notion of gene expression heterogeneity. We demonstrate a natural relation between our notion of heterogeneity and that of cell type, decomposing heterogeneity into the component attributable to differential expression between cell types (inter-cluster heterogeneity) and the remainder (intra-cluster heterogeneity). We test our definition of heterogeneity as the objective function of a clustering algorithm, and show that it is a useful descriptor for gene expression patterns associated with different cell types.
Conclusions: Thus, our definition of gene heterogeneity leads to a biologically meaningful notion of cell type, as groups of cells that are statistically equivalent with respect to their patterns of gene expression. Our measure of heterogeneity, and its decomposition into inter- and intra-cluster components, is non-parametric, intrinsic, unbiased, and requires no additional assumptions about expression patterns. Based on this theory, we develop an efficient method for the automatic unsupervised clustering of cells from sc-Seq data, and provide an R package implementation.
BMC Bioinformatics, 2017
Abou Abdallah Malick Diouara, Mohamed Amine Remita, Ahmed Halioui, Abdoulaye Baniré Diallo, Golrokh Kiani, Bruno Daigle
License type: CC BY
Background: Advances in cloning and sequencing technology are yielding a massive number of viral genomes. The classification and annotation of these genomes constitute important assets in the discovery of genomic variability, taxonomic characteristics and disease mechanisms. Existing classification methods are often designed for specific, well-studied families of viruses. Thus, viral comparative genomics studies could benefit from more generic, fast and accurate tools for classifying and typing newly sequenced strains of diverse virus families.
Results: Here, we introduce a virus classification platform, CASTOR, based on machine learning methods. CASTOR is inspired by a well-known technique in molecular biology: restriction fragment length polymorphism (RFLP). It simulates, in silico, the restriction digestion of genomic material by different enzymes into fragments. It uses two metrics to construct feature vectors for machine learning algorithms in the classification step. We benchmark CASTOR for the classification of distinct datasets of human papillomaviruses (HPV), hepatitis B viruses (HBV) and human immunodeficiency viruses type 1 (HIV-1). Results reveal true positive rates of 99%, 99% and 98% for HPV Alpha species, HBV genotyping and HIV-1 M subtyping, respectively. Furthermore, CASTOR shows competitive performance compared to well-known HIV-1-specific classifiers (REGA and COMET) on whole genomes and pol fragments.
Conclusion: The performance of CASTOR, together with its genericity and robustness, could enable novel and accurate large-scale virus studies. The CASTOR web platform provides open-access, collaborative and reproducible machine learning classifiers. CASTOR can be accessed at http://castor.bioinfo.uqam.ca.
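The in silico digestion step described above can be sketched in a few lines of Python. This is a minimal illustration, not CASTOR's implementation: the abstract does not name its two metrics, so fragment count and mean fragment length are assumed here, the function names `digest` and `feature_vector` are hypothetical, and each enzyme is simplified to cut at the start of its recognition site on a linear sequence.

```python
import re
from statistics import mean

# Recognition sites of three real enzymes; the cut position is simplified
# to the start of the site (an assumption made for illustration only).
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

def digest(sequence, site):
    """Cut a linear sequence at the start of every occurrence of `site`."""
    # The lookahead finds every occurrence, including overlapping ones.
    cuts = [m.start() for m in re.finditer(f"(?={site})", sequence)]
    bounds = [0] + [c for c in cuts if c > 0] + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]

def feature_vector(sequence, enzymes=ENZYMES):
    """Per enzyme (sorted by name): fragment count, then mean fragment length."""
    features = []
    for name in sorted(enzymes):
        fragments = digest(sequence, enzymes[name])
        features.append(len(fragments))
        features.append(mean(len(f) for f in fragments))
    return features

if __name__ == "__main__":
    genome = "AAGAATTCTTGGATCCAA"  # toy sequence, not real viral data
    print(digest(genome, ENZYMES["EcoRI"]))
    print(feature_vector(genome))
```

A vector of this kind, computed per genome, is the sort of input a standard classifier could be trained on; which metrics and enzymes work best is an empirical question that the sketch does not address.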
3 Dynamic substrate preferences predict metabolic properties of a simple microbial consortium [Journal article]
BMC Bioinformatics, 2017
Onur Erbilgin, Rebecca K. Lau, Suzanne M. Kosina, Benjamin P. Bowen, Trent R. Northen, Stefan Jenkins
License type: CC BY
Background: Mixed cultures of different microbial species are increasingly being used to carry out a specific biochemical function in lieu of engineering a single microbe to do the same task. However, knowing how different species’ metabolisms will integrate to reach a desired outcome is a difficult problem that has been studied in great detail using steady-state models. Yet many biotechnological processes, as well as natural habitats, represent a more dynamic system. Examining how individual species use resources in their growth medium or environment (exometabolomics) over time in batch culture conditions can provide rich phenotypic data that encompass regulation and transporters, creating an opportunity to integrate the data into a predictive model of resource use by a mixed community.
Results: Here we use exometabolomic profiling to examine the time-varying substrate depletion from a mixture of 19 amino acids and glucose by two Pseudomonas and one Bacillus species isolated from ground water. Contrary to studies in model organisms, we found surprisingly few correlations between resource preferences and maximal growth rate or biomass composition. We then modeled patterns of substrate depletion, and used these models to examine whether substrate usage preferences and substrate depletion kinetics of individual isolates can be used to predict the metabolism of a co-culture of the isolates. We found that most of the substrates fit the model predictions, except for glucose and histidine, which were depleted more slowly than predicted, and proline, glycine, glutamate, lysine and arginine, which were all consumed significantly faster than predicted.
Conclusions: Our results indicate that a significant portion of a model community’s overall metabolism can be predicted based on the metabolism of the individuals. Based on the nature of our model, the resources that significantly deviate from the prediction highlight potential metabolic pathways affected by species-species interactions, which when further studied can potentially be used to modulate microbial community structure and/or function.
BMC Bioinformatics, 2017
M. Krzystanek, Z. Szallasi, O. Pipek, A. Bodor, I. Csabai, D. Ribli, D. Szüts, J. Molnár, G. E. Tusnády, Á. Póti
License type: CC BY
Background: Detection of somatic mutations is one of the main goals of next generation DNA sequencing. A wide range of experimental systems are available for the study of spontaneous or environmentally induced mutagenic processes. However, most of the routinely used mutation calling algorithms are not optimised for the simultaneous analysis of multiple samples, or for non-human experimental model systems with no reliable databases of common genetic variations. Most standard tools either require numerous in-house post-filtering steps with scarce documentation or take an impractically long time to run. To overcome these problems, we designed the streamlined IsoMut tool, which can be readily adapted to experimental scenarios where the goal is the identification of experimentally induced mutations in multiple isogenic samples.
Methods: Using 30 isogenic samples, reliable cohorts of validated mutations were created for testing purposes. Optimal values of the filtering parameters of IsoMut were determined in a thorough and strict optimization procedure based on these test sets.
Results: We show that IsoMut, when tuned correctly, decreases the false positive rate compared to conventional tools in a 30-sample experimental setup, and detects not only single nucleotide variations but short insertions and deletions as well. IsoMut can also be run more than a hundred times faster than the most precise state-of-the-art tool, owing to its straightforward and easily understandable filtering algorithm.
Conclusions: IsoMut has already been successfully applied in multiple recent studies to find unique, treatment-induced mutations in sets of isogenic samples with very low false positive rates. These types of studies provide an important contribution to determining the mutagenic effect of environmental agents or genetic defects, and IsoMut turned out to be an invaluable tool in the analysis of such data.
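The idea of finding mutations that are unique to one sample within an isogenic set can be illustrated with a small Python sketch. It is not IsoMut's algorithm: the abstract does not describe the actual filters, so the function name `unique_mutations`, the input layout and both thresholds (`min_mutated`, `max_other`) are hypothetical choices made purely for illustration.

```python
def unique_mutations(alt_freqs, min_mutated=0.2, max_other=0.03):
    """Return {position: sample} for positions mutated in exactly one sample.

    alt_freqs maps a position to {sample name: alternate-allele frequency}.
    Both thresholds are illustrative assumptions, not tuned values.
    """
    calls = {}
    for position, by_sample in alt_freqs.items():
        mutated = [s for s, f in by_sample.items() if f >= min_mutated]
        if len(mutated) != 1:
            continue  # variant is shared between samples or absent everywhere
        # Every other sample must look clean, otherwise the call is rejected.
        others_clean = all(
            f <= max_other for s, f in by_sample.items() if s != mutated[0]
        )
        if others_clean:
            calls[position] = mutated[0]
    return calls

if __name__ == "__main__":
    toy = {
        ("chr1", 100): {"S1": 0.45, "S2": 0.00, "S3": 0.01},  # unique to S1
        ("chr1", 200): {"S1": 0.50, "S2": 0.50, "S3": 0.50},  # shared by all
        ("chr2", 50): {"S1": 0.00, "S2": 0.30, "S3": 0.10},   # S3 is noisy
    }
    print(unique_mutations(toy))
```

Because the samples are isogenic, a variant seen in every sample is most plausibly inherited from the common ancestor, which is why the sketch discards shared positions rather than consulting a database of known variants.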
BMC Bioinformatics, 2017
Qin Lu, Hongpeng Wang, Jiyun Zhou, Ruifeng Xu, Yulan He
License type: CC BY
Background: Prediction of DNA-binding residues is important for understanding the protein-DNA recognition mechanism. Many computational methods have been proposed for the prediction, but most of them do not consider the relationships of evolutionary information between residues.
Results: In this paper, we first propose a novel residue encoding method, referred to as the Position Specific Score Matrix (PSSM) Relation Transformation (PSSM-RT), to encode residues by utilizing the relationships of evolutionary information between residues. PDNA-62 and PDNA-224 are used to evaluate PSSM-RT and two existing PSSM encoding methods by five-fold cross-validation. Performance evaluations indicate that PSSM-RT is more effective than previous methods. This validates the point that the relationship of evolutionary information between residues is indeed useful in DNA-binding residue prediction. An ensemble learning classifier (EL_PSSM-RT) is also proposed by combining an ensemble learning model with PSSM-RT to better handle the imbalance between binding and non-binding residues in datasets. EL_PSSM-RT is evaluated by five-fold cross-validation using PDNA-62 and PDNA-224 as well as two independent datasets, TS-72 and TS-61. Performance comparisons with existing predictors on the four datasets demonstrate that EL_PSSM-RT is the best-performing method among all the predicting methods, with improvements of 0.02–0.07 in MCC, 4.18–21.47% in ST and 0.013–0.131 in AUC. Furthermore, we analyze the importance of the pair-relationships extracted by PSSM-RT, and the results validate the usefulness of PSSM-RT for encoding DNA-binding residues.
Conclusions: We propose a novel method for the prediction of DNA-binding residues that incorporates the relationship of evolutionary information and ensemble learning. Performance evaluation shows that the relationship of evolutionary information between residues is indeed useful in DNA-binding residue prediction and that ensemble learning can be used to address the data imbalance between binding and non-binding residues. A web service of EL_PSSM-RT (http://hlt.hitsz.edu.cn:8080/PSSM-RT_SVM/) is provided for free access to the biological research community.
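Since the comparison above is reported in MCC (Matthews correlation coefficient), a short Python sketch of the standard formula may help in reading the numbers. The function name `mcc` and the toy confusion-matrix counts are illustrative assumptions, and returning 0.0 when the denominator vanishes is a common convention rather than something stated in the abstract.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when a row or column is empty
    return (tp * tn - fp * fn) / denominator

if __name__ == "__main__":
    # Toy imbalanced case: 10 binding residues, 90 non-binding residues.
    tp, tn, fp, fn = 5, 90, 0, 5
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    print(f"accuracy = {accuracy:.2f}, MCC = {mcc(tp, tn, fp, fn):.3f}")
```

In the toy case half of the binding residues are missed, yet accuracy stays high because non-binding residues dominate; MCC is noticeably lower, which is why it is a common choice for imbalanced problems such as binding-residue prediction.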
BMC Bioinformatics, 2017
Arrate Muñoz-Barrutia, Alejandro Suñé-Auñón, Alvaro Jorge-Peñas, Hans Van Oosterwyck, Rocío Aguilar-Cuenca, Miguel Vicente-Manzanares
License type: CC BY
Background: Traction Force Microscopy (TFM) is a widespread technique to estimate the tractions that cells exert on the surrounding substrate. To recover the tractions, it is necessary to solve an inverse problem, which is ill-posed and needs regularization to make the solution stable. The typical regularization scheme is given by the minimization of a cost functional, which is divided into two terms: the error present in the data, or data fidelity term; and the regularization, or penalty term. The classical approach is to use zero-order Tikhonov or L2-regularization, which uses the L2-norm for both terms in the cost function. Recently, some studies have demonstrated an improved performance using L1-regularization (L1-norm in the penalty term), related to an increase in the spatial resolution and sensitivity of the recovered traction field. In this manuscript, we present a comparison between the previous two regularization schemes (relying on the L2-norm for the data fidelity term) and the full L1-regularization (using the L1-norm for both terms in the cost function) for synthetic and real data.
Results: Our results reveal that L1-regularizations give an improved spatial resolution (more pronounced for full L1-regularization) and a reduction in the background noise with respect to the classical zero-order Tikhonov regularization. In addition, we present an approximation that makes it feasible to recover cellular tractions over whole cells on typical full-size microscope images when working in the spatial domain.
Conclusions: The proposed full L1-regularization improves the sensitivity to recover small stress footprints. Moreover, the proposed method has been validated on full-field microscopy images of real cells, which indicates that it is a promising tool for biological applications.
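The three regularization schemes compared above can be written side by side. The notation is an assumption introduced here, not taken from the abstract: $\mathbf{u}$ denotes the measured displacements, $\mathbf{t}$ the unknown tractions, $G$ the linear forward operator mapping tractions to displacements, and $\lambda > 0$ the regularization parameter.

```latex
\begin{align}
  % Zero-order Tikhonov (L2-regularization): L2-norm in both terms
  \hat{\mathbf{t}}_{\mathrm{L2}} &= \arg\min_{\mathbf{t}}
      \; \lVert G\mathbf{t} - \mathbf{u} \rVert_2^2
      + \lambda \lVert \mathbf{t} \rVert_2^2 \\
  % L1-regularization: L2-norm data fidelity, L1-norm penalty
  \hat{\mathbf{t}}_{\mathrm{L1}} &= \arg\min_{\mathbf{t}}
      \; \lVert G\mathbf{t} - \mathbf{u} \rVert_2^2
      + \lambda \lVert \mathbf{t} \rVert_1 \\
  % Full L1-regularization: L1-norm in both terms
  \hat{\mathbf{t}}_{\mathrm{full\,L1}} &= \arg\min_{\mathbf{t}}
      \; \lVert G\mathbf{t} - \mathbf{u} \rVert_1
      + \lambda \lVert \mathbf{t} \rVert_1
\end{align}
```

In each case the first term is the data fidelity term and the second is the penalty term. The L1 penalty favors sparse traction fields, which is consistent with the reported gain in spatial resolution and the reduction in background noise.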