Lung diseases include some of the most widespread and deadly conditions known to affect people around the world. The challenges of clinically tackling lung diseases arise from multiple factors associated with both diagnosis and treatment strategies. In my thesis work, I addressed the challenges of diagnosis and treatment by leveraging tools in computational systems biology that try to account for the effects of the multitude of molecular components and their interactions that comprise both the human host and microbial pathogens. The biomarker discovery field is replete with molecular signatures that have not translated into the clinic despite ostensibly promising performance in predicting disease phenotypes. One widely cited reason is a lack of classification consistency, largely due to failure to maintain performance from study to study. This failure is widely attributed to variability in data collected for the same phenotype among disparate studies, due to technical factors unrelated to phenotypes (e.g., laboratory settings resulting in "batch-effects") and non-phenotype-associated biological variation in the underlying populations. These sources of variability persist in new data collection technologies.
In the first part of my thesis work, I quantified the impact of these combined "study-effects" on the predictive performance of lung disease signatures by comparing two types of validation methods: ordinary randomized cross-validation (RCV), which extracts random subsets of samples for testing, and inter-study validation (ISV), which excludes an entire study for testing. Whereas RCV hardwires an assumption of training and testing on identically distributed data, this key property is lost in ISV, yielding systematic decreases in performance estimates relative to RCV. Measuring the RCV-ISV difference as a function of the number of studies quantifies the influence of study-effects on performance.
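The difference between the two validation schemes comes down to how test samples are chosen. The following is a minimal sketch of the two splitting strategies, not the thesis implementation: the function names `rcv_splits` and `isv_splits`, the fold count, and the toy study labels are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def rcv_splits(n_samples, n_folds, seed=0):
    """Randomized cross-validation: shuffle all samples, then deal them into folds.

    Samples from every study can land in both the training and the test set.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


def isv_splits(study_ids):
    """Inter-study validation: hold out every sample of one study at a time.

    The classifier never sees the held-out study during training.
    """
    by_study = defaultdict(list)
    for i, study in enumerate(study_ids):
        by_study[study].append(i)
    for held_out in sorted(by_study):
        test = by_study[held_out]
        train = [i for i, study in enumerate(study_ids) if study != held_out]
        yield held_out, train, test


# Hypothetical toy data: eight samples drawn from three studies.
study_ids = ["A", "A", "A", "B", "B", "C", "C", "C"]

for train, test in rcv_splits(len(study_ids), n_folds=4):
    print("RCV test studies:", sorted({study_ids[i] for i in test}))

for held_out, train, test in isv_splits(study_ids):
    print("ISV held-out study:", held_out, "test indices:", test)
```

Under these assumptions, any classifier can be trained on `train` and scored on `test` for each split; the gap between the mean RCV score and the mean ISV score is then the quantity tracked as studies are added.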
As a case study, I gathered publicly available gene expression data from 1,470 microarray samples of 6 lung phenotypes from 26 independent experimental studies and 769 RNA-seq samples of 2 lung phenotypes from 4 independent studies. I found that the RCV-ISV performance discrepancy is greater in phenotypes with few studies, and that the ISV performance converges toward RCV performance as data from additional studies are incorporated into classification. In this work, I showed that by examining how fast ISV performance approaches RCV as the number of studies is increased, one can estimate when "sufficient" diversity has been achieved for learning a molecular signature likely to translate without significant loss of accuracy to new clinical settings.
In the second part of my thesis, I turned my focus to an important lung pathogen to engage the challenge, ultimately, of therapeutic development. I focused on tuberculosis and its causative pathogen, Mycobacterium tuberculosis (MTB), which causes over a million deaths worldwide each year and has strains resistant to antibiotics. My goal was to better understand the underlying mechanisms associated with MTB response under different genetic and environmental perturbations of metabolism and regulation, which may translate into novel drug targets. As a step toward this goal, I expanded a genome-scale regulatory-metabolic model for MTB using the Probabilistic Regulation of Metabolism (PROM) framework. This model represents a substantial knowledge base update, incorporating a ChIP-seq-based transcription factor binding network containing 2,556 interactions linking 104 transcription factors (TFs) to 647 metabolic genes, as well as an expanded metabolic model that can predict growth in a broad range of environmental conditions.
My expanded model improves agreement between predicted growth effects of TF knockouts and gene essentiality experiments compared to the original PROM MTB model and can successfully predict growth defects associated with TF overexpression. The simulated growth predictions identify perturbations that lead to condition-specific growth defects, generating experimentally testable hypotheses of mechanisms underlying perturbation-induced phenotypes in MTB.
Systems biology of lung diseases: a multi-perspective view from host and pathogen