The last decade has seen an explosion of data arising from the development and proliferation of high-throughput data gathering and analysis pipelines. To transform this data into useful hypotheses and conclusions, it is necessary to determine which of it is pertinent to the problem being studied and, conversely, which of the many hypotheses under consideration is best supported by the data at hand. The field of proteomics grapples with this challenge particularly often, because it sits at the confluence of a large number of high-throughput data pipelines. This work presents a series of computational frameworks that address this challenge in a manner that is both computationally efficient and biologically informative, acting as selective filters for the vast amount of data being processed.
A system is first presented to vastly reduce the potential combinatoric complexity of post-translational modifications (PTMs) and coding single nucleotide polymorphisms (cSNPs) for Top Down proteomics. Top Down proteomics is uniquely susceptible to a combinatorial explosion: as sequence length increases, the number of potential combinations of mass shift-inducing sequence features increases exponentially. This may be addressed to some extent by the process of shotgun annotation, where only combinations of known PTMs and cSNPs are considered. This is in contrast to the rule-based variable modification approach prevalent in Bottom Up proteomics, where all residues of a given type are considered to be potentially modified in a specified manner. However, as high-throughput annotation pipelines vastly increase the number of known modifications and polymorphisms, the number of their combinations grows exponentially and eventually becomes unmanageable. It becomes necessary to restrict the potential combination space in a manner that does not unduly impinge on the identification and characterization capabilities of shotgun annotation.
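
The growth described above can be made concrete with a short counting sketch. It is an illustration only, not the system presented in this work: the function names are hypothetical, each site is assumed to be either unmodified or to carry exactly one of its known alternatives, and the cap on simultaneous features is an assumed, simple stand-in for a restriction of the combination space.

```python
from math import prod


def count_forms(options_per_site):
    """Count distinct sequence forms when every annotated site is either
    left unmodified or carries one of its known alternatives.

    options_per_site[i] is the number of known PTMs/cSNPs at site i.
    """
    return prod(k + 1 for k in options_per_site)


def count_forms_capped(options_per_site, max_features):
    """Same count, but with at most `max_features` sites modified at once
    (an assumed restriction, used here only for comparison)."""
    # coeffs[j] = number of forms with exactly j modified sites
    coeffs = [1] + [0] * max_features
    for k in options_per_site:
        # iterate downwards so each site is used at most once per form
        for j in range(max_features, 0, -1):
            coeffs[j] += coeffs[j - 1] * k
    return sum(coeffs)


if __name__ == "__main__":
    for n_sites in (5, 10, 20, 40):
        sites = [1] * n_sites  # one known feature per site
        print(n_sites, count_forms(sites), count_forms_capped(sites, 3))
```

With one known feature per site the unrestricted count is 2^n, so doubling the number of annotated sites squares the size of the database, while the capped count grows only polynomially.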
Built as part of a general framework for sequence transformation, the system being presented utilizes a genetic algorithm to identify a group of PTMs and cSNPs that is most suitable for inclusion in a shotgun-annotated sequence database. Additionally, a number of other advancements are presented in the bioinformatics of Top Down proteomics, including a cluster implementation of the ProSight search engine, and a design plan for the next generation of ProSight, built using the principles of online sequence transformation and optimization. This addresses the combinatorial explosion by providing means of efficiently restricting the search space, minimizing the amount of duplicated effort, and leveraging modern processor design to maximize throughput.
Second, genetic algorithms are applied to the problem of de novo peptide sequencing in Bottom Up proteomics by means of ultra-high-resolution mass spectrometry. Rather than detecting large numbers of less accurate fragment peaks as is presently typical in Bottom Up proteomics, detecting fragment ions at high resolution results in smaller numbers of highly accurate monoisotopic masses after deisotoping. This allows potential de novo sequence solutions to have exceedingly low fragment mass degeneracy. Presently, algorithms for de novo peptide sequencing that fully take advantage of this capability have been lacking. A system is presented for incorporating numerous metrics of solution quality simultaneously to evolve a sequence solution that best fits available data. The nature of proteomic data and its amenability to analysis by means of genetic algorithms is discussed. This system demonstrates highly confident automatic de novo peptide sequencing using a small number of confident fragment masses, potentially measured at the limits of detection.
Third, a system is presented for the efficient discovery of protein-DNA interactions by means of multiple simultaneous gene expression measurements. A major problem in discovering transcription factor binding motifs is that identifying overrepresented sequence motifs is insufficient; most are noise, and some only bind transcription factors under specific biological conditions. It is possible to identify real motifs by the correlation of their presence to differential gene expression under a particular biological condition. By employing multivariate penalized regression, the system described is capable of efficiently identifying transcription factor binding motifs whose presence strongly correlates with gene expression in the measured biological condition from amongst hundreds of candidates. A small, highly confident set of motifs is selected, which may be used for further bioinformatic studies, or as targets for in vivo or in vitro experiments.
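
The idea of selecting motifs by penalized regression can be sketched as follows. This is a minimal illustration under stated assumptions, not the system described: it assumes an L1 (lasso) penalty, a single expression response rather than several simultaneous ones, and a plain cyclic coordinate descent solver. All function names and the toy data in the usage note are hypothetical.

```python
def center(values):
    """Subtract the mean so that no intercept term is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def soft_threshold(value, penalty):
    """Shrink `value` toward zero by `penalty`; small values become exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(columns, y, penalty, n_iter=200):
    """Minimise (1/2n)*||y - Xw||^2 + penalty*||w||_1.

    `columns[j]` holds the presence/count of candidate motif j for each gene,
    and `y` holds one differential expression value per gene.
    """
    n = len(y)
    cols = [center(col) for col in columns]
    residual = center(y)  # equals y - Xw while w is all zeros
    w = [0.0] * len(cols)
    col_sq = [sum(v * v for v in col) / n for col in cols]
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            if col_sq[j] == 0.0:
                continue  # motif present in all genes or in none
            # covariance of motif j with the residual that excludes motif j
            rho = sum(x * (r + x * w[j]) for x, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - x * delta for x, r in zip(col, residual)]
                w[j] = new_w
    return w


def select_motifs(names, columns, y, penalty):
    """Return the names of motifs that keep a non-zero coefficient."""
    weights = lasso_coordinate_descent(columns, y, penalty)
    return [name for name, wj in zip(names, weights) if wj != 0.0]
```

For example, with four made-up candidate motifs scored across eight genes, where expression depends only on the first and third, a call such as `select_motifs(["motif_A", "motif_B", "motif_C", "motif_D"], columns, y, penalty=0.1)` keeps only those two. The penalty drives the coefficients of uninformative motifs to exactly zero, which is what yields a small selected set rather than a ranked list of all candidates.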
The complexity of bioinformatics: techniques for addressing the combinatorial explosion in proteomics and genomics