Various high-throughput technologies have fueled advances in biomedical research over the last decade. Two typical examples are gene expression and genomic hybridization microarrays, which quantify RNA and DNA levels, respectively. The high-dimensional data sets generated by these technologies present novel opportunities to discover relationships not only among interrogating probes (i.e., genes) but also among interrogated specimens (i.e., samples). At the same time, however, the need to model the variability within and between different high-throughput platforms has created novel statistical challenges. In this thesis, I address these opportunities and challenges with three algorithms. First, I present DynBoost, a new method to infer gene-gene dependence relationships and nonlinear dynamics in gene regulatory networks. DynBoost is a flexible boosting algorithm that combines features of L2-boosting and randomization-based algorithms to perform both parameter learning and network inference. Its performance was evaluated on a number of benchmark data sets from the DREAM3 challenge, and the results strongly indicate that it outperforms existing approaches. Second, I revisit consensus clustering (CC) and several other clustering methods in the context of unsupervised sample subtype discovery. I show that many unsupervised partitioning methods will divide homogeneous data into any pre-specified number of clusters, and that CC can report apparent stability for such chance partitions of random data. I conclude that CC is a powerful tool for minimizing false negatives in the presence of genuine structure, but that it can lead to false positives in the exploratory phase of many studies unless implementation and inference are carried out with caution and in line with prudent practices. Lastly, I present MPCBS, a new method that integrates DNA copy number analysis across different platforms by pooling statistical evidence during segmentation. By comparing the integrated analysis of Affymetrix and Illumina SNP array data with Agilent and fosmid clone end-sequencing results on 8 HapMap samples, I show that MPCBS achieves improved spatial resolution and detection power and provides a natural consensus across platforms.
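The point about consensus clustering on structureless data can be probed with a small experiment. The sketch below is not the implementation used in the thesis: it assumes a plain k-means base clusterer, an 80% subsampling rate, 50 resamples, and a single two-dimensional Gaussian cloud as the "homogeneous" data, and all function and variable names are hypothetical. It partitions the cloud into k = 3 groups and builds a consensus matrix whose entry (i, j) is the fraction of resamples in which samples i and j were clustered together, out of those in which both were drawn.

```python
import numpy as np


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns one integer label per row of X."""
    # Initialise centers at k distinct data points (fancy indexing copies).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # keep the old center if a cluster empties
                centers[c] = X[labels == c].mean(axis=0)
    return labels


def consensus_matrix(X, k, n_resamples=50, frac=0.8, seed=0):
    """Fraction of resamples in which each pair of samples co-clusters."""
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))  # times i and j shared a cluster
    sampled = np.zeros((n, n))   # times i and j were drawn together
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = kmeans(X[idx], k, rng)
        same = (labels[:, None] == labels[None, :]).astype(float)
        sampled[np.ix_(idx, idx)] += 1.0
        together[np.ix_(idx, idx)] += same
    # Pairs never drawn together are left at 0 instead of dividing by zero.
    return np.divide(together, sampled,
                     out=np.zeros_like(together), where=sampled > 0)


# Homogeneous data: one Gaussian cloud with no true subtypes (an assumption
# made purely for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))

labels = kmeans(X, 3, rng)
M = consensus_matrix(X, k=3)

# Consensus values near 0 or 1 are what is usually read as "stable" structure.
off_diag = M[np.triu_indices(len(X), k=1)]
print("cluster sizes:", np.bincount(labels, minlength=3))
print("share of pairs with consensus < 0.1 or > 0.9:",
      np.mean((off_diag < 0.1) | (off_diag > 0.9)))
```

Running the same code on data with genuinely separated groups, and comparing the two shares of near-0 or near-1 consensus values, gives a rough sense of how much apparent stability arises by chance alone.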
Developing and Application of Statistical Algorithms for High-Dimensional Biological Data Analysis