In drug discovery, thousands of compounds are assayed to detect activity against a biological target. The goal is to identify compounds that are active against the target (e.g. inhibit a virus). Statistical learning in drug discovery seeks to build a model that uses descriptors characterizing molecular structure to predict biological activity. However, the characteristics of drug discovery data can make it difficult to model the relationship between molecular descriptors and biological activity. Among these characteristics are the rarity of active compounds, the large volume of compounds tested by high-throughput screening, and the complexity of molecular structure and its relationship to activity.

This thesis focuses on the design of statistical learning algorithms/models and their applications to drug discovery. The two main parts of the thesis are an algorithm-based statistical method and a more formal model-based approach. Both approaches can facilitate and accelerate the process of developing new drugs. A unifying theme is the use of unsupervised methods as components of supervised learning algorithms/models.

In the first part of the thesis, we explore a sequential screening approach, Cluster Structure-Activity Relationship Analysis (CSARA). Sequential screening integrates high-throughput screening with mathematical modeling to sequentially select the best compounds. CSARA is a cluster-based and algorithm-driven method. To gain further insight into this method, we use three carefully designed experiments to compare its predictive accuracy with that of Recursive Partitioning, a popular structure-activity relationship analysis method. The experiments show that CSARA outperforms Recursive Partitioning. Comparisons include problems with many descriptor sets and situations in which many descriptors are not important for activity.

In the second part of the thesis, we propose and develop constrained mixture discriminant analysis (CMDA), a model-based method.
The main idea of CMDA is to model the distribution of the observations given the class label (e.g. active or inactive class) as a constrained mixture distribution, and then use Bayes' rule to predict the probability of being active for each observation in the testing set. Constraints are used to deal with the otherwise explosive growth of the number of parameters with increasing dimensionality. CMDA is designed to solve several challenges in modeling drug data sets, such as multiple mechanisms, the rare target problem (i.e. imbalanced classes), and the identification of relevant subspaces of descriptors (i.e. variable selection).

We focus on the CMDA1 model, in which univariate densities form the building blocks of the mixture components. Because the CMDA1 log-likelihood function is unbounded, the EM algorithm can easily converge to degenerate solutions. A special multi-step EM algorithm is therefore developed and explored via several experimental comparisons. Using the multi-step EM algorithm, the CMDA1 model is compared to model-based clustering discriminant analysis (MclustDA). The CMDA1 model is either superior to or competitive with the MclustDA model, depending on which model generates the data. The CMDA1 model performs better than the MclustDA model when the data are high-dimensional and unbalanced, an essential feature of the drug discovery problem.

An alternate approach to the problem of degeneracy is penalized estimation. By introducing a group of simple penalty functions, we consider penalized maximum likelihood estimation of the CMDA1 and CMDA2 models. This strategy improves the convergence of the conventional EM algorithm and helps avoid degenerate solutions. Extending techniques from Chen et al. (2007), we prove that the penalized maximum likelihood estimators (PMLEs) of the two-dimensional CMDA1 model can be asymptotically consistent.
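
The prediction step described above (class-conditional mixture densities combined through Bayes' rule) can be illustrated with a minimal sketch. This is not the thesis's CMDA implementation: the function names, the choice of normal densities, and all parameter values are hypothetical, the parameters are fixed by hand rather than estimated by EM, and no constraints or penalties are applied. Each mixture component is a product of univariate densities, loosely echoing the building blocks mentioned for CMDA1.

```python
import math


def normal_pdf(x, mean, sd):
    """Univariate normal density."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_density(x, components):
    """Density of a finite mixture at the point x.

    components is a list of (weight, means, sds) tuples; each component
    is a product of univariate normal densities, one per descriptor.
    """
    total = 0.0
    for weight, means, sds in components:
        value = weight
        for xj, m, s in zip(x, means, sds):
            value *= normal_pdf(xj, m, s)
        total += value
    return total


def posterior_active(x, prior_active, active_mix, inactive_mix):
    """Bayes' rule: P(active | x) from the two class-conditional mixtures."""
    numerator = prior_active * mixture_density(x, active_mix)
    denominator = numerator + (1.0 - prior_active) * mixture_density(x, inactive_mix)
    return numerator / denominator


# Hypothetical, hand-picked parameters (not estimated from any data set).
# Two active components stand in for two mechanisms; the second descriptor
# has the same distribution in both classes, so it carries no information.
ACTIVE_MIX = [
    (0.5, (2.0, 0.0), (0.5, 1.0)),
    (0.5, (-2.0, 0.0), (0.5, 1.0)),
]
INACTIVE_MIX = [(1.0, (0.0, 0.0), (1.0, 1.0))]
PRIOR_ACTIVE = 0.05  # active compounds are rare

p_near = posterior_active((2.0, 0.0), PRIOR_ACTIVE, ACTIVE_MIX, INACTIVE_MIX)
p_far = posterior_active((0.0, 0.0), PRIOR_ACTIVE, ACTIVE_MIX, INACTIVE_MIX)
print(f"P(active | near an active component) = {p_near:.3f}")
print(f"P(active | centre of inactive class) = {p_far:.6f}")
```

Under these assumed parameters, a compound near one of the active components receives a much higher posterior probability of activity than one at the centre of the inactive class, even though the small prior keeps both probabilities well below one.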
Statistical Learning in Drug Discovery via Clustering and Mixtures