The class of computational problems I consider in this thesis share the common trait of requiringconsideration of pairs (or higher-order tuples)of data points. I focus on the problem of kernel summation operations ubiquitous in many data mining and scientific algorithms.In machine learning, kernel summations appear in popular kernel methods which can model nonlinearstructures in data. Kernel methods include manynon-parametric methods such as kernel density estimation, kernel regression, Gaussian processregression, kernel PCA, and kernel support vectormachines (SVM). In computational physics, kernel summations occur inside the classicalN-body problem for simulating positions of a setof celestial bodies or atoms.This thesis attempts to marry, for the first time, the best relevant techniques in parallelcomputing, where kernel summations are in lowdimensions, with the best general-dimension algorithms from the machine learning literature.We provide a unified, efficient parallelkernel summation framework that can utilize:(1) various types of deterministic and probabilistic approximations that may besuitable for both low and high-dimensional problems with a large number of data points;(2) indexing the data using any multi-dimensionalbinary tree with both distributed memory (MPI)and shared memory (OpenMP/Intel TBB) parallelism;(3) a dynamic load balancing scheme to adjustwork imbalances during the computation.I will first summarize my previous research in serial kernel summation algorithms. This workstarted from Greengard/Rokhlin's earlier work onfast multipole methods for the purpose of approximating potential sums of many particles. The contributions of this part of this thesisinclude the followings: (1) reinterpretation of Greengard/Rokhlin's work for the computer sciencecommunity; (2) the extension of the algorithms touse a larger class of approximation strategies,i.e. probabilistic error bounds via Monte Carlo techniques; (3) the multibody series expansion:the generalization of the theory of fast multipole methods to handle interactions of more than two entities; (4) the first O(N) proof of the batch approximate kernel summation using anotion of intrinsic dimensionality. Then I moveonto the problem of parallelization of the kernel summations and tackling the scaling of two otherkernel methods, Gaussian process regression(kernel matrix inversion) and kernel PCA (kernel matrix eigendecomposition).The artifact of this thesis has contributed to an open-source machine learning package calledMLPACK which has been first demonstrated at the NIPS 2008 and subsequently at the NIPS 2011 BigLearning Workshop. Completing a portion of this thesis involved utilization of high performancecomputing resource at XSEDE (eXtreme Science andEngineering Discovery Environment) and NERSC(National Energy Research Scientific Computing Center).
【 预 览 】
附件列表
Files
Size
Format
View
A distributed kernel summation framework for machine learning and scientific applications