In this thesis, we formulate the Gauss-Newton algorithm so that it is viable on distributed-memory architectures and competitive with the alternating least squares (ALS) algorithm for CP decomposition. ALS may exhibit slow or no convergence, especially when high accuracy is required. The CP decomposition problem can be cast as a nonlinear least squares problem, to which iterative Newton-like methods can be applied. Directly solving the linear systems involving an approximated Hessian is expensive; however, recent advances have shown that using an implicit representation of the linear system makes these methods competitive with ALS in terms of speed. We provide a parallel implementation of a Gauss-Newton method for CP decomposition, which iteratively solves a linear least squares problem at each Gauss-Newton step. In particular, we leverage a formulation that employs tensor contractions for implicit matrix-vector products within the conjugate gradient method. The use of tensor contractions enables us to employ the Cyclops library for distributed-memory tensor computations, parallelizing the Gauss-Newton approach with a high-level Python implementation. In addition, we introduce a regularization scheme for the Gauss-Newton method that yields better convergence across a variety of tensors. We study the convergence of variants of the Gauss-Newton method relative to ALS for computing exact CP decompositions as well as approximate decompositions of real-world tensors.
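To illustrate the implicit matrix-vector products mentioned above, the following minimal NumPy sketch applies the Gauss-Newton Gramian J^T J of the order-3 CP objective to a perturbation of the factor matrices using only R x R Gramians of the factors, which is what makes each conjugate gradient iteration cheap. This is an illustrative single-process stand-in (the thesis uses the Cyclops library for distributed tensor contractions); the function name gn_hess_matvec and the simple damping parameter reg are assumptions for the sketch, not the regularization scheme developed in the thesis.

```python
import numpy as np

def gn_hess_matvec(A, B, C, dA, dB, dC, reg=0.0):
    """Apply (J^T J + reg*I) of the order-3 CP objective to (dA, dB, dC).

    A, B, C are the current factor matrices (I x R, J x R, K x R);
    dA, dB, dC is the CG direction. Only R x R Gramians are formed,
    so the full Hessian approximation is never built explicitly.
    """
    SA, SB, SC = A.T @ A, B.T @ B, C.T @ C  # R x R factor Gramians
    # Each block: dX (Hadamard product of the other Gramians)
    # plus coupling terms through the perturbed Gramians.
    GA = dA @ (SB * SC) + A @ ((dB.T @ B) * SC + SB * (dC.T @ C)) + reg * dA
    GB = dB @ (SA * SC) + B @ ((dA.T @ A) * SC + SA * (dC.T @ C)) + reg * dB
    GC = dC @ (SA * SB) + C @ ((dA.T @ A) * SB + SA * (dB.T @ B)) + reg * dC
    return GA, GB, GC

# Example usage on random factors of rank R = 4:
I, J, K, R = 30, 20, 10, 4
rng = np.random.default_rng(0)
A, B, C = rng.standard_normal((I, R)), rng.standard_normal((J, R)), rng.standard_normal((K, R))
dA, dB, dC = rng.standard_normal((I, R)), rng.standard_normal((J, R)), rng.standard_normal((K, R))
GA, GB, GC = gn_hess_matvec(A, B, C, dA, dB, dC, reg=1e-6)
```

In a distributed setting, the same matvec structure carries over: the Gramian computations and the small matrix products become tensor contractions expressed through a library such as Cyclops, so the conjugate gradient solver never materializes the Gauss-Newton system.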