Browsing by Subject "Matrix completion"
Now showing 1 - 7 of 7
Item Open Access
Hybrid parallelization of Stochastic Gradient Descent (2022-02) Büyükkaya, Kemal
The purpose of this study is to investigate the efficient parallelization of the Stochastic Gradient Descent (SGD) algorithm for solving the matrix completion problem on a high-performance computing (HPC) platform in a distributed-memory setting. We propose a hybrid parallel decentralized SGD framework with asynchronous communication between processors to show the scalability of parallel SGD up to hundreds of processors. We utilize the Message Passing Interface (MPI) for inter-node communication and POSIX threads for intra-node parallelism. We tested our method on four different real-world benchmark datasets. Experimental results show that the proposed algorithm yields up to 6× better throughput on relatively sparse datasets and displays performance comparable to available state-of-the-art algorithms on relatively dense datasets, while providing a flexible partitioning scheme and a highly scalable hybrid parallel architecture.

Item Open Access
Load balanced locality-aware parallel SGD on multicore architectures for latent factor based collaborative filtering (Elsevier BV, North-Holland, 2023-04-20) Gülcan, Selçuk; Özdal, Muhammet Mustafa; Aykanat, Cevdet
We investigate the parallelization of Stochastic Gradient Descent (SGD) for matrix completion on multicore architectures. We provide an experimental analysis of current SGD algorithms to find out their bottlenecks and limitations. Grid-based methods suffer from load imbalance among the 2D blocks of the rating matrix, especially when datasets are skewed and sparse. Asynchronous methods, on the other hand, can face cache issues due to their memory access patterns. We propose bin-packing-based block balancing methods as an alternative to the recently proposed BaPa method. We then introduce Locality-Aware SGD (LASGD), a grid-based asynchronous parallel SGD algorithm that efficiently utilizes the cache by changing the nonzero update sequence without affecting the factor update order and by carefully arranging the latent factor matrices in memory. Combined with our proposed load balancing methods, our experiments show that LASGD performs significantly better than alternative approaches on parallel shared-memory systems.
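The common computational kernel across all of these items is the SGD update applied once per known rating of the sparse matrix. For reference, here is a minimal single-threaded sketch of that kernel in Python; the function name, learning rate, and regularization constant are illustrative assumptions, not taken from any of the works above:

```python
import numpy as np

def sgd_epoch(rows, cols, vals, U, V, lr=0.01, lam=0.05):
    """One SGD epoch over the known ratings of a sparse matrix.

    rows, cols, vals: coordinates and values of the nonzeros (known ratings).
    U, V: latent factor matrices (m x f and n x f), updated in place.
    """
    for i, j, r in zip(rows, cols, vals):
        err = r - U[i] @ V[j]                  # prediction error on rating (i, j)
        ui = U[i].copy()                       # keep old row for V's update
        U[i] += lr * (err * V[j] - lam * U[i])
        V[j] += lr * (err * ui - lam * V[j])

# toy usage: rank-2 factorization of a 4x4 matrix with 6 known ratings
rng = np.random.default_rng(0)
U, V = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
rows, cols = [0, 0, 1, 2, 3, 3], [0, 2, 1, 3, 0, 2]
vals = [5.0, 3.0, 4.0, 2.0, 1.0, 4.0]
for _ in range(100):
    sgd_epoch(rows, cols, vals, U, V)
```

The parallelization questions studied in these theses and papers all concern how to order and distribute this update loop: which processor or thread owns which nonzeros, and when updated factor rows become visible to the others.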
Item Open Access
Minimizing staleness and communication overhead in distributed SGD for collaborative filtering (IEEE Computer Society, 2023-09-06) Abubaker, Nabil; Caglayan, O.; Karsavuran, M. O.; Aykanat, Cevdet
Distributed asynchronous stochastic gradient descent (ASGD) algorithms that approximate low-rank matrix factorizations for collaborative filtering perform one or more synchronizations per epoch, where staleness is reduced with more synchronizations. However, a high number of synchronizations would prohibit the scalability of the algorithm. We propose a parallel ASGD algorithm, η-PASGD, for efficiently handling η synchronizations per epoch in a scalable fashion. The proposed algorithm puts an upper limit of K on η, for a K-processor system, such that performing η = K synchronizations per epoch eliminates staleness completely. The rating data used in collaborative filtering are usually represented as sparse matrices. The sparsity allows for a combinatorial reduction in staleness and communication overhead via intelligently distributing the data to processors. We analyze the staleness and the total volume incurred during an epoch of η-PASGD. Following this analysis, we propose a hypergraph partitioning model to encapsulate reducing staleness and volume while minimizing the maximum number of synchronizations required for a stale-free SGD. This encapsulation is achieved with a novel cutsize metric that is realized via a new recursive-bipartitioning-based algorithm. Experiments on up to 512 processors show the importance of the proposed partitioning method in improving staleness, volume, RMSE, and parallel runtime.

Item Open Access
Novel algorithms and models for scaling parallel sparse tensor and matrix factorizations (2022-07) Abubaker, Nabil F. T.
Two important and widely used factorization algorithms, namely CPD-ALS for sparse tensor decomposition and distributed stratified SGD for low-rank matrix factorization, suffer from limited scalability. In CPD-ALS, the computational load associated with a tensor/subtensor assigned to a processor is a function of the nonzero counts as well as the fiber counts of the tensor when CSF storage is utilized. The tensor fibers fragment as a result of nonzero distributions, which makes balancing the computational loads a hard problem. Two strategies are proposed to tackle the balancing problem on an existing fine-grain hypergraph model: a novel weighting scheme to cover the cost of fibers in the true load, as well as an augmentation of the hypergraph with fiber nets to encode reducing the increase in computational load. CPD-ALS also suffers from high latency overhead due to the high number of point-to-point messages incurred as the processor count increases. A framework is proposed to limit the number of messages to O(log₂ K), for a K-processor system, exchanged in log₂ K stages. A hypergraph-based method is proposed to encapsulate the communication of the new log₂ K-stage algorithm. In existing stratified SGD implementations, the volume of communication is proportional to one of the dimensions of the input matrix, which prohibits scalability. Exchanging the essential data necessary for the correctness of the SSGD algorithm as point-to-point messages is proposed to reduce the volume. This, although invaluable for reducing the bandwidth overhead, would increase the upper bound on the number of exchanged messages from O(K) to O(K²), rendering the algorithm latency-bound. A novel Hold-and-Combine algorithm is proposed to exchange the essential communication volume with up to O(K log K) messages. Extensive experiments on HPC systems demonstrate the importance of the proposed algorithms and models in scaling CPD-ALS and stratified SGD.
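The η-PASGD abstract above hinges on splitting an epoch into η sub-epochs with a synchronization after each: the more often updated factor rows are exchanged, the less stale the copies other processors compute with. A minimal sketch of that control flow, with hypothetical apply_update and synchronize callbacks standing in for the local factor-row updates and the actual inter-processor exchange:

```python
import numpy as np

def epoch_with_syncs(nonzeros, eta, apply_update, synchronize):
    """One SGD epoch split into eta sub-epochs, synchronizing after each.

    Larger eta means updated factor rows circulate sooner, so the copies
    other processors read are less stale; eta = K on a K-processor system
    corresponds to the stale-free extreme described in the abstract.
    """
    for chunk in np.array_split(nonzeros, eta):
        for i, j, r in chunk:
            apply_update(int(i), int(j), r)   # local factor-row update
        synchronize()                          # exchange updated rows

# toy usage: 4 local nonzeros processed in 2 sub-epochs
nz = np.array([(0, 0, 5.0), (1, 1, 3.0), (2, 2, 4.0), (3, 0, 2.0)])
epoch_with_syncs(nz, eta=2,
                 apply_update=lambda i, j, r: None,
                 synchronize=lambda: print("sync"))
```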
Item Open Access
Parallel stochastic gradient descent on multicore architectures (2020-09) Gülcan, Selçuk
The focus of this thesis is the efficient parallelization of the Stochastic Gradient Descent (SGD) algorithm for matrix completion problems on multicore architectures. Asynchronous methods and block-based methods utilizing 2D grid partitioning for task-to-thread assignment are commonly used approaches for shared-memory parallelization. However, asynchronous methods can have performance issues due to their memory access patterns, whereas grid-based methods can suffer from load imbalance, especially when datasets are skewed and sparse. In this thesis, we first analyze the parallel performance bottlenecks of existing SGD algorithms in detail. Then, we propose new algorithms to alleviate these performance bottlenecks. Specifically, we propose bin-packing-based algorithms to balance thread loads under 2D partitioning. We also propose a grid-based asynchronous parallel SGD algorithm that improves cache utilization by changing the entry update order without affecting the factor update order and by rearranging the memory layouts of the latent factor matrices. Our experiments show that the proposed methods perform significantly better than the existing approaches on shared-memory multicore systems.

Item Open Access
Parallel stochastic gradient descent with sub-iterations on distributed memory systems (2022-02) Çağlayan, Orhun
We investigate the parallelization of the stochastic gradient descent (SGD) algorithm for solving the matrix completion problem. Applications in the literature show that stale data usage and communication costs are important concerns that affect the performance of parallel SGD applications. We first briefly visit the stochastic gradient descent algorithm and matrix partitioning for parallel SGD. Then we define the stale data problem and communication costs. In order to improve the performance of parallel SGD, we propose a new algorithm with intra-iteration synchronization (referred to as sub-iterations) to decrease communication costs and stale data usage. Experimental results show that using sub-iterations can decrease staleness by up to 95% and communication volume by up to 47%. Furthermore, using sub-iterations can improve test error by up to 60% when compared to a conventional parallel SGD implementation that does not use sub-iterations.

Item Open Access
Scaling stratified stochastic gradient descent for distributed matrix completion (Institute of Electrical and Electronics Engineers, 2023-10-01) Abubaker, Nabil; Karsavuran, M. O.; Aykanat, Cevdet
Stratified SGD (SSGD) is the primary approach for achieving serializable parallel SGD for matrix completion. State-of-the-art parallelizations of SSGD fail to scale due to large communication overhead. During an SGD epoch, these methods send data proportional to one of the dimensions of the rating matrix. We propose a framework for scalable SSGD that significantly reduces the communication overhead by exchanging point-to-point messages, utilizing the sparsity of the rating matrix. We provide formulas to represent the essential communication for correctly performing parallel SSGD, and we propose a dynamic programming algorithm for efficiently computing them to establish the point-to-point message schedules. This scheme, however, significantly increases the number of messages sent by a processor per epoch from O(K) to O(K²) for a K-processor system, which might limit the scalability. To remedy this, we propose a Hold-and-Combine strategy to limit the upper bound on the number of messages sent per processor to O(K log K). We also propose a hypergraph partitioning model that correctly encapsulates reducing the communication volume. Experimental results show that the framework successfully achieves scalable distributed SSGD by significantly reducing the communication overhead. Our code is publicly available at: github.com/nfabubaker/CESSGD
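For context on the stratified SGD (SSGD) approach that the last item builds on: the rating matrix is blocked into a K × K grid, and an epoch proceeds in K strata, each containing K blocks that share no rows or columns, so K processors can update their blocks concurrently without conflicting factor-row accesses. Below is a minimal sketch of the classic diagonal stratum schedule; this is an illustration of the general technique only, as the paper's actual schedules and communication layer are considerably more involved:

```python
def stratum_schedule(K):
    """Diagonal stratification for stratified SGD.

    In stratum s, processor p works on block (p, (p + s) % K) of the
    K x K blocked rating matrix. Blocks within a stratum share no rows
    or columns, so the K concurrent updates touch disjoint slices of
    the factor matrices, which is what makes SSGD serializable.
    """
    return [[(p, (p + s) % K) for p in range(K)] for s in range(K)]

# toy usage: the 3 strata of a 3-processor system
for s, blocks in enumerate(stratum_schedule(3)):
    print(f"stratum {s}: {blocks}")
# stratum 0: [(0, 0), (1, 1), (2, 2)]
# stratum 1: [(0, 1), (1, 2), (2, 0)]
# stratum 2: [(0, 2), (1, 0), (2, 1)]
```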