Browsing by Author "Abubaker, Nabil"
Item Open Access
Minimizing staleness and communication overhead in distributed SGD for collaborative filtering (IEEE Computer Society, 2023-09-06)
Abubaker, Nabil; Caglayan, O.; Karsavuran, M. O.; Aykanat, Cevdet
Distributed asynchronous stochastic gradient descent (ASGD) algorithms that approximate low-rank matrix factorizations for collaborative filtering perform one or more synchronizations per epoch, and staleness decreases as the number of synchronizations increases. However, a high number of synchronizations prohibits the scalability of the algorithm. We propose a parallel ASGD algorithm, η-PASGD, for efficiently handling η synchronizations per epoch in a scalable fashion. The proposed algorithm puts an upper limit of K on η for a K-processor system, such that performing η = K synchronizations per epoch eliminates the staleness completely. The rating data used in collaborative filtering are usually represented as sparse matrices. This sparsity allows the staleness and communication overhead to be reduced combinatorially by intelligently distributing the data among processors. We analyze the staleness and the total communication volume incurred during an epoch of η-PASGD. Following this analysis, we propose a hypergraph partitioning model that encapsulates reducing staleness and volume while minimizing the maximum number of synchronizations required for a stale-free SGD. This encapsulation is achieved with a novel cutsize metric that is realized via a new recursive-bipartitioning-based algorithm. Experiments on up to 512 processors show the importance of the proposed partitioning method in improving staleness, volume, RMSE and parallel runtime.

Item Open Access
Scalable unsupervised ML: Latency hiding in distributed sparse tensor decomposition (IEEE Computer Society, 2022-11-01)
Abubaker, Nabil; Karsavuran, M. Ozan; Aykanat, Cevdet
Latency overhead in distributed-memory parallel CPD-ALS scales with the number of processors, limiting the scalability of computing the CPD of large, irregularly sparse tensors. This overhead comes in the form of sparse reduce and expand operations performed on factor-matrix rows via point-to-point messages. We propose to hide the latency overhead by embedding all of the point-to-point messages incurred by the sparse reduce and expand into dense collective operations that already exist in CPD-ALS. The conventional parallel CPD-ALS algorithm is not amenable to this embedding, so we propose a computation/communication rearrangement to enable it. We embed the sparse expand and reduce into a hypercube-based ALL-REDUCE operation to limit the latency overhead to O(log2 K) for a K-processor system. The embedding comes at the cost of increased bandwidth overhead due to the multi-hop routing of factor-matrix rows during the embedded-ALL-REDUCE. We propose an embedding scheme that takes advantage of the expand/reduce properties to reduce this overhead. Furthermore, we propose a novel recursive bipartitioning framework that enables simultaneous hypergraph partitioning and subhypergraph-to-subhypercube mapping to achieve a subtensor-to-processor assignment with the objective of reducing the bandwidth overhead during the embedded-ALL-REDUCE. We also propose a bin-packing-based algorithm for factor-matrix-row-to-processor assignment, aiming at reducing processors' maximum send and receive volumes during the embedded-ALL-REDUCE. Experiments on up to 4096 processors show that the proposed framework scales significantly better than the state-of-the-art point-to-point method.
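To make the staleness-versus-synchronization trade-off in the η-PASGD abstract above concrete, the following toy simulation (not the paper's algorithm) models K workers running one epoch of K sub-epochs under a round-robin block schedule with η evenly spaced synchronizations; the schedule, the single shared version counter per block, and the `epoch_staleness` helper are illustrative assumptions. It only demonstrates the stated property that staleness drops to zero when η = K.

```python
# Toy model (not eta-PASGD itself): K workers, one epoch of K sub-epochs,
# eta evenly spaced synchronizations that refresh every worker's copies of
# remotely updated factor blocks. "Staleness" counts, for each block a
# worker reads, how many remote updates it has missed since the last
# synchronization. With eta == K the count is always zero.

def epoch_staleness(K: int, eta: int) -> int:
    assert 1 <= eta <= K and K % eta == 0
    sync_period = K // eta          # sub-epochs between synchronizations
    version = [0] * K               # updates applied to each factor block so far
    last_sync = [0] * K             # block versions seen at the last synchronization
    total_staleness = 0
    for t in range(K):              # sub-epochs of one epoch
        if t % sync_period == 0:    # synchronization point: everyone refreshes
            last_sync = version[:]
        for p in range(K):          # worker p processes block (p + t) % K in sub-epoch t
            b = (p + t) % K
            total_staleness += version[b] - last_sync[b]  # missed remote updates
            version[b] += 1         # worker p updates block b
    return total_staleness

print(epoch_staleness(8, 1), epoch_staleness(8, 4), epoch_staleness(8, 8))  # 224 32 0
```

In this toy model, more synchronizations per epoch strictly reduce the number of missed remote updates, and η = K removes them entirely, which is the intuition the abstract builds on.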
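For the latency-hiding item above, the O(log2 K) latency of a hypercube-based ALL-REDUCE is the key quantity. The sketch below is a minimal, self-contained simulation of a plain hypercube (recursive-doubling) all-reduce over scalar values; it does not perform the paper's embedding of sparse expand/reduce messages, and the function name and scalar payload are illustrative assumptions.

```python
# Minimal simulation of a hypercube all-reduce over K = 2^d "processors".
# Each processor exchanges one message per dimension with the partner whose
# rank differs in that bit, so every processor sends and receives exactly
# d = log2(K) messages -- the latency bound cited in the abstract. The
# reduced value is a simple sum here; in CPD-ALS it would be dense
# factor-matrix data (plus, in the paper, embedded sparse rows).

def hypercube_allreduce(values):
    K = len(values)
    assert K & (K - 1) == 0, "K must be a power of two"
    vals = list(values)
    d = K.bit_length() - 1
    for dim in range(d):                      # log2(K) communication steps
        nxt = vals[:]
        for p in range(K):
            partner = p ^ (1 << dim)          # neighbor along this hypercube dimension
            nxt[p] = vals[p] + vals[partner]  # combine local and received partial results
        vals = nxt
    return vals                               # every processor ends with the full sum

print(hypercube_allreduce([1, 2, 3, 4, 5, 6, 7, 8]))  # [36, 36, ..., 36]
```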
Item Open Access
Scaling stratified stochastic gradient descent for distributed matrix completion (Institute of Electrical and Electronics Engineers, 2023-10-01)
Abubaker, Nabil; Karsavuran, M. O.; Aykanat, Cevdet
Stratified SGD (SSGD) is the primary approach for achieving serializable parallel SGD for matrix completion. State-of-the-art parallelizations of SSGD fail to scale due to large communication overhead: during an SGD epoch, these methods send data proportional to one of the dimensions of the rating matrix. We propose a framework for scalable SSGD that significantly reduces the communication overhead by exchanging point-to-point messages that exploit the sparsity of the rating matrix. We provide formulas representing the essential communication for correctly performing parallel SSGD, and we propose a dynamic programming algorithm for efficiently computing them to establish the point-to-point message schedules. This scheme, however, significantly increases the number of messages sent by a processor per epoch from O(K) to O(K²) for a K-processor system, which might limit scalability. To remedy this, we propose a Hold-and-Combine strategy that limits the upper bound on the number of messages sent per processor to O(K lg K). We also propose a hypergraph partitioning model that correctly encapsulates reducing the communication volume. Experimental results show that the framework successfully achieves scalable distributed SSGD by significantly reducing the communication overhead. Our code is publicly available at: github.com/nfabubaker/CESSGD

Item Open Access
Spatiotemporal graph and hypergraph partitioning models for sparse matrix-vector multiplication on many-core architectures (IEEE Computer Society, 2019)
Abubaker, Nabil; Akbudak, K.; Aykanat, Cevdet
There exist graph/hypergraph partitioning-based row/column reordering methods for encoding either spatial or temporal locality for sparse matrix-vector multiplication (SpMV) operations. Spatial and temporal hypergraph models in these methods are extended to encapsulate both spatial and temporal localities based on the cut/uncut net categorization obtained from vertex partitioning. These extensions of the spatial and temporal hypergraph models encode the spatial locality primarily and the temporal locality secondarily, and vice versa, respectively. However, the literature lacks models that simultaneously encode both spatial and temporal localities utilizing only vertex partitioning for further improving the performance of SpMV on shared-memory architectures. To fill this gap, we propose a novel spatiotemporal hypergraph model that leads to a one-phase spatiotemporal reordering method encoding both types of locality simultaneously. We also propose a framework for spatiotemporal methods that encodes both types of locality in two dependent phases and in two separate phases. The validity of the proposed spatiotemporal models and methods is tested on a wide range of sparse matrices, with experiments performed on both a 60-core Intel Xeon Phi processor and a Xeon processor. Results show the validity of the methods, almost doubling the Gflop/s performance by enhancing data locality in parallel SpMV operations.
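Relating to the stratified SGD item above: the abstract's Hold-and-Combine strategy bounds the number of messages a processor sends by combining its communication. The helper below is only a generic message-combining sketch under assumed data layouts; it packs all factor-row updates destined to the same processor into a single message, and it is not the paper's Hold-and-Combine schedule nor a reproduction of its O(K lg K) bound.

```python
from collections import defaultdict

# Generic message-combining sketch (an illustrative assumption, not the
# paper's Hold-and-Combine schedule): instead of sending one message per
# updated factor row, all rows destined to the same processor are packed
# into one message, so a processor issues at most one send per destination
# per flush rather than one per row.

def combine_outgoing(updates):
    """updates: iterable of (dest_processor, row_id, row_data) tuples."""
    outbox = defaultdict(dict)
    for dest, row_id, row in updates:
        outbox[dest][row_id] = row   # a later update to the same row overwrites the earlier one
    # One combined message per destination processor.
    return {dest: sorted(rows.items()) for dest, rows in outbox.items()}

msgs = combine_outgoing([(1, 10, [0.1]), (1, 11, [0.2]), (2, 10, [0.1]), (1, 10, [0.3])])
print(len(msgs), msgs[1])   # 2 combined messages instead of 4 individual sends
```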
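For the SpMV reordering item above, the following minimal CSR example shows the mechanics that row/column reordering methods act on: a permutation is applied to the sparse matrix once, and SpMV then traverses the reordered structure with better spatial/temporal reuse of the input and output vectors. The permutations, matrix and helper names here are arbitrary placeholders; the paper derives the orderings from its spatiotemporal hypergraph models, not from anything shown below.

```python
# Minimal CSR SpMV with a row/column reordering applied up front. Permuting
# rows changes which y entries (and which x entries, per row) are touched
# close together in time; permuting columns changes the access pattern into
# x. The permutations below are arbitrary placeholders.

def permute_csr(row_ptr, col_idx, vals, row_perm, col_perm):
    """Return CSR arrays of the matrix with rows and columns reordered."""
    inv_col = {old: new for new, old in enumerate(col_perm)}
    new_ptr, new_idx, new_vals = [0], [], []
    for old_r in row_perm:                       # new row order
        for k in range(row_ptr[old_r], row_ptr[old_r + 1]):
            new_idx.append(inv_col[col_idx[k]])  # relabel column indices
            new_vals.append(vals[k])
        new_ptr.append(len(new_idx))
    return new_ptr, new_idx, new_vals

def spmv(row_ptr, col_idx, vals, x):
    y = [0.0] * (len(row_ptr) - 1)
    for r in range(len(y)):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += vals[k] * x[col_idx[k]]
    return y

ptr, idx, val = [0, 2, 3, 5], [0, 2, 1, 0, 2], [1.0, 2.0, 3.0, 4.0, 5.0]   # 3x3 example
rp, cp = [2, 0, 1], [1, 2, 0]                                              # arbitrary permutations
pptr, pidx, pval = permute_csr(ptr, idx, val, rp, cp)
x = [1.0, 1.0, 1.0]
print(spmv(ptr, idx, val, x), spmv(pptr, pidx, pval, [x[c] for c in cp]))  # same values, reordered rows
```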
Item Open Access
True load balancing for Matricized Tensor Times Khatri-Rao Product (IEEE, 2021-01-22)
Abubaker, Nabil; Aykanat, Cevdet; Acer, S.
MTTKRP is the bottleneck operation in algorithms used to compute the CP tensor decomposition. For sparse tensors, utilizing the compressed sparse fiber (CSF) storage format and CSF-oriented MTTKRP algorithms is important for both memory and computational efficiency on distributed-memory architectures. Existing intelligent tensor partitioning models assume the computational cost of MTTKRP to be proportional to the total number of nonzeros in the tensor. However, this is not the case for CSF-oriented MTTKRP on distributed-memory architectures. We outline two deficiencies of nonzero-based intelligent partitioning models when CSF-oriented MTTKRP operations are performed locally: failure to encode processors' computational loads, and an increase in total computation due to fiber fragmentation. We focus on the existing fine-grain hypergraph model and propose a novel vertex weighting scheme that enables this model to encode the correct computational loads of processors. We also propose augmenting the fine-grain model with fiber nets to reduce the increase in total computational load by minimizing fiber fragmentation. In this way, the proposed model encodes minimizing the load of the bottleneck processor. Parallel experiments with real-world sparse tensors on up to 1024 processors prove the validity of the outlined deficiencies and demonstrate the merit of our proposed improvements in terms of parallel runtimes.
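To illustrate the deficiency outlined in the MTTKRP item above, the toy code below counts, for a given nonzero-to-processor assignment, both the nonzeros and the distinct fibers each processor owns; splitting a fiber across processors raises the total fiber count even though the total nonzero count stays the same. The coordinate layout, the fiber definition, and this cost proxy are simplifying assumptions for illustration, not the paper's weighting scheme.

```python
from collections import defaultdict

# Toy illustration: for CSF-oriented MTTKRP, per-processor work depends on
# the number of fibers a processor owns as well as its nonzeros, and cutting
# a fiber across processors (fiber fragmentation) increases the total fiber
# count. Here a mode-2 fiber of a 3rd-order tensor is identified by its
# (i, j) index pair -- a simplification, not the paper's exact cost model.

def partition_stats(nonzeros, owner):
    """nonzeros: list of (i, j, k) coordinates; owner: matching list of processor ids."""
    nnz = defaultdict(int)
    fibers = defaultdict(set)
    for (i, j, k), p in zip(nonzeros, owner):
        nnz[p] += 1
        fibers[p].add((i, j))        # distinct mode-2 fibers owned by processor p
    return {p: (nnz[p], len(fibers[p])) for p in nnz}

nz = [(0, 0, 0), (0, 0, 1), (0, 0, 2), (1, 1, 0)]
print(partition_stats(nz, [0, 0, 0, 1]))   # fiber (0,0) kept whole:  {0: (3, 1), 1: (1, 1)}
print(partition_stats(nz, [0, 1, 0, 1]))   # fiber (0,0) fragmented: {0: (2, 1), 1: (2, 2)}
```

The two assignments have identical nonzero counts per processor in total, yet the fragmented one carries an extra fiber, which is exactly the kind of cost a purely nonzero-based partitioning model cannot see.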