Browsing by Author "Karsavuran, M. Ozan"
Now showing 1 - 5 of 5
Item Open Access: Partitioning models for general medium-grain parallel sparse tensor decomposition (IEEE, 2021)
Karsavuran, M. Ozan; Acer, S.; Aykanat, Cevdet
The focus of this article is the efficient parallelization of the canonical polyadic decomposition (CPD) algorithm, using the alternating least squares (ALS) method, for sparse tensors on distributed-memory architectures. We propose a hypergraph model for general medium-grain partitioning which does not enforce any topological constraint on the partitioning. The proposed model is based on splitting the given tensor into nonzero-disjoint component tensors. A mode-dependent coarse-grain hypergraph is then constructed for each component tensor. A net amalgamation operation is proposed to form a composite medium-grain hypergraph from these mode-dependent coarse-grain hypergraphs so that the minimization of communication volume is correctly encapsulated. We propose a heuristic which splits the nonzeros of dense slices to obtain sparse slices in the component tensors; in this way we partially attain slice coherency at the (sub)slice level, since partitioning is performed on (sub)slices instead of individual nonzeros. We also utilize the well-known recursive-bipartitioning framework to improve the quality of the splitting heuristic. Finally, we propose a medium-grain tripartite graph model with the aim of faster partitioning, at the expense of an increase in total communication volume. Parallel experiments conducted on 10 real-world tensors on up to 1024 processors confirm the validity of the proposed hypergraph and graph models.

Item Open Access: Reduce operations: send volume balancing while minimizing latency (IEEE, 2020)
Karsavuran, M. Ozan; Acer, S.; Aykanat, Cevdet
The communication hypergraph model was proposed in a two-phase setting for encapsulating multiple communication cost metrics (bandwidth and latency), which are proven to be important in parallelizing irregular applications. In the first phase, computational-task-to-processor assignment is performed with the objective of minimizing total volume while maintaining computational load balance. In the second phase, communication-task-to-processor assignment is performed with the objective of minimizing the total number of messages while maintaining communication-volume balance. The reduce-communication hypergraph model, however, fails to correctly encapsulate send-volume balancing. We propose a novel vertex weighting scheme that enables part weights to correctly encode the send-volume loads of processors for send-volume balancing. The model also suffers from an increase in total communication volume during partitioning. To mitigate this increase, we propose a method that utilizes the recursive bipartitioning framework and refines each bipartition by vertex swaps. For performance evaluation, we consider column-parallel SpMV, which is one of the most widely known applications in which the reduce-task assignment problem arises. Extensive experiments on 313 matrices show that, compared to the existing model, the proposed models achieve considerable improvements in all communication cost metrics. These improvements lead to an average decrease of 30 percent in parallel SpMV time on 512 processors for 70 matrices with high irregularity.

Item Open Access: Scalable unsupervised ML: Latency hiding in distributed sparse tensor decomposition (IEEE Computer Society, 2022-11-01)
Abubaker, Nabil; Karsavuran, M. Ozan; Aykanat, Cevdet
Latency overhead in distributed-memory parallel CPD-ALS scales with the number of processors, limiting the scalability of computing the CPD of large, irregularly sparse tensors. This overhead comes in the form of sparse reduce and expand operations performed on factor-matrix rows via point-to-point messages. We propose to hide the latency overhead by embedding all of the point-to-point messages incurred by the sparse reduce and expand operations into dense collective operations that already exist in CPD-ALS. The conventional parallel CPD-ALS algorithm is not amenable to this embedding, so we propose a computation/communication rearrangement to enable it. We embed the sparse expand and reduce operations into a hypercube-based ALL-REDUCE operation to limit the latency overhead to O(log2 K) for a K-processor system. The embedding comes at the cost of increased bandwidth overhead due to the multi-hop routing of factor-matrix rows during the embedded ALL-REDUCE. We propose an embedding scheme that takes advantage of the expand/reduce properties to reduce this overhead. Furthermore, we propose a novel recursive bipartitioning framework that enables simultaneous hypergraph partitioning and subhypergraph-to-subhypercube mapping to achieve subtensor-to-processor assignment with the objective of reducing the bandwidth overhead during the embedded ALL-REDUCE. We also propose a bin-packing-based algorithm for assigning factor-matrix rows to processors, aiming to reduce processors' maximum send and receive volumes during the embedded ALL-REDUCE. Experiments on up to 4096 processors show that the proposed framework scales significantly better than the state-of-the-art point-to-point method.
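The O(log2 K) latency bound in the abstract above reflects the structure of a hypercube (recursive-doubling) ALL-REDUCE, in which each of the K processors exchanges messages with exactly log2 K neighbors. The minimal MPI sketch below (in C) shows only that exchange pattern on a dense buffer; it is not the authors' embedded expand/reduce scheme, and the power-of-two processor count, buffer length, and function name hypercube_allreduce are assumptions for illustration.

/* Hypercube (recursive-doubling) ALL-REDUCE sketch: log2(K) point-to-point
 * exchange steps, one per hypercube dimension. Assumes a power-of-two
 * number of processors and a plain dense buffer; the sparse expand/reduce
 * embedding and row-to-processor assignment of the paper are omitted. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void hypercube_allreduce(double *buf, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);                     /* assumed power of 2 */

    double *tmp = malloc(n * sizeof(double));
    for (int mask = 1; mask < size; mask <<= 1) {   /* log2(size) steps */
        int partner = rank ^ mask;                  /* neighbor along one dimension */
        MPI_Sendrecv(buf, n, MPI_DOUBLE, partner, 0,
                     tmp, n, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < n; i++)                 /* local reduction (sum) */
            buf[i] += tmp[i];
    }
    free(tmp);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double x[4] = { rank, rank, rank, rank };       /* toy factor-matrix row block */
    hypercube_allreduce(x, 4, MPI_COMM_WORLD);
    if (rank == 0)
        printf("reduced x[0] = %g\n", x[0]);

    MPI_Finalize();
    return 0;
}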
Item Open Access: Simultaneous computational and data load balancing in distributed-memory setting (SIAM, 2022)
Çeliktuğ, Mestan Fırat; Karsavuran, M. Ozan; Acer, Seher; Aykanat, Cevdet; De Sterck, Hans
Several successful partitioning models and methods have been proposed and used for computational load balancing of irregularly sparse applications in a distributed-memory setting. However, the literature lacks partitioning models and methods that encode both computational and data load balancing. In this article, we try to close this gap by proposing two hypergraph partitioning (HP) models which simultaneously encode computational and data load balancing. Both models utilize a two-constraint formulation, where the first constraint encodes the computational loads and the second constraint encodes the data loads. In the first model, we introduce explicit data vertices for encoding data load, and we replicate those data vertices at each recursive bipartitioning (RB) step for encoding data replication. In the second model, we introduce a data-weight distribution scheme for encoding data load, and we update those weights at each RB step. A nice property of both proposed models is that they do not necessitate developing a new partitioner from scratch: both can easily be implemented by invoking any HP tool that supports multiconstraint partitioning as a two-way partitioner at each RB step. The validity of the proposed models is tested on two widely used irregularly sparse applications: parallel mesh simulations and parallel sparse matrix-sparse matrix multiplication. Both proposed models achieve significant improvements over a baseline model.
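In a two-constraint formulation of the kind described above, every vertex carries a weight pair (computational load, data load), and a bipartition is acceptable only if both aggregate weights are balanced in each part. The plain C sketch below shows such a two-constraint balance test for a single bipartition step; the vertex weights, the imbalance tolerance eps, and the function name two_constraint_balanced are illustrative assumptions, and the cut-size objective and the actual HP tool invocation are omitted.

/* Two-constraint balance check: a bipartition is balanced only if BOTH
 * the computational and the data part weights stay within (1 + eps) of
 * the per-part average. Weights and eps below are illustrative. */
#include <stdio.h>

/* Returns 1 if part[] (0 or 1 per vertex) satisfies both constraints. */
static int two_constraint_balanced(int nvtx, const double *comp,
                                   const double *data, const int *part,
                                   double eps)
{
    double pw[2][2] = { {0, 0}, {0, 0} };   /* pw[constraint][part] */
    double tot[2]   = { 0, 0 };

    for (int v = 0; v < nvtx; v++) {
        pw[0][part[v]] += comp[v];
        pw[1][part[v]] += data[v];
        tot[0] += comp[v];
        tot[1] += data[v];
    }
    for (int c = 0; c < 2; c++) {
        double allowed = (tot[c] / 2.0) * (1.0 + eps);
        if (pw[c][0] > allowed || pw[c][1] > allowed)
            return 0;                       /* constraint c violated */
    }
    return 1;
}

int main(void)
{
    /* four vertices with (computational, data) weight pairs */
    double comp[] = { 4, 3, 2, 5 };
    double data[] = { 1, 6, 2, 3 };
    int    part[] = { 0, 1, 1, 0 };          /* a candidate bipartition */

    printf("balanced: %d\n",
           two_constraint_balanced(4, comp, data, part, 0.10));
    return 0;
}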
Item Open Access: Stochastic Gradient Descent for matrix completion: hybrid parallelization on shared- and distributed-memory systems (Elsevier BV, 2024-01-11)
Büyükkaya, Kemal; Karsavuran, M. Ozan; Aykanat, Cevdet
The purpose of this study is to investigate the hybrid parallelization of the Stochastic Gradient Descent (SGD) algorithm for solving the matrix completion problem on a high-performance computing platform. We propose a hybrid parallel, decentralized SGD framework with asynchronous inter-process communication and a novel flexible partitioning scheme to attain scalability up to hundreds of processors. We utilize the Message Passing Interface (MPI) for inter-node communication and POSIX threads for intra-node parallelism. We tested our method on various real-world benchmark datasets. Experimental results on a hybrid parallel architecture show that, compared to the state-of-the-art, the proposed algorithm achieves 6x higher throughput on sparse datasets, while achieving comparable throughput on relatively dense datasets.
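For reference, the factor-update rule at the core of SGD-based matrix completion can be sketched compactly. The single-threaded C sketch below applies the regularized squared-error update for one observed entry; the rank F, learning rate, and regularization constant are assumptions for illustration, and the paper's hybrid MPI/POSIX-threads parallelization, asynchronous communication, and partitioning scheme are not represented.

/* SGD update for factorization-based matrix completion: for an observed
 * entry (u, i, r) and rank-F factor rows w = W[u][.] and h = H[i][.],
 * one step moves both rows along the gradient of the regularized
 * squared error. Hyperparameters below are toy values. */
#include <stdio.h>

#define F 2  /* factorization rank (assumption for the toy example) */

static void sgd_update(double *w, double *h, double r,
                       double lr, double lambda)
{
    double err = r;
    for (int f = 0; f < F; f++)
        err -= w[f] * h[f];                 /* err = r - <w, h> */

    for (int f = 0; f < F; f++) {
        double wf = w[f];                   /* keep old w[f] for h's update */
        w[f] += lr * (err * h[f] - lambda * w[f]);
        h[f] += lr * (err * wf   - lambda * h[f]);
    }
}

int main(void)
{
    double w[F] = { 0.1, 0.1 };             /* row of user-factor matrix W */
    double h[F] = { 0.1, 0.1 };             /* row of item-factor matrix H */

    /* repeatedly apply the update for a single observed rating r = 4.0 */
    for (int it = 0; it < 1000; it++)
        sgd_update(w, h, 4.0, 0.05, 0.01);

    printf("predicted rating: %.3f\n", w[0] * h[0] + w[1] * h[1]);
    return 0;
}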