Scalable unsupervised ML: Latency hiding in distributed sparse tensor decomposition

Date

2022-11-01

Source Title

IEEE Transactions on Parallel and Distributed Systems (TPDS)

Print ISSN

1045-9219

Publisher

IEEE Computer Society

Volume

33

Issue

11

Pages

3028–3040

Language

English

Abstract

Latency overhead in distributed-memory parallel CPD-ALS scales with the number of processors, limiting the scalability of computing the CPD of large, irregularly sparse tensors. This overhead comes in the form of sparse reduce and expand operations performed on factor-matrix rows via point-to-point messages. We propose to hide the latency overhead by embedding all of the point-to-point messages incurred by the sparse reduce and expand into dense collective operations that already exist in CPD-ALS. The conventional parallel CPD-ALS algorithm is not amenable to embedding, so we propose a computation/communication rearrangement to enable it. We embed the sparse expand and reduce into a hypercube-based ALL-REDUCE operation to limit the latency overhead to O(log₂K) for a K-processor system. The embedding comes at the cost of increased bandwidth overhead due to the multi-hop routing of factor-matrix rows during the embedded-ALL-REDUCE. We propose an embedding scheme that takes advantage of the expand/reduce properties to reduce this overhead. Furthermore, we propose a novel recursive bipartitioning framework that enables simultaneous hypergraph partitioning and subhypergraph-to-subhypercube mapping to achieve a subtensor-to-processor assignment that reduces the bandwidth overhead during the embedded-ALL-REDUCE. We also propose a bin-packing-based algorithm for factor-matrix-row-to-processor assignment aimed at reducing processors' maximum send and receive volumes during the embedded-ALL-REDUCE. Experiments on up to 4096 processors show that the proposed framework scales significantly better than the state-of-the-art point-to-point method.

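The O(log₂K) latency bound cited in the abstract comes from the recursive-doubling structure of a hypercube ALL-REDUCE. The sketch below is only an illustration of that generic pattern (C with MPI, reducing a small dummy vector), not the paper's embedded-ALL-REDUCE, which additionally carries sparse factor-matrix rows; it shows how, for K = 2^d processors, every processor finishes the reduction after exchanging messages with exactly d = log₂K partners.

```c
/* Minimal sketch (not the authors' implementation): a hypercube-based
 * ALL-REDUCE over K = 2^d MPI ranks, illustrating why the latency
 * overhead is bounded by O(log2 K) message exchanges per processor.
 * Assumes the number of ranks is a power of two. */
#include <mpi.h>
#include <stdio.h>

#define N 4  /* length of the local vector being reduced (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local[N], recv[N];
    for (int i = 0; i < N; ++i)
        local[i] = (double)rank;  /* dummy data: each rank contributes its id */

    /* One exchange per hypercube dimension: log2(size) steps in total.
     * At step d, each rank talks to the partner whose rank differs in bit d. */
    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;
        MPI_Sendrecv(local, N, MPI_DOUBLE, partner, 0,
                     recv,  N, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < N; ++i)
            local[i] += recv[i];  /* reduce (sum) with the partner's data */
    }

    if (rank == 0)
        printf("reduced[0] = %g (expected %g)\n",
               local[0], (double)(size * (size - 1)) / 2.0);

    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and run with, e.g., mpirun -np 8, each rank performs only log₂K send/receive pairs, so the latency term stops growing linearly with the processor count; the paper's contribution is to piggyback the sparse expand/reduce messages onto exactly this kind of collective while controlling the resulting multi-hop bandwidth cost.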