Scalable unsupervised ML: Latency hiding in distributed sparse tensor decomposition

Abubaker, Nabil; Karsavuran, M. Ozan; Aykanat, Cevdet

buir.contributor.author	Abubaker, Nabil
buir.contributor.author	Karsavuran, M. Ozan
buir.contributor.author	Aykanat, Cevdet
buir.contributor.orcid	Abubaker, Nabil|0000-0002-5060-3059
buir.contributor.orcid	Karsavuran, M. Ozan|0000-0002-0298-3034
buir.contributor.orcid	Aykanat, Cevdet|0000-0002-4559-1321
dc.citation.epage	3040	en_US
dc.citation.issueNumber	11	en_US
dc.citation.spage	3028	en_US
dc.citation.volumeNumber	33	en_US
dc.contributor.author	Abubaker, Nabil
dc.contributor.author	Karsavuran, M. Ozan
dc.contributor.author	Aykanat, Cevdet
dc.date.accessioned	2023-02-24T14:17:58Z
dc.date.available	2023-02-24T14:17:58Z
dc.date.issued	2022-11-01
dc.department	Department of Computer Engineering	en_US
dc.description.abstract	Latency overhead in distributed-memory parallel CPD-ALS scales with the number of processors, limiting the scalability of computing CPD of large irregularly sparse tensors. This overhead comes in the form of sparse reduce and expand operations performed on factor-matrix rows via point-to-point messages. We propose to hide the latency overhead through embedding all of the point-to-point messages incurred by the sparse reduce and expand into dense collective operations which already exist in the CPD-ALS. The conventional parallel CPD-ALS algorithm is not amenable for embedding so we propose a computation/communication rearrangement to enable the embedding. We embed the sparse expand and reduce into a hypercube-based ALL-REDUCE operation to limit the latency overhead to O(log_2 K) for a K-processor system. The embedding comes with the cost of increased bandwidth overhead due to the multi-hop routing of factor-matrix rows during the embedded-ALL-REDUCE. We propose an embedding scheme that takes advantage of the expand/reduce properties to reduce this overhead. Furthermore, we propose a novel recursive bipartitioning framework that enables simultaneous hypergraph partitioning and subhypergraph-to-subhypercube mapping to achieve subtensor-to-processor assignment with the objective of reducing the bandwidth overhead during the embedded-ALL-REDUCE. We also propose a bin-packing-based algorithm for factor-matrix row to processor assignment aiming at reducing processors’ maximum send and receive volumes during the embedded-ALL-REDUCE. Experiments on up to 4096 processors show that the proposed framework scales significantly better than the state-of-the-art point-to-point method.	en_US
dc.identifier.doi	10.1109/TPDS.2021.3128827	en_US
dc.identifier.issn	10459219	en_US
dc.identifier.uri	http://hdl.handle.net/11693/111701	en_US
dc.language.iso	English	en_US
dc.publisher	IEEE Computer Society	en_US
dc.relation.isversionof	https://dx.doi.org/10.1109/TPDS.2021.3128827	en_US
dc.source.title	IEEE Transactions on Parallel and Distributed Systems (TPDS)	en_US
dc.subject	Sparse tensor	en_US
dc.subject	Tensor decomposition	en_US
dc.subject	CANDECOMP/PARAFAC	en_US
dc.subject	Canonical polyadic decomposition	en_US
dc.subject	Latency hiding	en_US
dc.subject	Embedded communication	en_US
dc.subject	Communication cost	en_US
dc.subject	Concurrent communication	en_US
dc.subject	Recursive bipartitioning	en_US
dc.subject	Hypergraph partitioning	en_US
dc.title	Scalable unsupervised ML: Latency hiding in distributed sparse tensor decomposition	en_US
dc.type	Article	en_US

Files

Original bundle

Name: Scalable_Unsupervised_ML_Latency_Hiding_in_Distributed_Sparse_Tensor_Decomposition.pdf
Size: 901.54 KB
Format: Adobe Portable Document Format

Collections

Scholarly Publications - Computer Engineering