      Scalable unsupervised ML: Latency hiding in distributed sparse tensor decomposition

      Author(s)
      Abubaker, Nabil
      Karsavuran, M. Ozan
      Aykanat, Cevdet
      Date
      2022-11-01
      Source Title
      IEEE Transactions on Parallel and Distributed Systems (TPDS)
      Print ISSN
1045-9219
      Publisher
      IEEE Computer Society
      Volume
      33
      Issue
      11
      Pages
      3028 - 3040
      Language
      English
      Type
      Article
      Abstract
Latency overhead in distributed-memory parallel CPD-ALS scales with the number of processors, limiting the scalability of computing the CPD of large, irregularly sparse tensors. This overhead comes in the form of sparse reduce and expand operations performed on factor-matrix rows via point-to-point messages. We propose to hide the latency overhead by embedding all of the point-to-point messages incurred by the sparse reduce and expand into dense collective operations that already exist in CPD-ALS. The conventional parallel CPD-ALS algorithm is not amenable to this embedding, so we propose a computation/communication rearrangement that enables it. We embed the sparse expand and reduce into a hypercube-based ALL-REDUCE operation to limit the latency overhead to O(log2 K) for a K-processor system. The embedding comes at the cost of increased bandwidth overhead due to the multi-hop routing of factor-matrix rows during the embedded ALL-REDUCE. We propose an embedding scheme that exploits the properties of the expand/reduce operations to reduce this overhead. Furthermore, we propose a novel recursive bipartitioning framework that enables simultaneous hypergraph partitioning and subhypergraph-to-subhypercube mapping, yielding a subtensor-to-processor assignment that reduces the bandwidth overhead during the embedded ALL-REDUCE. We also propose a bin-packing-based algorithm for assigning factor-matrix rows to processors, aiming to reduce each processor's maximum send and receive volume during the embedded ALL-REDUCE. Experiments on up to 4096 processors show that the proposed framework scales significantly better than the state-of-the-art point-to-point method.
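As a concrete illustration of the O(log2 K) latency bound, the minimal sketch below simulates a hypercube-based ALL-REDUCE over K = 2^d processors via recursive doubling: in round i, each rank exchanges its partial result with the neighbor whose rank differs in bit i, so the reduction completes in log2 K message rounds. This is only a schematic of the dense collective skeleton, under assumed names (hypercube_allreduce, a sum reduction); it is not the paper's implementation and omits the paper's key step of embedding the sparse expand/reduce messages into these rounds.

    # Minimal sketch (not the paper's code): hypercube ALL-REDUCE via
    # recursive doubling, simulated in plain Python with one list entry
    # per rank and a sum as the reduction operator.
    def hypercube_allreduce(values):
        K = len(values)
        assert K & (K - 1) == 0, "K must be a power of two"
        partial = list(values)
        for i in range(K.bit_length() - 1):      # log2(K) communication rounds
            nxt = list(partial)
            for rank in range(K):
                peer = rank ^ (1 << i)           # neighbor across hypercube dimension i
                nxt[rank] = partial[rank] + partial[peer]   # exchange and reduce
            partial = nxt
        return partial                           # every rank now holds the global sum

    if __name__ == "__main__":
        print(hypercube_allreduce([1, 2, 3, 4, 5, 6, 7, 8]))   # -> [36] * 8

In the method the abstract describes, the factor-matrix rows of the sparse expand/reduce are piggybacked onto the data exchanged in these log2 K rounds, which removes the separate point-to-point latency at the price of multi-hop routing of those rows.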
      Keywords
      Sparse tensor
      Tensor decomposition
      CANDECOMP/PARAFAC
      Canonical polyadic decomposition
      Latency hiding
      Embedded communication
      Communication cost
      Concurrent communication
      Recursive bipartitioning
      Hypergraph partitioning
      Permalink
      http://hdl.handle.net/11693/111701
      Published Version (Please cite this version)
      https://dx.doi.org/10.1109/TPDS.2021.3128827
      Collections
• Department of Computer Engineering