Browsing by Subject "Graph partitioning"

Open Access
Adaptive decomposition and remapping algorithms for object-space-parallel direct volume rendering of unstructured grids
(Academic Press, 2007-01) Aykanat, Cevdet; Cambazoglu, B. B.; Findik, F.; Kurc, T.
Object space (OS) parallelization of an efficient direct volume rendering algorithm for unstructured grids on distributed-memory architectures is investigated. The adaptive OS decomposition problem is modeled as a graph partitioning (GP) problem using an efficient and highly accurate estimation scheme for view-dependent node and edge weighting. In the proposed model, minimizing the cutsize corresponds to minimizing the parallelization overhead due to the data communication and redundant computation/storage while maintaining the GP balance constraint corresponds to maintaining the computational load balance in parallel rendering. A GP-based, view-independent cell clustering scheme is introduced to induce more tractable view-dependent computational graphs for successive visualizations. As another contribution, a graph-theoretical remapping model is proposed as a solution to the general remapping problem and is used in minimization of the cell-data migration overhead. The remapping tool RM-MeTiS is developed by modifying the GP tool MeTiS and is used in partitioning the remapping graphs. Experiments are conducted using benchmark datasets on a 28-node PC cluster to evaluate the performance of the proposed models. © 2006 Elsevier Inc. All rights reserved.
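The two quantities this abstract pairs, cutsize and the balance constraint, can be illustrated with a minimal sketch. The toy graph, weights, and function names below are assumptions made for illustration and are not taken from the paper.

```python
def cutsize(edges, part):
    """Sum of the weights of edges whose endpoints lie in different parts."""
    return sum(w for u, v, w in edges if part[u] != part[v])


def imbalance(node_weight, part, k):
    """Maximum part load divided by the average part load, minus 1."""
    loads = [0.0] * k
    for v, w in node_weight.items():
        loads[part[v]] += w
    avg = sum(loads) / k
    return max(loads) / avg - 1.0


# Hypothetical toy instance: four cells on a path 0-1-2-3.
# Edge weights stand in for communication, node weights for rendering work.
edges = [(0, 1, 3), (1, 2, 1), (2, 3, 3)]
node_weight = {0: 2, 1: 2, 2: 2, 3: 2}
part = {0: 0, 1: 0, 2: 1, 3: 1}

print(cutsize(edges, part))                # 1: only the light edge (1, 2) is cut
print(imbalance(node_weight, part, k=2))   # 0.0: both parts carry the same load
```

A partitioner in this setting searches for the assignment `part` that keeps `imbalance` below a tolerance while making `cutsize` as small as possible.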
Open Access
Cascade-aware partitioning of large graph databases
(Springer, 2019) Demirci, Gündüz Vehbi; Ferhatosmanoğlu, H.; Aykanat, Cevdet
Graph partitioning is an essential task for scalable data management and analysis. The current partitioning methods utilize the structure of the graph, and the query log if available. Some queries performed on the database may trigger further operations. For example, the query workload of a social network application may contain re-sharing operations in the form of cascades. It is beneficial to include the potential cascades in the graph partitioning objectives. In this paper, we introduce the problem of cascade-aware graph partitioning that aims to minimize the overall cost of communication among parts/servers during cascade processes. We develop a randomized solution that estimates the underlying cascades, and use it as an input for partitioning of large-scale graphs. Experiments on 17 real social networks demonstrate the effectiveness of the proposed solution in terms of the partitioning objectives.
Open Access
An effective model to decompose linear programs for parallel solution
(Springer, 1996-08) Pınar, Ali; Aykanat, Cevdet
Although inherent parallelism in the solution of block angular Linear Programming (LP) problems has been exploited in many research works, the literature that addresses decomposing constraint matrices into block angular form for parallel solution is very rare and recent. We have previously proposed hypergraph models, which reduced the problem to the hypergraph partitioning problem. However, the quality of the results reported was limited by the hypergraph partitioning tools we used. Very recently, multilevel graph partitioning heuristics have been proposed, leading to very successful graph partitioning tools: Chaco and MeTiS. In this paper, we propose an effective graph model to decompose matrices into block angular form, which reduces the problem to the well-known graph partitioning by vertex separator problem. We have tested the validity of our proposed model with various LP problems selected from NETLIB and other sources. The results are very attractive both in terms of solution quality and running times. © Springer-Verlag Berlin Heidelberg 1996.
Open Access
Exploiting locality in sparse matrix-matrix multiplication on many-core architectures
(IEEE Computer Society, 2017) Akbudak, K.; Aykanat, Cevdet
Exploiting spatial and temporal localities is investigated for efficient row-by-row parallelization of general sparse matrix-matrix multiplication (SpGEMM) operation of the form C=AB on many-core architectures. Hypergraph and bipartite graph models are proposed for 1D rowwise partitioning of matrix A to evenly partition the work across threads with the objective of reducing the number of B-matrix words to be transferred from the memory and between different caches. A hypergraph model is proposed for B-matrix column reordering to exploit spatial locality in accessing entries of thread-private temporary arrays, which are used to accumulate results for C-matrix rows. A similarity graph model is proposed for B-matrix row reordering to increase temporal reuse of these accumulation array entries. The proposed models and methods are tested on a wide range of sparse matrices from real applications and the experiments were carried out on a 60-core Intel Xeon Phi processor, as well as a two-socket Xeon processor. Results show the validity of the models and methods proposed for enhancing the locality in parallel SpGEMM operations. © 1990-2012 IEEE.
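The row-by-row formulation mentioned in this abstract can be sketched as follows. This is a sequential illustration only, assuming a dictionary-of-rows storage format chosen here for brevity; the per-row dictionary `acc` stands in for the thread-private accumulation array, and none of the names come from the paper.

```python
def spgemm_row_by_row(A, B):
    """Compute C = A * B, with matrices stored as {row: {col: value}}.

    Row i of C is the sum of the rows of B selected by the nonzeros of
    row i of A, each scaled by the corresponding A entry.
    """
    C = {}
    for i, a_row in A.items():
        acc = {}  # accumulator for row i of C
        for k, a_ik in a_row.items():
            # Row k of B is read once for every nonzero in column k of A,
            # which is where B-matrix locality matters.
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C


# Hypothetical 2x3 by 3x2 example.
A = {0: {0: 1, 2: 2}, 1: {1: 3}}
B = {0: {0: 4}, 1: {0: 5, 1: 6}, 2: {1: 7}}
C = spgemm_row_by_row(A, B)
print(C)  # {0: {0: 4, 1: 14}, 1: {0: 15, 1: 18}}
```

In a parallel setting, each thread would own a set of A rows; which rows share a thread determines how often the same B rows are fetched, which is the quantity the rowwise partitioning models target.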
Open Access
Fast shared-memory streaming multilevel graph partitioning
(Elsevier, 2020-09-12) Jafari, N.; Selvitopi, O.; Aykanat, Cevdet
A fast parallel graph partitioner can benefit many applications by reducing data transfers. The online methods for partitioning graphs have to be fast and they often rely on simple one-pass streaming algorithms, while the offline methods for partitioning graphs contain more involved algorithms and the most successful methods in this category belong to the multilevel approaches. In this work, we assess the feasibility of using streaming graph partitioning algorithms within the multilevel framework. Our end goal is to come up with a fast parallel offline multilevel partitioner that can produce competitive cutsize quality. We rely on a simple but fast and flexible streaming algorithm throughout the entire multilevel framework. This streaming algorithm serves multiple purposes in the partitioning process: a clustering algorithm in the coarsening, an effective algorithm for the initial partitioning, and a fast refinement algorithm in the uncoarsening. Its simple nature also lends itself easily for parallelization. The experiments on various graphs show that our approach is on the average up to 5.1x faster than the multi-threaded MeTiS, which comes at the expense of only 2x worse cutsize.
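For readers unfamiliar with one-pass streaming partitioning, the sketch below shows a linear deterministic greedy rule, a well-known heuristic of this family. The abstract does not name the streaming algorithm it uses, so this is an illustrative stand-in, and the function name, capacity parameter, and tie-breaking rule are assumptions.

```python
def ldg_stream(adjacency, k, capacity):
    """One-pass streaming partitioning with a linear deterministic greedy rule.

    Each arriving vertex goes to the part that already holds most of its
    neighbours, discounted by how full that part is. Assumes that
    k * capacity is at least the number of vertices.
    """
    part = {}
    sizes = [0] * k
    for v, nbrs in adjacency.items():  # vertices arrive in stream order
        best, best_key = None, None
        for p in range(k):
            if sizes[p] >= capacity:
                continue  # part is full
            placed = sum(1 for u in nbrs if part.get(u) == p)
            score = placed * (1.0 - sizes[p] / capacity)
            key = (score, -sizes[p])  # ties go to the lighter part
            if best_key is None or key > best_key:
                best, best_key = p, key
        part[v] = best
        sizes[best] += 1
    return part


# Hypothetical input: two triangles joined by the single edge (2, 3).
adjacency = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
print(ldg_stream(adjacency, k=2, capacity=3))
# {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

Within a multilevel framework, a rule of this kind could be reused with different inputs at each stage (clusters during coarsening, parts during initial partitioning and refinement), which is the role the abstract describes for its streaming algorithm.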
Open Access
Forecasting flight delays using clustered models based on airport networks
(IEEE, 2021) Güvercin, Mehmet; Ferhatosmanoğlu, N.; Gedik, Buğra
Estimating flight delays is important for airlines, airports, and passengers, as the delays are among major costs in air transportation. Each delay may cause a further propagation of delays. Hence, the delay pattern of an airport and the location of the airport in the network can provide useful information for other airports. We address the problem of forecasting flight delays of an airport, utilizing the network information as well as the delay patterns of similar airports in the network. The proposed “Clustered Airport Modeling” (CAM) approach builds a representative time-series for each group of airports and fits a common model (e.g., REG-ARIMA) for each, using the network based features as regressors. The models are then applied individually to each airport data for predicting the airport’s flight delays. We also performed a network based analysis of the airports and identified the Betweenness Centrality (BC) score as an effective feature in forecasting the flight delays. The experiments on flight data over seven years using 305 US airports show that CAM provides accurate forecasts of flight delays.
Open Access
Graph and hypergraph partitioning
(1993) Daşdan, Ali
Graph and hypergraph partitioning have many important applications in various areas such as VLSI layout, mapping, and graph theory. For graph and hypergraph partitioning, there are very successful heuristics mainly based on Kernighan-Lin’s minimization technique. We propose two novel approaches for multiple-way graph and hypergraph partitioning. The proposed algorithms drastically outperform the best multiple-way partitioning algorithm both on randomly generated graph instances and on benchmark circuits. The proposed algorithms convey all the advantages of the algorithms based on Kernighan-Lin’s minimization technique such as their robustness. However, they do not convey many disadvantages of those algorithms such as their poor performance on sparse test cases. The proposed algorithms introduce very interesting ideas that are also applicable to the existing algorithms without very much effort.
Open Access
Hypergraph models for sparse matrix partitioning and reordering
(1999-11) Çatalyürek, Ümit Veysel
Graphs have been widely used to represent sparse matrices for various scientific applications including one-dimensional (1D) decomposition of sparse matrices for parallel sparse-matrix vector multiplication (SpMxV) and sparse matrix reordering for low fill factorization. The standard graph-partitioning based 1D decomposition of sparse matrices does not reflect the actual communication volume requirement for parallel SpMxV. We propose two computational hypergraph models which avoid this crucial deficiency of the graph model on 1D decomposition. The proposed models reduce the 1D decomposition problem to the well-known hypergraph partitioning problem. In the literature, there is a lack of 2D decomposition heuristics that directly minimize the communication requirements for parallel SpMxV computations. Three novel hypergraph models are introduced for 2D decomposition of sparse matrices for minimizing the communication volume requirement. The first hypergraph model is proposed for fine-grain 2D decomposition of the sparse matrices for parallel SpMxV. The second hypergraph model for 2D decomposition is proposed to produce jagged-like decomposition of the sparse matrix. The checkerboard decomposition based parallel matrix-vector multiplication algorithms are widely encountered in the literature. However, only the load balancing problem is addressed in those works. Here, we propose a new hypergraph model which aims at minimizing the communication volume while maintaining the load balance among the processors for checkerboard decomposition, as the third model for 2D decomposition. The proposed model reduces the decomposition problem to the multi-constraint hypergraph partitioning problem. The notion of multi-constraint partitioning has recently become popular in graph partitioning. We applied the multi-constraint partitioning to the hypergraph partitioning problem for solving checkerboard partitioning.
Graph partitioning by vertex separator (GPVS) is widely used for nested dissection based low fill ordering of sparse matrices for direct solution of linear systems. In this work, we also show that the GPVS problem can be formulated as hypergraph partitioning. We exploit this finding to develop a novel hypergraph partitioning-based nested dissection ordering. The recently proposed successful multilevel framework is exploited to develop a multilevel hypergraph partitioning tool PaToH for the experimental verification of our proposed hypergraph models. Experimental results on a wide range of realistic sparse test matrices confirm the validity of the proposed hypergraph models. In terms of communication volume, the proposed hypergraph models produce 30% and 59% better decompositions than the graph model in 1D and 2D decompositions of sparse matrices for parallel SpMxV computations, respectively. The proposed hypergraph partitioning-based nested dissection produces 25% to 45% better orderings than the widely used multiple minimum degree ordering in the ordering of various test matrices arising from different applications.
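The claim that the graph model does not reflect the actual communication volume in 1D decomposition can be illustrated with a small sketch. It assumes a rowwise partition of a square matrix with a nonzero diagonal, with vector entry x_j stored alongside row j, and a column-net style hypergraph in which each column is a net connecting the rows that have a nonzero in it. The example matrix and function names are hypothetical.

```python
def volume_connectivity(nonzeros, row_part):
    """Words communicated in y = A x under a rowwise partition.

    Each column j contributes (number of parts holding a nonzero in
    column j) - 1, since x_j is sent once to every other part needing it.
    """
    parts_of_col = {}
    for i, j in nonzeros:
        parts_of_col.setdefault(j, set()).add(row_part[i])
    return sum(len(parts) - 1 for parts in parts_of_col.values())


def volume_edge_cut(nonzeros, row_part):
    """Graph-model estimate: one word per off-diagonal nonzero that is cut."""
    return sum(1 for i, j in nonzeros if i != j and row_part[i] != row_part[j])


# Hypothetical symmetric 4x4 pattern: row 0 is coupled to rows 1, 2, and 3.
nonzeros = [
    (0, 0), (1, 1), (2, 2), (3, 3),
    (0, 1), (1, 0), (0, 2), (2, 0), (0, 3), (3, 0),
]
row_part = {0: 0, 1: 0, 2: 1, 3: 1}

print(volume_connectivity(nonzeros, row_part))  # 3
print(volume_edge_cut(nonzeros, row_part))      # 4
```

The edge-cut figure counts x_0 twice because rows 2 and 3 both need it, although a single transfer to their shared part suffices; the connectivity-based count charges it once.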
Open Access
Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems
(Elsevier BV, 2016) Acer, S.; Selvitopi, O.; Aykanat, Cevdet
We propose a comprehensive and generic framework to minimize multiple and different volume-based communication cost metrics for sparse matrix dense matrix multiplication (SpMM). SpMM is an important kernel that finds application in computational linear algebra and big data analytics. On distributed memory systems, this kernel is usually characterized by its high communication volume requirements. Our approach targets irregularly sparse matrices and is based on both graph and hypergraph partitioning models that rely on the widely adopted recursive bipartitioning paradigm. The proposed models are lightweight, portable (can be realized using any graph and hypergraph partitioning tool) and can simultaneously optimize different cost metrics besides total volume, such as maximum send/receive volume, maximum sum of send and receive volumes, etc., in a single partitioning phase. They allow one to define and optimize as many custom volume-based metrics as desired through a flexible formulation. The experiments on a wide range of about a thousand matrices show that the proposed models drastically reduce the maximum communication volume compared to the standard partitioning models that only address the minimization of total volume. The improvements obtained on volume-based partition quality metrics using our models are validated with parallel SpMM as well as parallel multi-source BFS experiments on two large-scale systems. For parallel SpMM, compared to the standard partitioning models, our graph and hypergraph partitioning models respectively achieve reductions of 14% and 22% in runtime, on average. Compared to the state-of-the-art partitioner UMPa, our graph model is overall 14.5× faster and achieves an average improvement of 19% in the partition quality on instances that are bounded by maximum volume.
For parallel BFS, we show on graphs with more than a billion edges that the scalability can significantly be improved with our models compared to a recently proposed two-dimensional partitioning model.
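The volume-based metrics named in this abstract (total volume, maximum send volume, maximum receive volume, maximum sum of send and receive volumes) can be computed from a message list as in the sketch below. The message format and function name are assumptions for illustration; the sketch only evaluates the metrics for a given communication pattern and does not reproduce the partitioning models themselves.

```python
def volume_metrics(messages, num_procs):
    """Volume-based communication metrics for (sender, receiver, words) triples."""
    send = [0] * num_procs
    recv = [0] * num_procs
    for src, dst, words in messages:
        send[src] += words
        recv[dst] += words
    return {
        "total": sum(send),
        "max_send": max(send),
        "max_recv": max(recv),
        "max_send_recv": max(s + r for s, r in zip(send, recv)),
    }


# Hypothetical pattern on 3 processors in which processor 0 sends the most.
messages = [(0, 1, 5), (0, 2, 5), (1, 2, 1), (2, 0, 1)]
print(volume_metrics(messages, num_procs=3))
# {'total': 12, 'max_send': 10, 'max_recv': 6, 'max_send_recv': 11}
```

Two partitions with the same total volume can differ widely in the maximum metrics, which is why optimizing total volume alone may leave one processor as a bottleneck.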
Open Access
Improving the performance of 1D vertex parallel GNN training on distributed memory systems
(2024-07) Taşcı, Kutay
Graph Neural Networks (GNNs) are pivotal for analyzing data within graph-structured domains such as social media, biological networks, and recommendation systems. Despite their advantages, scaling GNN training to large datasets in distributed settings poses significant challenges due to the complex task of managing computation and communication costs. The objective of this work is to scale 1D vertex-parallel GNN training on distributed memory systems via (i) two-constraint partitioning formulation for better computational load balancing and (ii) overlapping communication with computation for reducing communication overhead. In the proposed two-constraint formulation, one constraint encodes the computational load balance during forward propagation, whereas the second constraint encodes the computational load balance during backward propagation. We propose three communication and computation overlapping methods that perform overlapping at three different levels. These methods were tested against traditional approaches using benchmark datasets, demonstrating improved training efficiency without altering the model structure. The outcomes indicate that multi-constraint graph partitioning and the integration of communication and computation overlapping schemes can significantly mitigate the challenges of distributed GNN training. The research concludes with recommendations for future work, including adapting these techniques to dynamic and more complex GNN architectures, promising further improvements in the efficiency and applicability of GNNs in real-world scenarios.
Open Access
Latency-centric models and methods for scaling sparse operations
(2016-07) Selvitopi, Oğuz
Parallelization of sparse kernels and operations on large-scale distributed memory systems remains a major challenge due to the ever-increasing scale of modern high performance computing systems and multiple conflicting factors that affect the parallel performance. The low computational density and high memory footprint of sparse operations add to these challenges by implying more stressed communication bottlenecks and make fast and efficient parallelization models and methods imperative for scalable performance. Sparse operations are usually performed with structures related to sparse matrices and matrices are partitioned prior to the execution for distributing computations among processors. Although the literature is rich in this aspect, it still lacks the techniques that embrace multiple factors affecting communication performance in a complete and just manner. In this thesis, we investigate models and methods for intelligent partitioning of sparse matrices that strive for achieving a more correct approximation of the communication performance. To improve the communication performance of parallel sparse operations, we mainly focus on reducing the latency bottlenecks, which stand as a major component in the overall communication cost. Besides these, our approaches consider already adopted communication cost metrics in the literature as well and aim to address as many cost metrics as possible. We propose one-phase and two-phase partitioning models to reduce the latency cost in one-dimensional (1D) and two-dimensional (2D) sparse matrix partitioning, respectively. The model for 1D partitioning relies on the commonly adopted recursive bipartitioning framework and it uses novel structures to capture the relations that incur latency. The models for 2D partitioning aim to improve the performance of solvers for nonsymmetric linear systems by using different partitions for the vectors in the solver and use that flexibility to exploit the latency cost.
Our findings indicate that the latency costs should definitely be considered in order to achieve scalable performance on distributed memory systems.
Open Access
A novel partitioning method for accelerating the block Cimmino algorithm
(Society for Industrial and Applied Mathematics Publications, 2018) Torun, S. F.; Manguoğlu, M.; Aykanat, Cevdet
We propose a novel block-row partitioning method in order to improve the convergence rate of the block Cimmino algorithm for solving general sparse linear systems of equations. The convergence rate of the block Cimmino algorithm depends on the orthogonality among the block rows obtained by the partitioning method. The proposed method takes numerical orthogonality among block rows into account by proposing a row inner-product graph model of the coefficient matrix. In the graph partitioning formulation defined on this graph model, the partitioning objective of minimizing the cutsize directly corresponds to minimizing the sum of interblock inner products between block rows thus leading to an improvement in the eigenvalue spectrum of the iteration matrix. This in turn leads to a significant reduction in the number of iterations required for convergence. Extensive experiments conducted on a large set of matrices confirm the validity of the proposed method against a state-of-the-art method.
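The row inner-product graph described in this abstract can be sketched as follows: vertices are matrix rows, and an edge joins two rows whose inner product is nonzero. Weighting each edge by the absolute value of the inner product is an assumption made here, as are the dense list storage, the toy matrix, and the function names.

```python
from itertools import combinations


def row_inner_product_graph(rows):
    """Return {(i, j): |a_i . a_j|} for every pair of non-orthogonal rows."""
    edges = {}
    for i, j in combinations(range(len(rows)), 2):
        w = abs(sum(a * b for a, b in zip(rows[i], rows[j])))
        if w > 0:
            edges[(i, j)] = w
    return edges


def interblock_weight(edges, block_of_row):
    """Cutsize: total inner-product weight between rows in different blocks."""
    return sum(w for (i, j), w in edges.items()
               if block_of_row[i] != block_of_row[j])


# Hypothetical 4x4 coefficient matrix.
A = [
    [1, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
edges = row_inner_product_graph(A)
print(edges)                                   # {(0, 1): 3, (0, 2): 1, (1, 2): 2, (2, 3): 1}
print(interblock_weight(edges, [0, 0, 1, 1]))  # 3
print(interblock_weight(edges, [0, 1, 0, 1]))  # 6
```

Grouping rows 0 and 1 into one block keeps their large inner product inside the block, so the block rows are closer to mutually orthogonal than under the interleaved assignment.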
Open Access
Parallel direct and hybrid methods based on row block partitioning for solving sparse linear systems
(2017-08) Torun, Fahreddin Şükrü
Solving systems of linear equations is a kernel operation in many scientific and industrial applications. These applications usually give rise to linear systems in which the coefficient matrix is very large and sparse. The need for solving these large and sparse systems within a reasonable time necessitates efficient and effective parallel solution methods. In this thesis, three novel approaches are proposed for reducing the parallel solution time of linear systems. First, a new parallel algorithm, ParBaMiN, is proposed in order to find the minimum 2-norm solution of underdetermined linear systems, where the coefficient matrix is in the form of column overlapping block diagonal. The conducted experiments demonstrate the scalability of ParBaMiN on both shared and distributed memory architectures. Secondly, a new graph theoretical partitioning method is introduced in order to reduce the number of iterations in the block Cimmino algorithm. Experimental results validate the effectiveness of the proposed partitioning method in terms of reducing the required number of iterations. Finally, we propose a new parallel hybrid method, BCDcols, which further reduces the number of iterations of the block Cimmino algorithm for matrices with dense columns. BCDcols combines the block Cimmino iterative algorithm and a dense direct method for solving the system. Experimental results show that BCDcols significantly improves the convergence rate of the block Cimmino method and hence reduces the parallel solution time.
Open Access
Parallel direct volume rendering of unstructured grids based on object-space decomposition
(1997-10) Fındık, Ferit
This work investigates object-space (OS) parallelization of an efficient ray-casting based direct volume rendering algorithm (DVR) for unstructured grids on distributed-memory architectures. The key point for a successful parallelization is to find an OS decomposition which maintains the OS coherency and computational load balance as much as possible. The OS decomposition problem is modeled as a graph partitioning (GP) problem with correct view-dependent node and edge weighting. As the parallel visualizations of the results of parallel engineering simulations are performed on the same machine, OS decomposition, which is necessary for each visualization instance because of the changes in the computational structures of the successive parallel steps, constitutes a typical case of the general remapping problem. A GP-based model is proposed for the solution of the general remapping problem by constructing an augmented remapping graph. The remapping tool RM-MeTiS, developed by modifying and enhancing the original MeTiS package for partitioning the remapping graph, is successfully used in the proposed parallel DVR algorithm. An effective view-dependent cell-clustering scheme is introduced to induce more tractable contracted view-independent remapping graphs for successive visualizations. An efficient estimation scheme with high accuracy is proposed for view-dependent node and edge weighting of the remapping graph. Speedup values as high as 22 are obtained on a Parsytec CC system with 24 processors in the visualization of benchmark volumetric datasets and the proposed DVR algorithm seems to be linearly scalable according to the experimental results.
Open Access
Partitioning models for general medium-grain parallel sparse tensor decomposition
(IEEE, 2021) Karsavuran, M. Ozan; Acer, S.; Aykanat, Cevdet
The focus of this article is efficient parallelization of the canonical polyadic decomposition algorithm utilizing the alternating least squares method for sparse tensors on distributed-memory architectures. We propose a hypergraph model for general medium-grain partitioning which does not enforce any topological constraint on the partitioning. The proposed model is based on splitting the given tensor into nonzero-disjoint component tensors. Then a mode-dependent coarse-grain hypergraph is constructed for each component tensor. A net amalgamation operation is proposed to form a composite medium-grain hypergraph from these mode-dependent coarse-grain hypergraphs to correctly encapsulate the minimization of the communication volume. We propose a heuristic which splits the nonzeros of dense slices to obtain sparse slices in component tensors. So we partially attain slice coherency at (sub)slice level since partitioning is performed on (sub)slices instead of individual nonzeros. We also utilize the well-known recursive-bipartitioning framework to improve the quality of the splitting heuristic. Finally, we propose a medium-grain tripartite graph model with the aim of a faster partitioning at the expense of increasing the total communication volume. Parallel experiments conducted on 10 real-world tensors on up to 1024 processors confirm the validity of the proposed hypergraph and graph models.
Open Access
Partitioning models for scaling distributed graph computations
(2019-08) Demirci, Gündüz Vehbi
The focus of this thesis is intelligent partitioning models and methods for scaling the performance of parallel graph computations on distributed-memory systems. Distributed databases utilize graph partitioning to provide servers with data-locality and workload-balance. Some queries performed on a database may form cascades due to the queries triggering each other. The current partitioning methods consider the graph structure and logs of query workload. We introduce the cascade-aware graph partitioning problem with the objective of minimizing the overall cost of communication operations between servers during cascade processes. We propose a randomized algorithm that integrates the graph structure and cascade processes to use as input for large-scale partitioning. Experiments on graphs representing real social networks demonstrate the effectiveness of the proposed solution in terms of the partitioning objectives. Sparse-general-matrix-multiplication (SpGEMM) is a key computational kernel used in scientific computing and high-performance graph computations. We propose an SpGEMM algorithm for Accumulo database which enables high performance distributed parallelism through its iterator framework. The proposed algorithm provides write-locality and avoids scanning input matrices multiple times by utilizing Accumulo's batch scanning capability and node-level parallelism structures. We also propose a matrix partitioning scheme that reduces the total communication volume and provides a workload-balance among servers. Extensive experiments performed on both real-world and synthetic sparse matrices show that the proposed algorithm and matrix partitioning scheme provide significant performance improvements. Scalability of parallel SpGEMM algorithms is heavily communication bound. Multidimensional partitioning of SpGEMM's workload is essential to achieve higher scalability.
We propose hypergraph models that utilize the arrangement of processors and also attain a multidimensional partitioning on SpGEMM's workload. Thorough experimentation performed on both realistic as well as synthetically generated SpGEMM instances demonstrates the effectiveness of the proposed partitioning models.
Open Access
Partitioning models for scaling parallel sparse matrix-matrix multiplication
(Association for Computing Machinery, 2018) Akbudak, Kadir; Selvitopi, Oğuz; Aykanat, Cevdet
We investigate outer-product-parallel, inner-product-parallel, and row-by-row-product-parallel formulations of sparse matrix-matrix multiplication (SpGEMM) on distributed memory architectures. For each of these three formulations, we propose a hypergraph model and a bipartite graph model for distributing SpGEMM computations based on one-dimensional (1D) partitioning of input matrices. We also propose a communication hypergraph model for each formulation for distributing communication operations. The computational graph and hypergraph models adopted in the first phase aim at minimizing the total message volume and balancing the computational loads of processors, whereas the communication hypergraph models adopted in the second phase aim at minimizing the total message count and balancing the message volume loads of processors. That is, the computational partitioning models reduce the bandwidth cost and the communication hypergraph models reduce the latency cost. Our extensive parallel experiments on up to 2048 processors for a wide range of realistic SpGEMM instances show that although the outer-product-parallel formulation scales better, the row-by-row-product-parallel formulation is more viable due to its significantly lower partitioning overhead and competitive scalability. For computational partitioning models, our experimental findings indicate that the proposed bipartite graph models are attractive alternatives to their hypergraph counterparts because of their lower partitioning overhead. Finally, we show that by reducing the latency cost besides the bandwidth cost through using the communication hypergraph models, the parallel SpGEMM time can be further improved up to 32%.
Open Access
Recursive bipartitioning models for performance improvement in sparse matrix computations
(2017-08) Acer, Seher
Sparse matrix computations are among the most important building blocks of linear algebra and arise in many scientific and engineering problems. Depending on the problem type, these computations may be in the form of sparse matrix dense matrix multiplication (SpMM), sparse matrix vector multiplication (SpMV), or factorization of a sparse symmetric matrix. For both SpMM and SpMV performed on distributed-memory architectures, the associated data and task partitions among processors affect the parallel performance to a great extent, especially for the sparse matrices with an irregular sparsity pattern. Parallel SpMM is characterized by high volumes of data communicated among processors, whereas both the volume and number of messages are important for parallel SpMV. For the factorization performed in envelope methods, the envelope size (i.e., profile) is an important factor which determines the performance. For improving the performance in each of these sparse matrix computations, we propose graph/hypergraph partitioning models that exploit the advantages provided by the recursive bipartitioning (RB) paradigm in order to meet the specific needs of the respective computation. In the models proposed for SpMM and SpMV, we utilize the RB process to enable targeting multiple volume-based communication cost metrics and the combination of volume- and number-based communication cost metrics in their partitioning objectives, respectively. In the model proposed for the factorization in envelope methods, the input matrix is reordered by utilizing the RB process in which two new quality metrics relating to profile minimization are defined and maintained. The experimental results show that the proposed RB-based approach outperforms the state-of-the-art for each mentioned computation.
Open Access
Reducing communication overhead in sparse matrix and tensor computations
(2020-08) Karsavuran, Mustafa Ozan
Encapsulating multiple communication cost metrics, i.e., bandwidth and latency, is proven to be important in reducing communication overhead in the parallelization of sparse and irregular applications. The communication hypergraph model was proposed in a two-phase setting for encapsulating multiple communication cost metrics. The reduce-communication hypergraph model suffers from failing to correctly encapsulate send-volume balancing. We propose a novel vertex weighting scheme that enables part weights to correctly encode send-volume loads of processors for send-volume balancing. The model also suffers from increasing the total communication volume during partitioning. To limit this increase, we propose a method that utilizes the recursive bipartitioning (RB) paradigm and refines each bipartition by vertex swaps. For performance evaluation, we consider column-parallel SpMV, which is one of the most widely known applications in which the reduce-task assignment problem arises. Extensive experiments on 313 matrices show that, compared to the existing model, the proposed models achieve considerable improvements in all communication cost metrics. These improvements lead to an average decrease of 30 percent in parallel SpMV time on 512 processors for 70 matrices with high irregularity. We further enhance the reduce-communication hypergraph model so that it also encapsulates the minimization of the maximum number of messages sent by a processor. For this purpose, we propose a novel cutsize metric which we realize using the RB paradigm while partitioning the reduce-communication hypergraph. We also introduce a new type of net for the communication hypergraph which models limiting the increase in the total communication volume directly with the partitioning objective. Experiments on 300 matrices show that the proposed models achieve considerable improvements in communication cost metrics which lead to better column-parallel SpMM time on 1024 processors.
We propose a hypergraph model for general medium-grain sparse tensor partitioning which does not enforce any topological constraint on the partitioning. The proposed model is based on splitting the given tensor into nonzero-disjoint component tensors. Then a mode-dependent coarse-grain hypergraph is constructed for each component tensor. A net amalgamation operation is proposed to form a composite medium-grain hypergraph from these mode-dependent coarse-grain hypergraphs to correctly encapsulate the minimization of the communication volume. We propose a heuristic which splits the nonzeros of dense slices to obtain sparse slices in component tensors. We also utilize the well-known RB paradigm to improve the quality of the splitting heuristic. We propose a medium-grain tripartite graph model with the aim of a faster partitioning at the expense of increasing the total communication volume. Parallel experiments conducted on 10 real-world tensors on up to 1024 processors confirm the validity of the proposed hypergraph and graph models.
Open Access
Scaling sparse matrix-matrix multiplication in the Accumulo database
(Springer, 2020) Demirci, Gündüz Vehbi; Aykanat, Cevdet
We propose and implement a sparse matrix-matrix multiplication (SpGEMM) algorithm running on top of Accumulo’s iterator framework which enables high performance distributed parallelism. The proposed algorithm provides write-locality while ingesting the output matrix back to the database by utilizing row-by-row parallel SpGEMM. The proposed solution also avoids scanning the input matrices multiple times by making use of Accumulo’s batch scanning capability, which is used for accessing multiple ranges of key-value pairs in parallel. Even though the use of batch-scanning introduces some latency overheads, these overheads are alleviated by the proposed solution and by using node-level parallelism structures. We also propose a matrix partitioning scheme which reduces the total communication volume and provides a balance of workload among servers. The results of extensive experiments performed on both real-world and synthetic sparse matrices show that the proposed algorithm scales significantly better than the outer-product parallel SpGEMM algorithm available in the Graphulo library. By applying the proposed matrix partitioning, the performance of the proposed algorithm is further improved considerably.