Browsing by Author "Selvitopi, O."

Open Access
The effect of various sparsity structures on parallelism and algorithms to reveal those structures
(Birkhauser, 2020) Selvitopi, O.; Acer, S.; Manguoğlu, M.; Aykanat, Cevdet
Structured sparse matrices can greatly benefit parallel numerical methods in terms of parallel performance and convergence. In this chapter, we present combinatorial models for obtaining several different sparse matrix forms. There are four basic forms we focus on: singly-bordered block-diagonal form, doubly-bordered block-diagonal form, nonempty off-diagonal block minimization, and block diagonal with overlap form. For each of these forms, we first present the form in detail and describe what goals are sought within the form, then examine the combinatorial models that attain the respective form while targeting the sought goals, and finally explain in which aspects the forms benefit certain parallel numerical methods and their relationship with the models. Our work focuses especially on graph and hypergraph partitioning models in obtaining the mentioned forms. Despite their relatively high preprocessing overhead compared to other heuristics, they have proven to model the given problem more accurately, and this overhead can often be amortized due to the fact that the matrix structure does not change much during a typical numerical simulation. This chapter presents a number of models and their relationship with parallel numerical methods.
Open Access
Fast shared-memory streaming multilevel graph partitioning
(Elsevier, 2020-09-12) Jafari, N.; Selvitopi, O.; Aykanat, Cevdet
A fast parallel graph partitioner can benefit many applications by reducing data transfers. The online methods for partitioning graphs have to be fast and they often rely on simple one-pass streaming algorithms, while the offline methods for partitioning graphs contain more involved algorithms, and the most successful methods in this category belong to the multilevel approaches. In this work, we assess the feasibility of using streaming graph partitioning algorithms within the multilevel framework. Our end goal is to come up with a fast parallel offline multilevel partitioner that can produce competitive cutsize quality. We rely on a simple but fast and flexible streaming algorithm throughout the entire multilevel framework. This streaming algorithm serves multiple purposes in the partitioning process: a clustering algorithm in the coarsening, an effective algorithm for the initial partitioning, and a fast refinement algorithm in the uncoarsening. Its simple nature also lends itself easily to parallelization. The experiments on various graphs show that our approach is on average up to 5.1x faster than the multi-threaded MeTiS, which comes at the expense of only 2x worse cutsize.
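As a rough illustration of what a "simple one-pass streaming algorithm" can look like, the sketch below uses a greedy rule in the style of the well-known Linear Deterministic Greedy heuristic. It is not the algorithm from the paper above: the function names, the scoring rule, and the `capacity` parameter are assumptions made for this example only.

```python
def stream_partition(adj, k, capacity):
    """One-pass greedy partitioning (LDG-style sketch, not the paper's method).

    adj: dict mapping each vertex to a list of its neighbors.
    k: number of parts; capacity: maximum number of vertices per part.
    Vertices are assigned in the iteration order of adj and never revisited.
    """
    part = {}
    sizes = [0] * k
    for v, nbrs in adj.items():
        # Count already-placed neighbors of v in each part.
        counts = [0] * k
        for u in nbrs:
            if u in part:
                counts[part[u]] += 1
        best, best_score = None, -1.0
        for p in range(k):
            if sizes[p] >= capacity:
                continue  # part is full
            # Prefer parts holding many neighbors, damped by how full they are.
            score = counts[p] * (1.0 - sizes[p] / capacity)
            if score > best_score or (score == best_score and sizes[p] < sizes[best]):
                best, best_score = p, score
        if best is None:
            raise ValueError("k * capacity is smaller than the number of vertices")
        part[v] = best
        sizes[best] += 1
    return part


def cutsize(adj, part):
    """Number of undirected edges whose endpoints lie in different parts."""
    return sum(1 for v, nbrs in adj.items() for u in nbrs
               if v < u and part[v] != part[u])


# Two triangles joined by the single edge 2-3.
example = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
           3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
parts = stream_partition(example, k=2, capacity=3)
print(parts, cutsize(example, parts))
```

On this small input the greedy rule keeps each triangle together, so only the bridging edge is cut. In a multilevel setting such a pass would be one building block among several, as the abstract describes.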
Open Access
Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems
(Elsevier BV, 2016) Acer, S.; Selvitopi, O.; Aykanat, Cevdet
We propose a comprehensive and generic framework to minimize multiple and different volume-based communication cost metrics for sparse matrix dense matrix multiplication (SpMM). SpMM is an important kernel that finds application in computational linear algebra and big data analytics. On distributed memory systems, this kernel is usually characterized by its high communication volume requirements. Our approach targets irregularly sparse matrices and is based on both graph and hypergraph partitioning models that rely on the widely adopted recursive bipartitioning paradigm. The proposed models are lightweight, portable (can be realized using any graph and hypergraph partitioning tool) and can simultaneously optimize different cost metrics besides total volume, such as maximum send/receive volume, maximum sum of send and receive volumes, etc., in a single partitioning phase. They allow one to define and optimize as many custom volume-based metrics as desired through a flexible formulation. The experiments on a wide range of about a thousand matrices show that the proposed models drastically reduce the maximum communication volume compared to the standard partitioning models that only address the minimization of total volume. The improvements obtained on volume-based partition quality metrics using our models are validated with parallel SpMM as well as parallel multi-source BFS experiments on two large-scale systems. For parallel SpMM, compared to the standard partitioning models, our graph and hypergraph partitioning models respectively achieve reductions of 14% and 22% in runtime, on average. Compared to the state-of-the-art partitioner UMPa, our graph model is overall 14.5× faster and achieves an average improvement of 19% in the partition quality on instances that are bounded by maximum volume. For parallel BFS, we show on graphs with more than a billion edges that the scalability can be significantly improved with our models compared to a recently proposed two-dimensional partitioning model.
Open Access
Optimizing nonzero-based sparse matrix partitioning models via reducing latency
(Academic Press, 2018) Acer, S.; Selvitopi, O.; Aykanat, Cevdet
For the parallelization of sparse matrix-vector multiplication (SpMV) on distributed memory systems, nonzero-based fine-grain and medium-grain partitioning models attain the lowest communication volume and computational imbalance among all partitioning models. This usually comes, however, at the expense of high message count, i.e., high latency overhead. This work addresses this shortcoming by proposing new fine-grain and medium-grain models that are able to minimize communication volume and message count in a single partitioning phase. The new models utilize message nets in order to encapsulate the minimization of total message count. We further fine-tune these models by proposing delayed addition and thresholding for message nets in order to establish a trade-off between the conflicting objectives of minimizing communication volume and message count. The experiments on an extensive dataset of nearly one thousand matrices show that the proposed models improve the total message count of the original nonzero-based models by up to 27% on average, which is reflected in the parallel runtime of SpMV as an average reduction of 15% on 512 processors.
Open Access
A Recursive Hypergraph Bipartitioning Framework for Reducing Bandwidth and Latency Costs Simultaneously
(IEEE Computer Society, 2017) Selvitopi, O.; Acer, S.; Aykanat, Cevdet
Intelligent partitioning models are commonly used for efficient parallelization of irregular applications on distributed systems. These models usually aim to minimize a single communication cost metric, which is either related to communication volume or message count. However, both volume- and message-related metrics should be taken into account during partitioning for a more efficient parallelization. There are only a few works that consider both of them and they usually address each in separate phases of a two-phase approach. In this work, we propose a recursive hypergraph bipartitioning framework that reduces the total volume and total message count in a single phase. In this framework, the standard hypergraph models, nets of which already capture the bandwidth cost, are augmented with message nets. The message nets encode the message count so that minimizing conventional cutsize captures the minimization of bandwidth and latency costs together. Our model provides a more accurate representation of the overall communication cost by incorporating both the bandwidth and the latency components into the partitioning objective. The use of the widely-adopted successful recursive bipartitioning framework provides the flexibility of using any existing hypergraph partitioner. The experiments on instances from different domains show that our model on the average achieves up to 52 percent reduction in total message count and hence results in 29 percent reduction in parallel running time compared to the model that considers only the total volume. © 2016 IEEE.
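To make the two cost metrics concrete, here is a minimal sketch that counts total communication volume (bandwidth) and total message count (latency) for a given partition. It assumes a simple row-parallel sparse matrix-vector multiply in which the part owning row i also owns x[i]; that setting, and the names `comm_metrics` and `row_part`, are assumptions for illustration and are not taken from the paper, which works with hypergraph models and message nets.

```python
def comm_metrics(nonzeros, row_part):
    """Return (total_volume, total_message_count) for row-parallel y = A x.

    nonzeros: iterable of (i, j) positions of nonzeros in a square matrix.
    row_part: row_part[i] is the part owning row i and vector entry x[i]
              (an assumed, simplified ownership rule).
    """
    # A part needs x[j] from another part if one of its rows has a nonzero
    # in column j and x[j] is owned elsewhere.
    needed = set()
    for i, j in nonzeros:
        if row_part[i] != row_part[j]:
            needed.add((row_part[i], j))
    # Volume: one word for every distinct (receiver, x[j]) pair.
    volume = len(needed)
    # Messages: one per distinct (sender, receiver) pair, regardless of size.
    messages = {(row_part[j], receiver) for receiver, j in needed}
    return volume, len(messages)


# 4x4 example: rows 0-1 on part 0, rows 2-3 on part 1.
nz = [(0, 0), (1, 1), (2, 2), (3, 3), (0, 2), (0, 3), (1, 2), (2, 0)]
print(comm_metrics(nz, [0, 0, 1, 1]))  # prints (3, 2)
```

Three words are exchanged, but they travel in only two messages. Two partitions with the same volume can differ widely in message count, which is why the abstract argues for considering both metrics together.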
Open Access
Reducing latency cost in 2D sparse matrix partitioning models
(Elsevier BV, 2016) Selvitopi, O.; Aykanat, Cevdet
Sparse matrix partitioning is a common technique used for improving the performance of parallel linear iterative solvers. Compared to solvers used for symmetric linear systems, solvers for nonsymmetric systems offer more potential for addressing multiple communication metrics due to the flexibility of adopting different partitions on the input and output vectors of sparse matrix-vector multiplication operations. In this regard, there exist works based on one-dimensional (1D) and two-dimensional (2D) fine-grain partitioning models that effectively address both bandwidth and latency costs in nonsymmetric solvers. In this work, we propose two new models based on 2D checkerboard and jagged partitioning. These models aim at minimizing total message count while maintaining a balance on communication volume loads of processors; hence, they address both bandwidth and latency costs. We evaluate all partitioning models on two nonsymmetric system solvers implemented using the widely adopted PETSc toolkit and conduct extensive experiments using these solvers on a modern system (a BlueGene/Q machine), successfully scaling them up to 8K processors. Along with the proposed models, we put practical aspects of eight evaluated models (two 1D- and six 2D-based) under thorough analysis. To the best of our knowledge, this is the first work that analyzes practical performance of 2D models on this scale. Among evaluated models, the models that rely on 2D jagged partitioning obtain the most promising results by striking a balance between minimizing bandwidth and latency costs.
Open Access
Regularizing irregularly sparse point-to-point communications
(Association for Computing Machinery, 2019) Selvitopi, O.; Aykanat, Cevdet
This work tackles the communication challenges posed by latency-bound applications with irregular communication patterns, i.e., applications with high average and/or maximum message counts. We propose a novel algorithm for reorganizing a given set of irregular point-to-point messages with the objective of reducing total latency cost at the expense of increased volume. We organize processes into a virtual process topology inspired by the k-ary n-cube networks and regularize irregular messages by imposing regular communication pattern(s) onto them. Exploiting this process topology, we propose a flexible store-and-forward algorithm to control the trade-off between latency and volume. Our approach is able to reduce the communication time of sparse-matrix multiplication with latency-bound instances drastically: up to 22.6× for 16K processes on a 3D Torus network and up to 7.2× for 4K processes on a Dragonfly network, with its performance getting better with increasing number of processes.
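The latency and volume trade-off in a k-ary n-cube style virtual topology can be sketched as follows. This is one plausible reading for illustration, not the paper's algorithm: it assumes ranks are written as n base-k digits and that a message is forwarded by correcting one digit per stage, lowest dimension first. The helper names `coords` and `route` are hypothetical.

```python
def coords(rank, k, n):
    """Base-k digits of rank, least significant dimension first."""
    return [(rank // k**d) % k for d in range(n)]


def route(src, dst, k, n):
    """Intermediate and final ranks visited when forwarding src -> dst.

    Assumed scheme: in stage d the message moves to the process whose digit
    in dimension d matches the destination, so a process only ever talks to
    the k - 1 other processes that differ from it in a single dimension.
    """
    hops, cur = [], src
    for d in range(n):
        s = (cur // k**d) % k
        t = (dst // k**d) % k
        if s != t:
            cur += (t - s) * k**d  # overwrite digit d with the destination's
            hops.append(cur)
    return hops


# 64 processes arranged as a 4-ary 3-cube.
print(coords(27, 4, 3))    # prints [3, 2, 1]
print(route(0, 63, 4, 3))  # prints [3, 15, 63]
```

Under these assumptions a process sends at most n * (k - 1) messages (9 here) instead of up to 63 direct ones, while each word may be forwarded up to n times. That is the sense in which latency is reduced at the expense of increased volume.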