Browsing by Subject "Cache Locality"
Now showing 1 - 3 of 3
Item Open Access
Hypergraph partitioning based models and methods for exploiting cache locality in sparse matrix-vector multiplication (Society for Industrial and Applied Mathematics, 2013-02-27) Akbudak, K.; Kayaaslan, E.; Aykanat, Cevdet
Sparse matrix-vector multiplication (SpMxV) is a kernel operation widely used in iterative linear solvers. The same sparse matrix is multiplied by a dense vector repeatedly in these solvers. Matrices with irregular sparsity patterns make it difficult to utilize cache locality effectively in SpMxV computations. In this work, we investigate single- and multiple-SpMxV frameworks for exploiting cache locality in SpMxV computations. For the single-SpMxV framework, we propose two cache-size-aware row/column reordering methods based on one-dimensional (1D) and two-dimensional (2D) top-down sparse matrix partitioning. We utilize the column-net hypergraph model for the 1D method and enhance the row-column-net hypergraph model for the 2D method. The primary aim in both of the proposed methods is to maximize the exploitation of temporal locality in accessing input-vector entries. The multiple-SpMxV framework depends on splitting a given matrix into a sum of multiple nonzero-disjoint matrices. We propose a cache-size-aware splitting method based on 2D top-down sparse matrix partitioning by utilizing the row-column-net hypergraph model. The aim in this proposed method is to maximize the exploitation of temporal locality in accessing both input- and output-vector entries. We evaluate the validity of our models and methods on a wide range of sparse matrices using both cache-miss simulations and actual runs using OSKI. Experimental results show that the proposed methods and models outperform state-of-the-art schemes. © 2013 Society for Industrial and Applied Mathematics

Item Open Access
Increasing data reuse in parallel sparse matrix-vector and matrix-transpose-vector multiply on shared-memory architectures (2014) Karsavuran, Mustafa Ozan
Sparse matrix-vector and matrix-transpose-vector multiplications (sparse AAᵀx) are the kernel operations used in iterative solvers. The sparsity pattern of the input matrix A, as well as that of its transpose, remains the same throughout the iterations. The CPU cache cannot be utilized properly during these sparse AAᵀx operations due to the irregular sparsity pattern of the matrix. We propose two parallelization strategies for sparse AAᵀx. Our methods partition matrix A in order to exploit cache locality for matrix nonzeros and vector entries. We conduct experiments on the recently released Intel® Xeon Phi™ coprocessor involving a large variety of sparse matrices. Experimental results show that the proposed methods achieve higher performance improvements than the state-of-the-art methods in the literature.
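As background for the first record above, the SpMxV kernel that the reordering methods target can be written as a standard loop over a compressed sparse row (CSR) matrix. The sketch below is a generic textbook version, not code from the paper; the array names (rowptr, colind, val) are conventional CSR fields chosen for illustration. The irregular x[colind[k]] accesses are where temporal locality is lost, and cache-size-aware row/column reordering aims to keep the touched x entries resident in cache.

    /* Minimal CSR-based SpMxV sketch: y = A * x.
       Generic illustration; not taken from the paper above. */
    #include <stddef.h>

    void spmv_csr(size_t nrows,
                  const size_t *rowptr,  /* row pointers, length nrows + 1 */
                  const size_t *colind,  /* column indices of nonzeros */
                  const double *val,     /* nonzero values */
                  const double *x,       /* input vector */
                  double *y)             /* output vector */
    {
        for (size_t i = 0; i < nrows; ++i) {
            double sum = 0.0;
            for (size_t k = rowptr[i]; k < rowptr[i + 1]; ++k)
                sum += val[k] * x[colind[k]];  /* irregular accesses into x */
            y[i] = sum;
        }
    }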
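For the second record, the sparse AAᵀx kernel can be sketched, under the common formulation y = A(Aᵀx), as two passes over a single CSR copy of A: row i's nonzeros scatter into a work vector z during the Aᵀx pass and are read again during the Az pass, which is the data reuse that the proposed partitioning strategies try to capture in cache. Again a generic sketch with illustrative names, not the thesis implementation.

    /* Sketch of y = A * (A^T * x) reusing one CSR copy of A.
       z is a work vector of length ncols; illustrative only. */
    #include <stddef.h>
    #include <string.h>

    void spmv_aat_csr(size_t nrows, size_t ncols,
                      const size_t *rowptr, const size_t *colind,
                      const double *val, const double *x,
                      double *z, double *y)
    {
        memset(z, 0, ncols * sizeof(double));

        /* Pass 1: z = A^T * x (row i of A scatters x[i] into z). */
        for (size_t i = 0; i < nrows; ++i)
            for (size_t k = rowptr[i]; k < rowptr[i + 1]; ++k)
                z[colind[k]] += val[k] * x[i];

        /* Pass 2: y = A * z (the same nonzeros of row i are read again). */
        for (size_t i = 0; i < nrows; ++i) {
            double sum = 0.0;
            for (size_t k = rowptr[i]; k < rowptr[i + 1]; ++k)
                sum += val[k] * z[colind[k]];
            y[i] = sum;
        }
    }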
Item Open Access
Locality aware reordering for sparse triangular solve (2014) Torun, Tuğba
Sparse triangular solve (SpTS) is a commonly used kernel in a wide variety of scientific and engineering applications. Efficient implementation of this kernel on current architectures with deep cache hierarchies is crucial for attaining high performance. In this work, we propose an effective framework for cache-aware SpTS. Solving sparse symmetric linear systems with direct methods requires triangular solves of the form LUz = b, where L is the lower triangular factor and U is the upper triangular factor. For cache utilization, we reorder the rows and columns of the L factor according to the data dependencies of the triangular solve. We represent the data dependencies of the triangular solve as a directed hypergraph and construct an ordered partitioning model on this structure. For this purpose, we developed a variant of the Fiduccia-Mattheyses (FM) algorithm that respects the dependency constraints. We also adopt the idea of splitting L factors into dense and sparse components and solving them separately with different autotuned kernels to achieve more flexibility in this process. We investigate the performance variation of different storage schemes of L factors and the corresponding sparse and dense components. We utilize the autotuning provided by the Optimized Sparse Kernel Interface (OSKI) to reduce the performance degradation incurred due to the gap between processor and memory speeds. Experiments performed on real-world datasets verify the effectiveness of the proposed framework.
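As background for the third record, the lower triangular solve Lz = b can be written as forward substitution over a CSR-stored L. The sketch below is a textbook serial version, not the cache-aware framework described above, and it assumes each row stores its diagonal entry last. The dependence of z[i] on earlier z[j] (j < i) through L's off-diagonal nonzeros is exactly the dependency structure that the reordering and the FM variant must respect.

    /* Textbook forward substitution for L * z = b with L lower triangular,
       stored in CSR with the diagonal as the last entry of each row.
       Illustrative only; not the proposed cache-aware framework. */
    #include <stddef.h>

    void sptrsv_lower_csr(size_t n,
                          const size_t *rowptr, const size_t *colind,
                          const double *val, const double *b, double *z)
    {
        for (size_t i = 0; i < n; ++i) {
            double sum = b[i];
            size_t diag = rowptr[i + 1] - 1;      /* assumed diagonal position */
            for (size_t k = rowptr[i]; k < diag; ++k)
                sum -= val[k] * z[colind[k]];     /* needs earlier z[j], j < i */
            z[i] = sum / val[diag];               /* divide by L(i,i) */
        }
    }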