Locality-aware parallel sparse matrix-vector and matrix-transpose-vector multiplication on many-core processors

Karsavuran, M. O.; Akbudak, K.; Aykanat, Cevdet

Files

Locality-Aware Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Processors.pdf (1.15 MB)

Date

2016

Authors

Karsavuran, M. O.

Akbudak, K.

Aykanat, Cevdet

Abstract

Sparse matrix-vector and matrix-transpose-vector multiplication (SpMMTV) repeatedly performed as z ← A^T x and y ← A z (or y ← A w) for the same sparse matrix A is a kernel operation widely used in various iterative solvers. One important optimization for serial SpMMTV is reusing A-matrix nonzeros, which halves the memory bandwidth requirement. However, thread-level parallelization of SpMMTV that reuses A-matrix nonzeros necessitates concurrent writes to the same output-vector entries. These concurrent writes can be handled in two ways: via atomic updates or via thread-local temporary output vectors that will undergo a reduction operation, neither of which is efficient or scalable on processors with many cores and complicated cache-coherency protocols. In this work, we identify five quality criteria for efficient and scalable thread-level parallelization of SpMMTV that utilizes one-dimensional (1D) matrix partitioning. We also propose two locality-aware 1D partitioning methods, which achieve reusing A-matrix nonzeros and intermediate z-vector entries; exploiting locality in accessing x-, y-, and z-vector entries; and reducing the number of concurrent writes to the same output-vector entries. These two methods utilize rowwise and columnwise singly bordered block-diagonal (SB) forms of A. We evaluate the validity of our methods on a wide range of sparse matrices. Experiments on the 60-core cache-coherent Intel Xeon Phi processor show the validity of the identified quality criteria and the validity of the proposed methods in practice. The results also show that the performance improvement from reusing A-matrix nonzeros compensates for the overhead of concurrent writes through the proposed SB-based methods.
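
The nonzero-reuse idea mentioned in the abstract can be illustrated with a minimal serial sketch. It is not taken from the paper: the CSR storage layout, the function name `spmmtv_fused`, and its parameter names are assumptions made here for illustration, and it covers only the independent variant z ← A^T x and y ← A w. Each stored nonzero is read once and used for both products. The scatter into `z[j]` is the kind of update that would become a concurrent write if the rows were split across threads.

```python
def spmmtv_fused(n_rows, n_cols, row_ptr, col_idx, vals, x, w):
    """Compute y = A w and z = A^T x in a single pass over A.

    A is assumed to be stored in CSR form (row_ptr, col_idx, vals).
    This is an illustrative serial sketch, not the paper's method.
    """
    y = [0.0] * n_rows
    z = [0.0] * n_cols
    for i in range(n_rows):
        acc = 0.0
        xi = x[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            a = vals[k]          # nonzero a_ij, loaded once
            acc += a * w[j]      # gather: contributes to y_i = sum_j a_ij * w_j
            z[j] += a * xi       # scatter: contributes to z_j = sum_i a_ij * x_i
        y[i] = acc
    return y, z


# Small example: A = [[1, 0, 2],
#                     [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
y, z = spmmtv_fused(2, 3, row_ptr, col_idx, vals, x=[1.0, 2.0], w=[1.0, 1.0, 1.0])
print(y)  # [3.0, 3.0]
print(z)  # [1.0, 6.0, 2.0]
```

In this sketch the writes to `y[i]` are private to row i, whereas several rows may update the same `z[j]`; a row-parallel version would therefore need atomic updates or thread-local copies of `z`, which are the two options the abstract describes as not scaling well.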

Source Title

IEEE Transactions on Parallel and Distributed Systems

Publisher

Institute of Electrical and Electronics Engineers

Keywords

Cache locality, Intel many integrated core architecture (Intel MIC), Matrix reordering, Singly bordered block-diagonal form, Sparse matrix, Sparse matrix-vector multiplication, Iterative methods, Matrix algebra, Vectors, Bordered block diagonal form, Integrated core, Intel Xeon Phi, Sparse matrices, Computer architecture

Permalink

http://hdl.handle.net/11693/36500

Published Version (Please cite this version)

http://dx.doi.org/10.1109/TPDS.2015.2453970

Collections

Scholarly Publications - Computer Engineering

Language

English

Type

Article
