Locality-aware parallel sparse matrix-vector and matrix-transpose-vector multiplication on many-core processors

buir.contributor.author: Aykanat, Cevdet
dc.citation.epage: 1726 (en_US)
dc.citation.issueNumber: 6 (en_US)
dc.citation.spage: 1713 (en_US)
dc.citation.volumeNumber: 27 (en_US)
dc.contributor.author: Karsavuran, M. O. (en_US)
dc.contributor.author: Akbudak, K. (en_US)
dc.contributor.author: Aykanat, Cevdet (en_US)
dc.date.accessioned: 2018-04-12T10:42:27Z
dc.date.available: 2018-04-12T10:42:27Z
dc.date.issued: 2016 (en_US)
dc.department: Department of Computer Engineering (en_US)
dc.description.abstract: Sparse matrix-vector and matrix-transpose-vector multiplication (SpMMTV), repeatedly performed as z ← A^T x and y ← A z (or y ← A w) for the same sparse matrix A, is a kernel operation widely used in various iterative solvers. One important optimization for serial SpMMTV is reusing A-matrix nonzeros, which halves the memory bandwidth requirement. However, thread-level parallelization of SpMMTV that reuses A-matrix nonzeros necessitates concurrent writes to the same output-vector entries. These concurrent writes can be handled in two ways: via atomic updates, or via thread-local temporary output vectors that then undergo a reduction operation. Neither approach is efficient or scalable on processors with many cores and complicated cache-coherency protocols. In this work, we identify five quality criteria for efficient and scalable thread-level parallelization of SpMMTV that utilizes one-dimensional (1D) matrix partitioning. We also propose two locality-aware 1D partitioning methods, which reuse A-matrix nonzeros and intermediate z-vector entries; exploit locality in accessing x-, y-, and z-vector entries; and reduce the number of concurrent writes to the same output-vector entries. These two methods utilize rowwise and columnwise singly bordered block-diagonal (SB) forms of A. We evaluate our methods on a wide range of sparse matrices. Experiments on the 60-core cache-coherent Intel Xeon Phi processor show the validity of the identified quality criteria and of the proposed methods in practice. The results also show that, with the proposed SB-based methods, the performance improvement from reusing A-matrix nonzeros compensates for the overhead of concurrent writes. (en_US)
dc.identifier.doi: 10.1109/TPDS.2015.2453970 (en_US)
dc.identifier.issn: 1045-9219
dc.identifier.uri: http://hdl.handle.net/11693/36500
dc.language.iso: English (en_US)
dc.publisher: Institute of Electrical and Electronics Engineers (en_US)
dc.relation.isversionof: http://dx.doi.org/10.1109/TPDS.2015.2453970 (en_US)
dc.source.title: IEEE Transactions on Parallel and Distributed Systems (en_US)
dc.subject: Cache locality (en_US)
dc.subject: Intel many integrated core architecture (Intel MIC) (en_US)
dc.subject: Matrix reordering (en_US)
dc.subject: Singly bordered block-diagonal form (en_US)
dc.subject: Sparse matrix (en_US)
dc.subject: Sparse matrix-vector multiplication (en_US)
dc.subject: Iterative methods (en_US)
dc.subject: Matrix algebra (en_US)
dc.subject: Vectors (en_US)
dc.subject: Bordered block diagonal form (en_US)
dc.subject: Integrated core (en_US)
dc.subject: Intel Xeon Phi (en_US)
dc.subject: Sparse matrices (en_US)
dc.subject: Computer architecture (en_US)
dc.title: Locality-aware parallel sparse matrix-vector and matrix-transpose-vector multiplication on many-core processors (en_US)
dc.type: Article (en_US)
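
The nonzero reuse mentioned in the abstract can be illustrated with a minimal serial sketch. It is not the authors' implementation: the CSR storage layout, the function name `fused_spmmtv`, and its parameter names are assumptions made for illustration. It computes y ← A w and z ← A^T x in a single sweep, so each nonzero of A is read once instead of twice.

```python
def fused_spmmtv(n_rows, n_cols, row_ptr, col_idx, vals, x, w):
    """Hypothetical helper: compute y = A w and z = A^T x in one pass.

    A is assumed to be stored in CSR form (row_ptr, col_idx, vals).
    """
    y = [0.0] * n_rows
    z = [0.0] * n_cols
    for i in range(n_rows):
        xi = x[i]
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            a = vals[k]          # nonzero a_ij is loaded once ...
            acc += a * w[j]      # ... used for row i of y = A w
            z[j] += a * xi       # ... and reused for z = A^T x (scatter)
        y[i] = acc
    return y, z


# Small example: A = [[1, 0, 2],
#                     [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
y, z = fused_spmmtv(2, 3, row_ptr, col_idx, vals, x=[1.0, 2.0], w=[1.0, 1.0, 1.0])
```

If the rows were divided among threads, the scatter update `z[j] += a * xi` is where different threads could write to the same output entry. This is the concurrent-write problem the abstract describes.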

Files

Original bundle
Name: Locality-Aware Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Processors.pdf
Size: 1.15 MB
Format: Adobe Portable Document Format
Description: Full printable version