Reducing coherency traffic volume in chip multiprocessors through pointer analysis
Date
Authors
Editor(s)
Advisor
Supervisor
Co-Advisor
Co-Supervisor
Instructor
Source Title
Print ISSN
Electronic ISSN
Publisher
Volume
Issue
Pages
Language
Type
Journal Title
Journal ISSN
Volume Title
Attention Stats
Usage Stats
views
downloads
Series
Abstract
With increasing number of cores in chip multiprocessors (CMPs), it gets more challenging to provide cache coherency efficiently. Although snooping based protocols are appropriate solutions to small scale systems, they are inefficient for large systems because of the limited bandwidth. Therefore, large scale CMPs require directory based solutions where a hardware structure called directory holds the information. This directory keeps track of all memory blocks and which core's cache stores a copy of these blocks. The directory sends messages only to caches that store relevant blocks and also coordinates simultaneous accesses to a cache block. As directory based protocols scaled to many cores, performance, network-on-chip (NoC) traffic, and bandwidth become major problems. In this thesis, we present software mechanisms to improve effectiveness of directory based cache coherency on CMPs with shared memory. In multithreaded applications, some of the data accesses do not disrupt cache coherency, but they still produce coherency messages among cores. For example, read-only (private) data can be considered in this category. On the other hand, if data is accessed by at least two cores and at least one of them is a write operation, it is called shared data. In our proposed system, private data and shared data are determined at compile time, and cache coherency protocol only applies to shared data. We implement our approach in two stages. First, we use Andersen's pointer analysis to analyze a program and mark its private instructions, i.e instructions that load or store private data, at compile time. Second, we run the program in Sniper Multi-Core Simulator [1] with the proposed hardware con guration. We used SPLASH-2 and PARSEC-2.1 parallel benchmarks to test our approach. Simulation results show that our approach reduces cycle count, dynamic random access memory (DRAM) accesses, and coherency traffic.