Reducing coherency traffic volume in chip multiprocessors through pointer analysis

Date

2017-09

Editor(s)

Advisor

Öztürk, Özcan

Supervisor

Co-Advisor

Co-Supervisor

Instructor

Source Title

Print ISSN

Electronic ISSN

Publisher

Bilkent University

Volume

Issue

Pages

Language

English

Journal Title

Journal ISSN

Volume Title

Series

Abstract

With increasing number of cores in chip multiprocessors (CMPs), it gets more challenging to provide cache coherency efficiently. Although snooping based protocols are appropriate solutions to small scale systems, they are inefficient for large systems because of the limited bandwidth. Therefore, large scale CMPs require directory based solutions where a hardware structure called directory holds the information. This directory keeps track of all memory blocks and which core's cache stores a copy of these blocks. The directory sends messages only to caches that store relevant blocks and also coordinates simultaneous accesses to a cache block. As directory based protocols scaled to many cores, performance, network-on-chip (NoC) traffic, and bandwidth become major problems. In this thesis, we present software mechanisms to improve effectiveness of directory based cache coherency on CMPs with shared memory. In multithreaded applications, some of the data accesses do not disrupt cache coherency, but they still produce coherency messages among cores. For example, read-only (private) data can be considered in this category. On the other hand, if data is accessed by at least two cores and at least one of them is a write operation, it is called shared data. In our proposed system, private data and shared data are determined at compile time, and cache coherency protocol only applies to shared data. We implement our approach in two stages. First, we use Andersen's pointer analysis to analyze a program and mark its private instructions, i.e instructions that load or store private data, at compile time. Second, we run the program in Sniper Multi-Core Simulator [1] with the proposed hardware con guration. We used SPLASH-2 and PARSEC-2.1 parallel benchmarks to test our approach. Simulation results show that our approach reduces cycle count, dynamic random access memory (DRAM) accesses, and coherency traffic.

Course

Other identifiers

Book Title

Citation

item.page.isversionof