Browsing by Subject "Cache memory"
Now showing 1 - 7 of 7
Results Per Page
Sort Options
Item Open Access Boosting performance of directory-based cache coherence protocols with coherence bypass at subpage granularity and a novel on-chip page table(ACM, 2016- 05) Soltaniyeh, M.; Kadayıf, I.; Öztürk, ÖzcanChip multiprocessors (CMPs) require effective cache coher-ence protocols as well as fast virtual-To-physical address trans-lation mechanisms for high performance. Directory-based cache coherence protocols are the state-of-The-Art approaches in many-core CMPs to keep the data blocks coherent at the last level private caches. However, the area overhead and high associativity requirement of the directory structures may not scale well with increasingly higher number of cores. As shown in some prior studies, a significant percentage of data blocks are accessed by only one core, therefore, it is not necessary to keep track of these in the directory struc-ture. In this study, we have two major contributions. First, we show that compared to the classification of cache blocks at page granularity as done in some previous studies, data block classification at subpage level helps to detect consid-erably more private data blocks. Consequently, it reduces the percentage of blocks required to be tracked in the di-rectory significantly compared to similar page level classification approaches. This, in turn, enables smaller directory caches with lower associativity to be used in CMPs without hurting performance, thereby helping the directory struc-ture to scale gracefully with the increasing number of cores. Memory block classification at subpage level, however, may increase the frequency of the Operating System's (OS) in-volvement in updating the maintenance bits belonging to subpages stored in page table entries, nullifying some por-tion of performance benefits of subpage level data classification. To overcome this, we propose a distributed on-chip page table as a our second contribution. © 2016 Copyright held by the owner/author(s).Item Open Access NS-SRAM: neighborhood solidarity SRAM for reliability enhancement of SRAM memories(IEEE, 2016-08-09) Alouani, I.; Ahangari, Hamzeh; Öztürk, Özcan; Niar, S.Technology shift and voltage scaling increased the susceptibility of Static Random Access Memories (SRAMs) to errors dramatically. In this paper, we present NS-SRAM, for Neighborhood Solidarity SRAM, a new technique to enhance error resilience of SRAMs by exploiting the adjacent memory bit data. Bit cells of a memory line are paired together in circuit level to mutually increase the static noise margin and critical charge of a cell. Unlike existing techniques, NS-SRAM aims to enhance both Bit Error Rate (BER) and Soft Error rate (SER) at the same time. Due to auto-adaptive joiners, each of the adjacent cells' nodes is connected to its counterpart in the neighbor bit. NS-SRAM enhances read-stability by increasing critical Read Static Noise Margin (RSNM), thereby decreasing faults when circuit operates under voltage scaling. It also increases hold-stability and critical charge to mitigate soft-errors. By the proposed technique, reliability of SRAM based structures such as cache memories and register files can drastically be improved with comparable area overhead to existing hardening techniques. Moreover it does not require any extra-memory, does not impact the memory effective size, and has no negative impact on performance. © 2016 IEEE.Item Open Access Optimization-based power and thermal management for dark silicon aware 3D chip multiprocessors using heterogeneous cache hierarchy(Elsevier BV, 2017) Asad, A.; Ozturk, O.; Fathy, M.; Jahed-Motlagh, M. R.Management of a problem recently known as “dark silicon” is a new challenge in multicore designs. Prior innovative studies have addressed the dark silicon problem in the fields of power-efficient core design. However, addressing dark silicon challenges in uncore component designs such as cache hierarchy, on-chip interconnect etc. that consume significant portion of the on-chip power consumption is largely unexplored. In this paper, for the first time, we propose an integrated approach which considers the impact of power consumption of core and uncore components simultaneously to improve multi/many-core performance in the dark silicon era. The proposed approach dynamically (1) predicts the changing program behavior on each core; (2) re-determines frequency/voltage, cache capacity and technology in each level of the cache hierarchy based on the program's scalability in order to satisfy the power and temperature constraints. In the proposed architecture, for future chip-multiprocessors (CMPs), we exploit emerging technologies such as non-volatile memories (NVMs) and 3D techniques to combat dark silicon. Also, for the first time, we propose a detailed power model which is useful for future dark silicon CMPs power modeling. Experimental results on SPEC 2000/2006 benchmarks show that the proposed method improves throughput by about 54.3% and energy-delay product by about 61% on average, respectively, in comparison with the conventional CMP architecture with homogenous cache system. (A preliminary short version of this work was presented in the 18th Euromicro Conference on Digital System Design (DSD), 2015.) © 2017 Elsevier B.V.Item Open Access Optimizing shared cache behavior of chip multiprocessors(ACM, 2009-12) Kandemir, M.; Muralidhara, S. P.; Narayanan, S. H. K.; Zhang, Y.; Öztürk, ÖzcanOne of the critical problems associated with emerging chip multiprocessors (CMPs) is the management of on-chip shared cache space. Unfortunately, single processor centric data locality optimization schemes may not work well in the CMP case as data accesses from multiple cores can create conflicts in the shared cache space. The main contribution of this paper is a compiler directed code restructuring scheme for enhancing locality of shared data in CMPs. The proposed scheme targets the last level shared cache that exist in many commercial CMPs and has two components, namely, allocation, which determines the set of loop iterations assigned to each core, and scheduling, which determines the order in which the iterations assigned to a core are executed. Our scheme restructures the application code such that the different cores operate on shared data blocks at the same time, to the extent allowed by data dependencies. This helps to reduce reuse distances for the shared data and improves on-chip cache performance. We evaluated our approach using the Splash-2 and Parsec applications through both simulations and experiments on two commercial multi-core machines. Our experimental evaluation indicates that the proposed data locality optimization scheme improves inter-core conflict misses in the shared cache by 67% on average when both allocation and scheduling are used. Also, the execution time improvements we achieve (29% on average) are very close to the optimal savings that could be achieved using a hypothetical scheme. Copyright 2009 ACM.Item Open Access Prefetch throttling and data pinning for improving performance of shared caches(IEEE, 2008-11) Öztürk, Özcan.; Son, S. W.; Kandemir, M.; Karaköy, M.In this paper, we (i) quantify the impact of compilerdirected I/O prefetching on shared caches at I/O nodes. The experimental data collected shows that while I/O prefetching brings some benefits, its effectiveness reduces significantly as the number of clients (compute nodes) is increased; (ii) identify interclient misses due to harmful I/O prefetches as one of the main sources for this reduction in performance with increased number of clients; and (iii) propose and experimentally evaluate prefetch throttling and data pinning schemes to improve performance of I/O prefetching. Prefetch throttling prevents one or more clients from issuing further prefetches if such prefetches are predicted to be harmful, i.e., replace from the memory cache the useful data accessed by other clients. Data pinning on the other hand makes selected data blocks immune to harmful prefetches by pinning them in the memory cache. We show that these two schemes can be applied in isolation or combined together, and they can be applied at a coarse or fine granularity. Our experiments with these two optimizations using four disk-intensive applications reveal that they can improve performance by 9.7% and 15.1% on average, over standard compiler-directed I/O prefetching and no-prefetch case, respectively, when 8 clients are used. © 2008 IEEE.Item Open Access Profiler and compiler assisted adaptive I/O prefetching for shared storage caches(ACM, 2008-10) Son, S. W.; Kandemir, M.; Kolcu, I.; Muralidhara, S. P.; Öztürk, Öztürk; Karakoy, M.I/O prefetching has been employed in the past as one of the mech- anisms to hide large disk latencies. However, I/O prefetching in parallel applications is problematic when multiple CPUs share the same set of disks due to the possibility that prefetches from different CPUs can interact on shared memory caches in the I/O nodes in complex and unpredictable ways. In this paper, we (i) quantify the impact of compiler-directed I/O prefetching - developed originally in the context of sequential execution - on shared caches at I/O nodes. The experimental data collected shows that while I/O prefetching brings benefits, its effectiveness reduces significantly as the number of CPUs is increased; (ii) identify inter-CPU misses due to harmful prefetches as one of the main sources for this re- duction in performance with the increased number of CPUs; and (iii) propose and experimentally evaluate a profiler and compiler assisted adaptive I/O prefetching scheme targeting shared storage caches. The proposed scheme obtains inter-thread data sharing information using profiling and, based on the captured data sharing patterns, divides the threads into clusters and assigns a separate (customized) I/O prefetcher thread for each cluster. In our approach, the compiler generates the I/O prefetching threads automatically. We implemented this new I/O prefetching scheme using a compiler and the PVFS file system running on Linux, and the empirical data collected clearly underline the importance of adapting I/O prefetching based on program phases. Specifically, our pro- posed scheme improves performance, on average, by 19.9%, 11.9% and http://dx.doi.org/10.3% over the cases without I/O prefetching, with independent I/O prefetching (each CPU is performing compiler-directed I/O prefetching independently), and with one CPU prefetching (one CPU is reserved for prefetching on behalf of others), respectively, when 8 CPUs are used. Copyright 2008 ACM.Item Open Access Shared scratch pad memory space management across applications(Inderscience Publishers, 2009) Ozturk, Ozcan; Kandemir, M.; Son, S. W.; Kolcu, I.Scratch Pad Memories (SPMs) have received considerable attention lately as on-chip memory building blocks. The main characteristic that distinguishes an SPM from a conventional cache memory is that the data flow is controlled by software. The main focus of this paper is the management of an SPM space shared by multiple applications that can potentially share data. The proposed approach has three major components; a compiler analysis phase, a runtime space partitioner, and a local partitioning phase. Our experimental results show that the proposed approach leads to minimum completion time among all alternate memory partitioning schemes tested.