Browsing by Subject "Microprocessor chips"

Now showing 1 - 18 of 18

Open Access
Adaptive prefetching for shared cache based chip multiprocessors
(IEEE, 2009-04) Kandemir, M.; Zhang, Y.; Öztürk, Özcan
Chip multiprocessors (CMPs) present a unique scenario for software data prefetching with subtle tradeoffs between memory bandwidth and performance. In a shared L2 based CMP, multiple cores compete for the shared on-chip cache space and limited off-chip pin bandwidth. Purely software based prefetching techniques tend to increase this contention, leading to degradation in performance. In some cases, prefetches can become harmful by kicking out useful data from the shared cache whose next usage is earlier than the prefetched data, and the fraction of such harmful prefetches usually increases when we increase the number of cores used for executing a multi-threaded application code. In this paper, we propose two complementary techniques to address the problem of harmful prefetches in the context of shared L2 based CMPs. These techniques, namely, suppressing select data prefetches (if they are found to be harmful) and pinning select data in the L2 cache (if they are found to be frequent victim of harmful prefetches), are evaluated in this paper using two embedded application codes. Our experiments demonstrate that these two techniques are very effective in mitigating the impact of harmful prefetches, and as a result, we extract significant benefits from software prefetching even with large core counts. © 2009 EDAA.
Open Access
Application-specific heterogeneous network-on-chip design
(Oxford University Press, 2014) Demirbas, D.; Akturk, I.; Ozturk, O.; Güdükbay, Uğur
As a result of increasing communication demands, application-specific and scalable Network-on-Chips (NoCs) have emerged to connect processing cores and subsystems in Multiprocessor System-on-Chips. A challenge in application-specific NoC design is to find the right balance among different tradeoffs, such as communication latency, power consumption and chip area. We propose a novel approach that generates latency-aware heterogeneous NoC topology. Experimental results show that our approach improves the total communication latency up to 27% with modest power consumption. © 2013 The Author 2013. Published by Oxford University Press on behalf of The British Computer Society.
Open Access
Code scheduling for optimizing parallelism and data locality
(Springer, 2010-08-09) Yemliha, T.; Kandemir, M.; Öztürk, Özcan; Kultursay, E.; Muralidhara, S. P.
As chip multiprocessors proliferate, programming support for these devices is likely to receive a lot of attention in the near future. Parallelism and data locality are two critical issues in a chip multiprocessor environment. Unfortunately, most of the published work in the literature focuses only on one of these problems, and this can prevent one from achieving the best possible performance. The main goal of this paper is to propose and evaluate a compiler-directed code parallelization scheme, which considers both parallelism and data locality at the same time. Our compiler captures the inherent parallelism and data reuse in the application code being analyzed using a novel representation called the locality-parallelism graph (LPG). Our partitioning/scheduling algorithm assigns the nodes of this graph to the processors in the architecture and schedules them for execution. We implemented this algorithm and evaluated its effectiveness using a set of benchmark codes. The results collected so far indicate that our approach improves overall execution latency significantly. In this paper, we also introduce an ILP (Integer Linear Programming) based formulation of the problem, and implement the schedule obtained by the ILP solver. The results indicate that our approach gets within 4% of the ILP solution. © 2010 Springer-Verlag.
Open Access
An efficient computation model for coarse grained reconfigurable architectures and its applications to a reconfigurable computer
(IEEE, 2010-07) Atak, Oğuzhan; Atalar, Abdullah
The mapping of high level applications onto the coarse grained reconfigurable architectures (CGRA) are usually performed manually by using graphical tools or when automatic compilation is used, some restrictions are imposed to the high level code. Since high level applications do not contain parallelism explicitly, mapping the application directly to CGRA is very difficult. In this paper, we present a middle level Language for Reconfigurable Computing (LRC). LRC is similar to assembly languages of microprocessors, with the difference that parallelism can be coded in LRC. LRC is an efficient language for describing control data flow graphs. Several applications such as FIR, multirate, multichannel filtering, FFT, 2D-IDCT, Viterbi decoding, UMTS and CCSDC turbo decoding, Wimax LDPC decoding are coded in LRC and mapped to the Bilkent Reconfigurable Computer with a performance (in terms of cycle count) close to that of ASIC implementations. The applicability of the computation model to a CGRA having low cost interconnection network has been validated by using placement and routing algorithms. © 2010 IEEE.
Open Access
Energy efficient architecture for graph analytics accelerators
(IEEE, 2016-06) Özdal, Muhammet Mustafa; Yeşil, Şerif; Kim, T.; Ayupov, A.; Greth, J.; Burns, S.; Öztürk, Özcan
Specialized hardware accelerators can significantly improve the performance and power efficiency of compute systems. In this paper, we focus on hardware accelerators for graph analytics applications and propose a configurable architecture template that is specifically optimized for iterative vertex-centric graph applications with irregular access patterns and asymmetric convergence. The proposed architecture addresses the limitations of the existing multi-core CPU and GPU architectures for these types of applications. The SystemC-based template we provide can be customized easily for different vertex-centric applications by inserting application-level data structures and functions. After that, a cycle-accurate simulator and RTL can be generated to model the target hardware accelerators. In our experiments, we study several graph-parallel applications, and show that the hardware accelerators generated by our template can outperform a 24 core high end server CPU system by up to 3x in terms of performance. We also estimate the area requirement and power consumption of these hardware accelerators through physical-aware logic synthesis, and show up to 65x better power consumption with significantly smaller area. © 2016 IEEE.
Open Access
Fault-tolerant irregular topology design method for network-on-chips
(IEEE, 2014) Tosun, S.; Ajabshir V.B.; Mercanoglu O.; Öztürk, Özcan
As the technology sizes of integrated circuits (ICs) scale down rapidly, current transistor densities on chips dramatically increase. While nanometer feature sizes allow denser chip designs in each technology generation, fabricated ICs become more susceptible to wear-outs, causing operation failure. Even a single link failure within an on-chip fabric can halt communication between application blocks, which makes the entire chip useless. In this study, we aim to make faulty chips designed with Network-on-Chip (NoC) communication usable. Specifically, we present a fault-tolerant irregular topology generation method for application specific NoC designs. Designed NoC topology allows a different routing path if there is a link failure on the default routing. We compare fault-tolerant topologies with regular fault-tolerant ring topologies, and non-fault-tolerant application specific irregular topologies on energy consumption, performance, and area using multimedia benchmarks and custom-generated graphs. © 2014 IEEE.
Open Access
Fault-tolerant topology generation method for application-specific network-on-chips
(Institute of Electrical and Electronics Engineers, 2015) Tosun, S.; Ajabshir, V. B.; Mercanoglu, O.; Ozturk, O.
As the technology sizes of integrated circuits (ICs) scale down rapidly, current transistor densities on chips dramatically increase. While nanometer feature sizes allow denser chip designs in each technology generation, fabricated ICs become more susceptible to wear-outs, causing operation failure. Even a single link failure within an on-chip fabric can halt communication between application blocks, which makes the entire chip useless. In this paper, we aim to make faulty chips designed with network-on-chip (NoC) communication usable. Specifically, we present fault-tolerant irregular topology-generation method for application-specific NoC designs. Designed NoC topology allows different routing path if there is a link failure on the default routing path. Additionally, we present a simulated annealing-based application mapping algorithm aiming to minimize total energy consumption of the NoC design. We compare fault-tolerant topologies with nonfault-tolerant application-specific irregular topologies on energy consumption, performance, and area using multimedia benchmarks and custom-generated graphs. Our results demonstrate that our method is able to determine fault-tolerant topologies with negligible area increase and better energy values.
Open Access
Heterogeneous network-on-chip design through evolutionary computing
(Taylor & Francis, 2010) Ozturk, O.; Demirbas, D.
This article explores the use of biologically inspired evolutionary computational techniques for designing and optimising heterogeneous network-on-chip (NoC) architectures, where the nodes of the NoC-based chip multiprocessor exhibit different properties such as performance, energy, temperature, area and communication bandwidth. Focusing primarily on array-dominated applications and heterogeneous execution environments, the proposed approach tries to optimise the distribution of the nodes for a given NoC area under the constraints present in the environment. This article is the first one, to our knowledge, that explores the possibility of employing evolutionary computational techniques for optimally placing the heterogeneous nodes in an NoC. We also compare our approach with an optimal integer linear programming (ILP) approach using a commercial ILP tool. The results collected so far are very encouraging and indicate that the proposed approach generates close results to the ILP-based approach with minimal execution latencies. © 2010 Taylor & Francis.
Open Access
An ILP formulation for application mapping onto Network-on-Chips
(IEEE, 2009) Tosun, S.; Öztürk, Özcan; Ozen, M.
Ever shrinking technologies in VLSI era made it possible to place several modules onto a single die. However, the need for the new communication methods has also increased dramatically since traditional bus-based systems suffer from signal propagation delays, signal integrity, and scalability. Network-on-Chip (NoC) is the biggest step towards the communication bottleneck of System-on-Chip (SoC) architectures. In this paper, we present an Integer Linear Programming (ILP) formulation for application mapping onto mesh based Network-on-Chips to minimize the energy consumption of the system. The proposed method obtains optimal or close to optimal results within the given computation time limit. We also experimentally investigate the impact of the size of the mesh architecture on the application mapping and total communication. ©2009 IEEE.
Open Access
ILP-based communication reduction for heterogeneous 3D network-on-chips
(IEEE, 2013-02-03) Aktürk, İsmail; Öztürk, Özcan
Network-on-Chip (NoC) architectures and three-dimensional integrated circuits (3D ICs) have been introduced as attractive options for overcoming the barriers in interconnect scaling while increasing the number of cores. Combining these two approaches is expected to yield better performance and higher scalability. This paper explores the possibility of combining these two techniques in a heterogeneity aware fashion. We explore how heterogeneous processors can be mapped onto the given 3D chip area to minimize the data access costs. Our initial results indicate that the proposed approach generates promising results within tolerable solution times. © 2013 IEEE.
Open Access
Low power UWB transceiver design using dynamic voltage scaling
(IEEE, 2007-03) Garg, R.; Chunjie, D.; Jinyun, Z.; Gezici, Sinan
Low power consumption is a critical issue in many UWB systems. In this paper, we investigate the application of dynamic voltage scaling (DVS) and other low power design techniques to a multiband-OFDM UWB transceiver baseband circuit design in order to reduce average power consumption of the chip. Our results show significant power savings over the conventional approach. © 2007 IEEE.
Open Access
Multicore education through simulation
(IEEE, 2009-07) Öztürk, Özcan
This paper presents the experiences using a commercial full system simulation platform - Simics - in a graduate Chip Multiprocessors class. The Simics platform enables students and researchers to do research on computer architecture, operating systems, and hardware/software cosimulation. It provides the ability to simulate machines that are not physically available. This platform has been used in Chip Multiprocessors course to help graduate and undergraduate students in related areas. This course deals with both hardware and software issues in Chip Multiprocessors, and concludes with a team project at the end of the semester. The simulation-based approach was successful when student feedback and final projects are considered. ©2009 IEE.
Open Access
On-chip memory space partitioning for chip multiprocessors using polyhedral algebra
(The Institution of Engineering and Technology, 2010) Ozturk, O.; Kandemir, M.; Irwin, M. J.
One of the most important issues in designing a chip multiprocessor is to decide its on-chip memory organisation. While it is possible to design an application-specific memory architecture, this may not necessarily be the best option, in particular when storage demands of individual processors and/or their data sharing patterns can change from one point in execution to another for the same application. Here, two problems are formulated. First, we show how a polyhedral method can be used to design, for array-based data-intensive embedded applications, an application-specific hybrid memory architecture that has both shared and private components. We evaluate the resulting memory configurations using a set of benchmarks and compare them to pure private and pure shared memory on-chip multiprocessor architectures. The second approach proposed consider dynamic configuration of software-managed on-chip memory space to adapt to the runtime variations in data storage demand and interprocessor sharing patterns. The proposed framework is fully implemented using an optimising compiler, a polyhedral tool, and a memory partitioner (based on integer linear programming), and is tested using a suite of eight data-intensive embedded applications. © 2010 © The Institution of Engineering and Technology.
Open Access
Optimizing shared cache behavior of chip multiprocessors
(ACM, 2009-12) Kandemir, M.; Muralidhara, S. P.; Narayanan, S. H. K.; Zhang, Y.; Öztürk, Özcan
One of the critical problems associated with emerging chip multiprocessors (CMPs) is the management of on-chip shared cache space. Unfortunately, single processor centric data locality optimization schemes may not work well in the CMP case as data accesses from multiple cores can create conflicts in the shared cache space. The main contribution of this paper is a compiler directed code restructuring scheme for enhancing locality of shared data in CMPs. The proposed scheme targets the last level shared cache that exist in many commercial CMPs and has two components, namely, allocation, which determines the set of loop iterations assigned to each core, and scheduling, which determines the order in which the iterations assigned to a core are executed. Our scheme restructures the application code such that the different cores operate on shared data blocks at the same time, to the extent allowed by data dependencies. This helps to reduce reuse distances for the shared data and improves on-chip cache performance. We evaluated our approach using the Splash-2 and Parsec applications through both simulations and experiments on two commercial multi-core machines. Our experimental evaluation indicates that the proposed data locality optimization scheme improves inter-core conflict misses in the shared cache by 67% on average when both allocation and scheduling are used. Also, the execution time improvements we achieve (29% on average) are very close to the optimal savings that could be achieved using a hypothetical scheme. Copyright 2009 ACM.
Open Access
Process variation aware thread mapping for chip multiprocessors
(IEEE, 2009-04) Hong, S.; Narayanan, S. H. K.; Kandemir, M.; Özturk, Özcan
With the increasing scaling of manufacturing technology, process variation is a phenomenon that has become more prevalent. As a result, in the context of Chip Multiprocessors (CMPs) for example, it is possible that identically-designed processor cores on the chip have non-identical peak frequencies and power consumptions. To cope with such a design, each processor can be assumed to run at the frequency of the slowest processor, resulting in wasted computational capability. This paper considers an alternate approach and proposes an algorithm that intelligently maps (and remaps) computations onto available processors so that each processor runs at its peak frequency. In other words, by dynamically changing the thread-to-processor mapping at runtime, our approach allows each processor to maximize its performance, rather than simply using chip-wide lowest frequency amongst all cores and highest cache latency. Experimental evidence shows that, as compared to a process variation agnostic thread mapping strategy, our proposed scheme achieves as much as 29% improvement in overall execution latency, average improvement being 13% over the benchmarks tested. We also demonstrate in this paper that our savings are consistent across different processor counts, latency maps, and latency distributions.With the increasing scaling of manufacturing technology, process variation is a phenomenon that has become more prevalent. As a result, in the context of Chip Multiprocessors (CMPs) for example, it is possible that identically-designed processor cores on the chip have non-identical peak frequencies and power consumptions. To cope with such a design, each processor can be assumed to run at the frequency of the slowest processor, resulting in wasted computational capability. This paper considers an alternate approach and proposes an algorithm that intelligently maps (and remaps) computations onto available processors so that each processor runs at its peak frequency. In other words, by dynamically changing the thread-to-processor mapping at runtime, our approach allows each processor to maximize its performance, rather than simply using chip-wide lowest frequency amongst all cores and highest cache latency. Experimental evidence shows that, as compared to a process variation agnostic thread mapping strategy, our proposed scheme achieves as much as 29% improvement in overall execution latency, average improvement being 13% over the benchmarks tested. We also demonstrate in this paper that our savings are consistent across different processor counts, latency maps, and latency distributions. © 2009 EDAA.
Open Access
A scratch-pad memory aware dynamic loop scheduling algorithm
(IEEE, 2008-03) Öztürk, Özcan; Kandemir, M.; Narayanan, S. H. K.
Executing array based applications on a chip multiprocessor requires effective loop parallelization techniques. One of the critical issues that need to be tackled by an optimizing compiler in this context is loop scheduling, which distributes the iterations of a loop to be executed in parallel across the available processors. Most of the existing work in this area targets cache based execution platforms. In comparison, this paper proposes the first dynamic loop scheduler, to our knowledge, that targets scratch-pad memory (SPM) based chip multiprocessors, and presents an experimental evaluation of it. The main idea behind our approach is to identify the set of loop iterations that access the SPM and those that do not. This information is exploited at runtime to balance the loads of the processors involved in executing the loop nest at hand. Therefore, the proposed dynamic scheduler takes advantage of the SPM in performing the loop iteration-to-processor mapping. Our experimental evaluation with eight array/loop intensive applications reveals that the proposed scheduler is very effective in practice and brings between 13.7% and 41.7% performance savings over a static loop scheduling scheme, which is also tested in our experiments. © 2008 IEEE.
Open Access
Shared scratch pad memory space management across applications
(Inderscience Publishers, 2009) Ozturk, Ozcan; Kandemir, M.; Son, S. W.; Kolcu, I.
Scratch Pad Memories (SPMs) have received considerable attention lately as on-chip memory building blocks. The main characteristic that distinguishes an SPM from a conventional cache memory is that the data flow is controlled by software. The main focus of this paper is the management of an SPM space shared by multiple applications that can potentially share data. The proposed approach has three major components; a compiler analysis phase, a runtime space partitioner, and a local partitioning phase. Our experimental results show that the proposed approach leads to minimum completion time among all alternate memory partitioning schemes tested.
Open Access
Using data compression for increasing memory system utilization
(Institute of Electrical and Electronics Engineers, 2009-06) Ozturk, O.; Kandemir, M.; Irwin, M. J.
The memory system presents one of the critical challenges in embedded system design and optimization. This is mainly due to the ever-increasing code complexity of embedded applications and the exponential increase seen in the amount of data they manipulate. The memory bottleneck is even more important for multiprocessor-system-on-a-chip (MPSoC) architectures due to the high cost of off-chip memory accesses in terms of both energy and performance. As a result, reducing the memory-space occupancy of embedded applications is very important and will be even more important in the next decade. While it is true that the on-chip memory capacity of embedded systems is continuously increasing, the increases in the complexity of embedded applications and the sizes of the data sets they process are far greater. Motivated by this observation, this paper presents and evaluates a compiler-driven approach to data compression for reducing memory-space occupancy. Our goal is to study how automated compiler support can help in deciding the set of data elements to compress/ decompress and the points during execution at which these compressions/decompressions should be performed. We first study this problem in the context of single-core systems and then extend it to MPSoCs where we schedule compressions and decompressions intelligently such that they do not conflict with application execution as much as possible. Particularly, in MPSoCs, one needs to decide which processors should participate in the compression and decompression activities at any given point during the course of execution. We propose both static and dynamic algorithms for this purpose. In the static scheme, the processors are divided into two groups: those performing compression/ decompression and those executing the application, and this grouping is maintained throughout the execution of the application. In the dynamic scheme, on the other hand, the execution starts with some grouping but this grouping can change during the course of execution, depending on the dynamic variations in the data access pattern. Our experimental results show that, in a single-core system, the proposed approach reduces maximum memory occupancy by 47.9% and average memory occupancy by 48.3% when averaged over all the benchmarks. Our results also indicate that, in an MPSoC, the average energy saving is 12.7% when all eight benchmarks are considered. While compressions and decompressions and related bookkeeping activities take extra cycles and memory space and consume additional energy, we found that the improvements they bring from the memory space, execution cycles, and energy perspectives are much higher than these overheads. © 2009 IEEE.