Browsing by Author "Kandemir, M."

Now showing 1 - 20 of 24

Open Access
Access pattern-based code compression for memory-constrained systems
(Association for Computing Machinery, 2008-09) Ozturk, O.; Kandemir, M.; Chen, G.
As compared to a large spectrum of performance optimizations, relatively less effort has been dedicated to optimize other aspects of embedded applications such as memory space requirements, power, real-time predictability, and reliability. In particular, many modern embedded systems operate under tight memory space constraints. One way of addressing this constraint is to compress executable code and data as much as possible. While researchers on code compression have studied efficient hardware and software based code compression strategies, many of these techniques do not take application behavior into account; that is, the same compression/decompression strategy is used irrespective of the application being optimized. This article presents an application-sensitive code compression strategy based on control flow graph (CFG) representation of the embedded program. The idea is to start with a memory image wherein all basic blocks of the application are compressed, and decompress only the blocks that are predicted to be needed in the near future. When the current access to a basic block is over, our approach also decides the point at which the block could be compressed. We propose and evaluate several compression and decompression strategies that try to reduce memory requirements without excessively increasing the original instruction cycle counts. Some of our strategies make use of profile data, whereas others are fully automatic. Our experimental evaluation using seven applications from the MediaBench suite and three large embedded applications reveals that the proposed code compression strategy is very successful in practice. Our results also indicate that working at a basic block granularity, as opposed to a procedure granularity, is important for maximizing memory space savings. © 2008 ACM.
Open Access
Adaptive prefetching for shared cache based chip multiprocessors
(IEEE, 2009-04) Kandemir, M.; Zhang, Y.; Öztürk, Özcan
Chip multiprocessors (CMPs) present a unique scenario for software data prefetching with subtle tradeoffs between memory bandwidth and performance. In a shared L2 based CMP, multiple cores compete for the shared on-chip cache space and limited off-chip pin bandwidth. Purely software based prefetching techniques tend to increase this contention, leading to degradation in performance. In some cases, prefetches can become harmful by kicking out useful data from the shared cache whose next usage is earlier than the prefetched data, and the fraction of such harmful prefetches usually increases when we increase the number of cores used for executing a multi-threaded application code. In this paper, we propose two complementary techniques to address the problem of harmful prefetches in the context of shared L2 based CMPs. These techniques, namely, suppressing select data prefetches (if they are found to be harmful) and pinning select data in the L2 cache (if they are found to be frequent victim of harmful prefetches), are evaluated in this paper using two embedded application codes. Our experiments demonstrate that these two techniques are very effective in mitigating the impact of harmful prefetches, and as a result, we extract significant benefits from software prefetching even with large core counts. © 2009 EDAA.
Open Access
Automatic segmentation of colon glands using object-graphs
(Elsevier BV, 2010) Gunduz Demir, C.; Kandemir, M.; Tosun, A. B.; Sokmensuer, C.
Gland segmentation is an important step to automate the analysis of biopsies that contain glandular structures. However, this remains a challenging problem as the variation in staining, fixation, and sectioning procedures lead to a considerable amount of artifacts and variances in tissue sections, which may result in huge variances in gland appearances. In this work, we report a new approach for gland segmentation. This approach decomposes the tissue image into a set of primitive objects and segments glands making use of the organizational properties of these objects, which are quantified with the definition of object-graphs. As opposed to the previous literature, the proposed approach employs the object-based information for the gland segmentation problem, instead of using the pixel-based information alone. Working with the images of colon tissues, our experiments demonstrate that the proposed object-graph approach yields high segmentation accuracies for the training and test sets and significantly improves the segmentation performance of its pixel-based counterparts. The experiments also show that the object-based structure of the proposed approach provides more tolerance to artifacts and variances in tissues. © 2009 Elsevier B.V. All rights reserved.
Open Access
A cache topology-aware multi-query scheduler for multicore architectures
(IEEE, 2014) Orhan, U.; Ding, W.; Yedlapalli, P.; Kandemir, M.; Öztürk, Özcan
Growing performance gap between processors and main memory has made it worthwhile to consider off-chip data accesses in multi-query processing [2], [1], [3]. Exploiting data-sharing opportunities among concurrent queries can be critical for effective utilization of the underlying shared memory hierarchy. Given a set of queries, there may be a common retrieval operation for several cases to the same data. A query can benefit from the data previously loaded into the shared cache/memory space by another query. However, if these queries are scheduled independently, it is very likely that the same data is brought from off-chip memory to on-chip caches multiple times, thereby consuming off-chip bandwidth and slowing down overall execution.
Open Access
Code scheduling for optimizing parallelism and data locality
(Springer, 2010-08-09) Yemliha, T.; Kandemir, M.; Öztürk, Özcan; Kultursay, E.; Muralidhara, S. P.
As chip multiprocessors proliferate, programming support for these devices is likely to receive a lot of attention in the near future. Parallelism and data locality are two critical issues in a chip multiprocessor environment. Unfortunately, most of the published work in the literature focuses only on one of these problems, and this can prevent one from achieving the best possible performance. The main goal of this paper is to propose and evaluate a compiler-directed code parallelization scheme, which considers both parallelism and data locality at the same time. Our compiler captures the inherent parallelism and data reuse in the application code being analyzed using a novel representation called the locality-parallelism graph (LPG). Our partitioning/scheduling algorithm assigns the nodes of this graph to the processors in the architecture and schedules them for execution. We implemented this algorithm and evaluated its effectiveness using a set of benchmark codes. The results collected so far indicate that our approach improves overall execution latency significantly. In this paper, we also introduce an ILP (Integer Linear Programming) based formulation of the problem, and implement the schedule obtained by the ILP solver. The results indicate that our approach gets within 4% of the ILP solution. © 2010 Springer-Verlag.
Open Access
Compiler directed network-on-chip reliability enhancement for chip multiprocessors
(Association for Computing Machinery, 2010-04) Ozturk, O.; Kandemir, M.; Irwin, M. J.; Narayanan, S.H. K.
Chip multiprocessors (CMPs) are expected to be the building blocks for future computer systems. While architecting these emerging CMPs is a challenging problem on its own, programming them is even more challenging. As the number of cores accommodated in chip multiprocessors increases, network-on-chip (NoC) type communication fabrics are expected to replace traditional point-to-point buses. Most of the prior software related work so far targeting CMPs focus on performance and power aspects. However, as technology scales, components of a CMP are being increasingly exposed to both transient and permanent hardware failures. This paper presents and evaluates a compiler-directed power-performance aware reliability enhancement scheme for network-on-chip (NoC) based chip multiprocessors (CMPs). The proposed scheme improves on-chip communication reliability by duplicating messages traveling across CMP nodes such that, for each original message, its duplicate uses a different set of communication links as much as possible (to satisfy performance constraint). In addition, our approach tries to reuse communication links across the different phases of the program to maximize link shutdown opportunities for the NoC (to satisfy power constraint). Our results show that the proposed approach is very effective in improving on-chip network reliability, without causing excessive power or performance degradation. In our experiments, we also evaluate the performance oriented and energy oriented versions of our compiler-directed reliability enhancement scheme, and compare it to two pure hardware based fault tolerant routing schemes. © 2010 ACM.
Open Access
Compiler-directed energy reduction using dynamic voltage scaling and voltage islands for embedded systems
(Institute of Electrical and Electronics Engineers, 2013) Ozturk, O.; Kandemir, M.; Chen G.
Addressing power and energy consumption related issues early in the system design flow ensures good design and minimizes iterations for faster turnaround time. In particular, optimizations at software level, e.g., those supported by compilers, are very important for minimizing energy consumption of embedded applications. Recent research demonstrates that voltage islands provide the flexibility to reduce power by selectively shutting down the different regions of the chip and/or running the select parts of the chip at different voltage/frequency levels. As against most of the prior work on voltage islands that mainly focused on the architecture design and IP placement related issues, this paper studies the necessary software compiler support for voltage islands. Specifically, we focus on an embedded multiprocessor architecture that supports both voltage islands and control domains within these islands, and determine how an optimizing compiler can automatically map an embedded application onto this architecture. Such an automated support is critical since it is unrealistic to expect an application programmer to reach a good mapping correlating multiple factors such as performance and energy at the same time. Our experiments with the proposed compiler support show that our approach is very effective in reducing energy consumption. The experiments also show that the energy savings we achieve are consistent across a wide range of values of our major simulation parameters. © 1968-2012 IEEE.
Open Access
Dynamic thread and data mapping for NoC based CMPs
(IEEE, 2009-07) Kandemir, M.; Öztürk, Özcan; Muralidhara, S. P.
Thread mapping and data mapping are two important problems in the context of NoC (network-on-chip) based CMPs (chip multiprocessors). While a compiler can determine suitable mappings for data and threads, such static mappings may not work well for multithreaded applications that go through different execution phases during their execution, each phase with potentially different data access patterns than others. Instead, a dynamic mapping strategy, if its overheads can be kept low, may be a more promising option. In this work, we present dynamic (runtime) thread and data mappings for NoC based CMPs. The goal of these mappings is to reduce the distance between the location of the core that requests data and the core whose local memory contains that requested data. In our experiments, we evaluate our proposed thread mapping and data mapping in isolation as well as in an integrated manner. Copyright 2009 ACM.
Open Access
ILP-based energy minimization techniques for banked memories
(Association for Computing Machinery, 2008-07) Ozturk, O.; Kandemir, M.
Main memories can consume a significant portion of overall energy in many data-intensive embedded applications. One way of reducing this energy consumption is banking, that is, dividing available memory space into multiple banks and placing unused (idle) memory banks into low-power operating modes. Prior work investigated code-restructuring- and data-layout-reorganization-based approaches for increasing the energy benefits that could be obtained from a banked memory architecture. This article explores different techniques that can potentially coexist within the same optimization framework for maximizing benefits of low-power operating modes. These techniques include employing nonuniform bank sizes, data migration, data compression, and data replication. By using these techniques, we try to increase the chances for utilizing low-power operating modes in a more effective manner, and achieve further energy savings over what could be achieved by exploiting low-power modes alone. Specifically, nonuniform banking tries to match bank sizes with application-data access patterns. The goal of data migration is to cluster data with similar access patterns in the same set of banks. Data compression reduces the size of the data used by an application, and thus helps reduce the number of memory banks occupied by data. Finally, data replication increases bank idleness by duplicating select read-only data blocks across banks. We formulate each of these techniques as an ILP (integer linear programming) problem, and solve them using a commercial solver. Our experimental analysis using several benchmarks indicates that all the techniques presented in this framework are successful in reducing memory energy consumption. Based on our experience with these techniques, we recommend to compiler writers for banked memories to consider data compression, replication, and migration. © 2008 ACM.
Open Access
Improving multicore system performance through data compression
(Wiley, 2017) Öztürk, Özcan; Kandemir, M.; Pllana, S; Xhafa, F.
As applications become more and more complex, it is becoming extremely important to have sufficient compute power on the chip. Multicore and many-core systems have been introduced to address this problem. This chapter considers the multicore architecture that is a shared multiprocessor-based system, where a certain number of processors share the same memory address space. It uses a loop nest-based code parallelization strategy for executing array-based applications in this multicore architecture. The chapter focuses on array-based codes mainly because they appear very frequently in scientific computing domain and embedded image/video processing domain. It explores two different strategies for dividing the available processors between compression/decompression and application execution. In static strategy a fixed number of processors are allocated for performing compression/decompression activity, and this allocation is not changed during the course of execution. The main idea behind dynamic strategy is to eliminate the optimal processor selection problem of the static approach.
Open Access
Object-oriented texture analysis for the unsupervised segmentation of biopsy images for cancer detection
(Elsevier BV, 2009-06) Tosun, A. B.; Kandemir, M.; Sokmensuer, C.; Gunduz Demir, C.
Staining methods routinely used in pathology lead to similar color distributions in the biologically different regions of histopathological images. This causes problems in image segmentation for the quantitative analysis and detection of cancer. To overcome this problem, unlike previous methods that use pixel distributions, we propose a new homogeneity measure based on the distribution of the objects that we define to represent tissue components. Using this measure, we demonstrate a new object-oriented segmentation algorithm. Working with colon biopsy images, we show that this algorithm segments the cancerous and normal regions with 94.89 percent accuracy on the average and significantly improves the segmentation accuracy compared to its pixel-based counterpart. © 2008 Elsevier Ltd. All rights reserved.
Open Access
On-chip memory space partitioning for chip multiprocessors using polyhedral algebra
(The Institution of Engineering and Technology, 2010) Ozturk, O.; Kandemir, M.; Irwin, M. J.
One of the most important issues in designing a chip multiprocessor is to decide its on-chip memory organisation. While it is possible to design an application-specific memory architecture, this may not necessarily be the best option, in particular when storage demands of individual processors and/or their data sharing patterns can change from one point in execution to another for the same application. Here, two problems are formulated. First, we show how a polyhedral method can be used to design, for array-based data-intensive embedded applications, an application-specific hybrid memory architecture that has both shared and private components. We evaluate the resulting memory configurations using a set of benchmarks and compare them to pure private and pure shared memory on-chip multiprocessor architectures. The second approach proposed consider dynamic configuration of software-managed on-chip memory space to adapt to the runtime variations in data storage demand and interprocessor sharing patterns. The proposed framework is fully implemented using an optimising compiler, a polyhedral tool, and a memory partitioner (based on integer linear programming), and is tested using a suite of eight data-intensive embedded applications. © 2010 © The Institution of Engineering and Technology.
Open Access
Optimizing shared cache behavior of chip multiprocessors
(ACM, 2009-12) Kandemir, M.; Muralidhara, S. P.; Narayanan, S. H. K.; Zhang, Y.; Öztürk, Özcan
One of the critical problems associated with emerging chip multiprocessors (CMPs) is the management of on-chip shared cache space. Unfortunately, single processor centric data locality optimization schemes may not work well in the CMP case as data accesses from multiple cores can create conflicts in the shared cache space. The main contribution of this paper is a compiler directed code restructuring scheme for enhancing locality of shared data in CMPs. The proposed scheme targets the last level shared cache that exist in many commercial CMPs and has two components, namely, allocation, which determines the set of loop iterations assigned to each core, and scheduling, which determines the order in which the iterations assigned to a core are executed. Our scheme restructures the application code such that the different cores operate on shared data blocks at the same time, to the extent allowed by data dependencies. This helps to reduce reuse distances for the shared data and improves on-chip cache performance. We evaluated our approach using the Splash-2 and Parsec applications through both simulations and experiments on two commercial multi-core machines. Our experimental evaluation indicates that the proposed data locality optimization scheme improves inter-core conflict misses in the shared cache by 67% on average when both allocation and scheduling are used. Also, the execution time improvements we achieve (29% on average) are very close to the optimal savings that could be achieved using a hypothetical scheme. Copyright 2009 ACM.
Open Access
Prefetch throttling and data pinning for improving performance of shared caches
(IEEE, 2008-11) Öztürk, Özcan.; Son, S. W.; Kandemir, M.; Karaköy, M.
In this paper, we (i) quantify the impact of compilerdirected I/O prefetching on shared caches at I/O nodes. The experimental data collected shows that while I/O prefetching brings some benefits, its effectiveness reduces significantly as the number of clients (compute nodes) is increased; (ii) identify interclient misses due to harmful I/O prefetches as one of the main sources for this reduction in performance with increased number of clients; and (iii) propose and experimentally evaluate prefetch throttling and data pinning schemes to improve performance of I/O prefetching. Prefetch throttling prevents one or more clients from issuing further prefetches if such prefetches are predicted to be harmful, i.e., replace from the memory cache the useful data accessed by other clients. Data pinning on the other hand makes selected data blocks immune to harmful prefetches by pinning them in the memory cache. We show that these two schemes can be applied in isolation or combined together, and they can be applied at a coarse or fine granularity. Our experiments with these two optimizations using four disk-intensive applications reveal that they can improve performance by 9.7% and 15.1% on average, over standard compiler-directed I/O prefetching and no-prefetch case, respectively, when 8 clients are used. © 2008 IEEE.
Open Access
Process variation aware thread mapping for chip multiprocessors
(IEEE, 2009-04) Hong, S.; Narayanan, S. H. K.; Kandemir, M.; Özturk, Özcan
With the increasing scaling of manufacturing technology, process variation is a phenomenon that has become more prevalent. As a result, in the context of Chip Multiprocessors (CMPs) for example, it is possible that identically-designed processor cores on the chip have non-identical peak frequencies and power consumptions. To cope with such a design, each processor can be assumed to run at the frequency of the slowest processor, resulting in wasted computational capability. This paper considers an alternate approach and proposes an algorithm that intelligently maps (and remaps) computations onto available processors so that each processor runs at its peak frequency. In other words, by dynamically changing the thread-to-processor mapping at runtime, our approach allows each processor to maximize its performance, rather than simply using chip-wide lowest frequency amongst all cores and highest cache latency. Experimental evidence shows that, as compared to a process variation agnostic thread mapping strategy, our proposed scheme achieves as much as 29% improvement in overall execution latency, average improvement being 13% over the benchmarks tested. We also demonstrate in this paper that our savings are consistent across different processor counts, latency maps, and latency distributions.With the increasing scaling of manufacturing technology, process variation is a phenomenon that has become more prevalent. As a result, in the context of Chip Multiprocessors (CMPs) for example, it is possible that identically-designed processor cores on the chip have non-identical peak frequencies and power consumptions. To cope with such a design, each processor can be assumed to run at the frequency of the slowest processor, resulting in wasted computational capability. This paper considers an alternate approach and proposes an algorithm that intelligently maps (and remaps) computations onto available processors so that each processor runs at its peak frequency. In other words, by dynamically changing the thread-to-processor mapping at runtime, our approach allows each processor to maximize its performance, rather than simply using chip-wide lowest frequency amongst all cores and highest cache latency. Experimental evidence shows that, as compared to a process variation agnostic thread mapping strategy, our proposed scheme achieves as much as 29% improvement in overall execution latency, average improvement being 13% over the benchmarks tested. We also demonstrate in this paper that our savings are consistent across different processor counts, latency maps, and latency distributions. © 2009 EDAA.
Open Access
Profiler and compiler assisted adaptive I/O prefetching for shared storage caches
(ACM, 2008-10) Son, S. W.; Kandemir, M.; Kolcu, I.; Muralidhara, S. P.; Öztürk, Öztürk; Karakoy, M.
I/O prefetching has been employed in the past as one of the mech- anisms to hide large disk latencies. However, I/O prefetching in parallel applications is problematic when multiple CPUs share the same set of disks due to the possibility that prefetches from different CPUs can interact on shared memory caches in the I/O nodes in complex and unpredictable ways. In this paper, we (i) quantify the impact of compiler-directed I/O prefetching - developed originally in the context of sequential execution - on shared caches at I/O nodes. The experimental data collected shows that while I/O prefetching brings benefits, its effectiveness reduces significantly as the number of CPUs is increased; (ii) identify inter-CPU misses due to harmful prefetches as one of the main sources for this re- duction in performance with the increased number of CPUs; and (iii) propose and experimentally evaluate a profiler and compiler assisted adaptive I/O prefetching scheme targeting shared storage caches. The proposed scheme obtains inter-thread data sharing information using profiling and, based on the captured data sharing patterns, divides the threads into clusters and assigns a separate (customized) I/O prefetcher thread for each cluster. In our approach, the compiler generates the I/O prefetching threads automatically. We implemented this new I/O prefetching scheme using a compiler and the PVFS file system running on Linux, and the empirical data collected clearly underline the importance of adapting I/O prefetching based on program phases. Specifically, our pro- posed scheme improves performance, on average, by 19.9%, 11.9% and http://dx.doi.org/10.3% over the cases without I/O prefetching, with independent I/O prefetching (each CPU is performing compiler-directed I/O prefetching independently), and with one CPU prefetching (one CPU is reserved for prefetching on behalf of others), respectively, when 8 CPUs are used. Copyright 2008 ACM.
Open Access
A scratch-pad memory aware dynamic loop scheduling algorithm
(IEEE, 2008-03) Öztürk, Özcan; Kandemir, M.; Narayanan, S. H. K.
Executing array based applications on a chip multiprocessor requires effective loop parallelization techniques. One of the critical issues that need to be tackled by an optimizing compiler in this context is loop scheduling, which distributes the iterations of a loop to be executed in parallel across the available processors. Most of the existing work in this area targets cache based execution platforms. In comparison, this paper proposes the first dynamic loop scheduler, to our knowledge, that targets scratch-pad memory (SPM) based chip multiprocessors, and presents an experimental evaluation of it. The main idea behind our approach is to identify the set of loop iterations that access the SPM and those that do not. This information is exploited at runtime to balance the loads of the processors involved in executing the loop nest at hand. Therefore, the proposed dynamic scheduler takes advantage of the SPM in performing the loop iteration-to-processor mapping. Our experimental evaluation with eight array/loop intensive applications reveals that the proposed scheduler is very effective in practice and brings between 13.7% and 41.7% performance savings over a static loop scheduling scheme, which is also tested in our experiments. © 2008 IEEE.
Open Access
Shared scratch pad memory space management across applications
(Inderscience Publishers, 2009) Ozturk, Ozcan; Kandemir, M.; Son, S. W.; Kolcu, I.
Scratch Pad Memories (SPMs) have received considerable attention lately as on-chip memory building blocks. The main characteristic that distinguishes an SPM from a conventional cache memory is that the data flow is controlled by software. The main focus of this paper is the management of an SPM space shared by multiple applications that can potentially share data. The proposed approach has three major components; a compiler analysis phase, a runtime space partitioner, and a local partitioning phase. Our experimental results show that the proposed approach leads to minimum completion time among all alternate memory partitioning schemes tested.
Open Access
Slicing based code parallelization for minimizing inter-processor communication
(ACM, 2009-10) Kandemir, M.; Zhang, Y.; Muralidhara, S. P.; Öztürk, Özcan; Narayanan, S. H. K.
One of the critical problems in distributed memory multi-core architectures is scalable parallelization that minimizes inter-processor communication. Using the concept of iteration space slicing, this paper presents a new code parallelization scheme for data-intensive applications. This scheme targets distributed memory multi-core architectures, and formulates the problem of data-computation distribution (partitioning) across parallel processors using slicing such that, starting with the partitioning of the output arrays, it iteratively determines the partitions of other arrays as well as iteration spaces of the loop nests in the application code. The goal is to minimize inter-processor data communications. Based on this iteration space slicing based formulation of the problem, we also propose a solution scheme. The proposed data-computation scheme is evaluated using six data-intensive benchmark programs. In our experimental evaluation, we also compare this scheme against three alternate data-computation distribution schemes. The results obtained are very encouraging, indicating around 10% better speedup, with 16 processors, over the next-best scheme when averaged over all benchmark codes we tested. Copyright 2009 ACM.
Open Access
SPM management using markov chain based data access prediction
(IEEE, 2008-11) Yemliha, T.; Srikantaiah, S.; Kandemir, M.; Öztürk, Özcan
Leveraging the power of scratchpad memories (SPMs) available in most embedded systems today is crucial to extract maximum performance from application programs. While regular accesses like scalar values and array expressions with affine subscript functions have been tractable for compiler analysis (to be prefetched into SPM), irregular accesses like pointer accesses and indexed array accesses have not been easily amenable for compiler analysis. This paper presents an SPM management technique using Markov chain based data access prediction for such irregular accesses. Our approach takes advantage of inherent, but hidden reuse in data accesses made by irregular references. We have implemented our proposed approach using an optimizing compiler. In this paper, we also present a thorough comparison of our different dynamic prediction schemes with other SPM management schemes. SPM management using our approaches produces 12.7% to 28.5% improvements in performance across a range of applications with both regular and irregular access patterns, with an average improvement of 20.8%.