Browsing by Subject "Program compilers"

Now showing 1 - 16 of 16

Open Access
Auto-parallelizing stateful distributed streaming applications
(2012) Schneider, S.; Hirzel, M.; Gedik, Buğra; Wu, K. -L.
Streaming applications transform possibly infinite streams of data and often have both high throughput and low latency requirements. They are comprised of operator graphs that produce and consume data tuples. The streaming programming model naturally exposes task and pipeline parallelism, enabling it to exploit parallel systems of all kinds, including large clusters. However, it does not naturally expose data parallelism, which must instead be extracted from streaming applications. This paper presents a compiler and runtime system that automatically extract data parallelism for distributed stream processing. Our approach guarantees safety, even in the presence of stateful, selective, and userdefined operators. When constructing parallel regions, the compiler ensures safety by considering an operator's selectivity, state, partitioning, and dependencies on other operators in the graph. The distributed runtime system ensures that tuples always exit parallel regions in the same order they would without data parallelism, using the most efficient strategy as identified by the compiler. Our experiments using 100 cores across 14 machines show linear scalability for standard parallel regions, and near linear scalability when tuples are shuffled across parallel regions. Copyright © 2012 by the Association for Computing Machinery, Inc. (ACM).
Open Access
Code scheduling for optimizing parallelism and data locality
(Springer, 2010-08-09) Yemliha, T.; Kandemir, M.; Öztürk, Özcan; Kultursay, E.; Muralidhara, S. P.
As chip multiprocessors proliferate, programming support for these devices is likely to receive a lot of attention in the near future. Parallelism and data locality are two critical issues in a chip multiprocessor environment. Unfortunately, most of the published work in the literature focuses only on one of these problems, and this can prevent one from achieving the best possible performance. The main goal of this paper is to propose and evaluate a compiler-directed code parallelization scheme, which considers both parallelism and data locality at the same time. Our compiler captures the inherent parallelism and data reuse in the application code being analyzed using a novel representation called the locality-parallelism graph (LPG). Our partitioning/scheduling algorithm assigns the nodes of this graph to the processors in the architecture and schedules them for execution. We implemented this algorithm and evaluated its effectiveness using a set of benchmark codes. The results collected so far indicate that our approach improves overall execution latency significantly. In this paper, we also introduce an ILP (Integer Linear Programming) based formulation of the problem, and implement the schedule obtained by the ILP solver. The results indicate that our approach gets within 4% of the ILP solution. © 2010 Springer-Verlag.
Open Access
Compiler-directed energy reduction using dynamic voltage scaling and voltage islands for embedded systems
(Institute of Electrical and Electronics Engineers, 2013) Ozturk, O.; Kandemir, M.; Chen G.
Addressing power and energy consumption related issues early in the system design flow ensures good design and minimizes iterations for faster turnaround time. In particular, optimizations at software level, e.g., those supported by compilers, are very important for minimizing energy consumption of embedded applications. Recent research demonstrates that voltage islands provide the flexibility to reduce power by selectively shutting down the different regions of the chip and/or running the select parts of the chip at different voltage/frequency levels. As against most of the prior work on voltage islands that mainly focused on the architecture design and IP placement related issues, this paper studies the necessary software compiler support for voltage islands. Specifically, we focus on an embedded multiprocessor architecture that supports both voltage islands and control domains within these islands, and determine how an optimizing compiler can automatically map an embedded application onto this architecture. Such an automated support is critical since it is unrealistic to expect an application programmer to reach a good mapping correlating multiple factors such as performance and energy at the same time. Our experiments with the proposed compiler support show that our approach is very effective in reducing energy consumption. The experiments also show that the energy savings we achieve are consistent across a wide range of values of our major simulation parameters. © 1968-2012 IEEE.
Open Access
A decoupled local memory allocator
(Association for Computing Machinery, 2013) Diouf, B.; Hantaş, C.; Cohen, A.; Özturk, Ö.; Palsberg, J.
Compilers use software-controlled local memories to provide fast, predictable, and power-efficient access to critical data. We show that the local memory allocation for straight-line, or linearized programs is equivalent to a weighted interval-graph coloring problem. This problem is new when allowing a color interval to "wrap around," and we call it the submarine-building problem. This graph-theoretical decision problem differs slightly from the classical ship-building problem, and exhibits very interesting and unusual complexity properties. We demonstrate that the submarine-building problem is NP-complete, while it is solvable in linear time for not-so-proper interval graphs, an extension of the the class of proper interval graphs. We propose a clustering heuristic to approximate any interval graph into a not-so-proper interval graph, decoupling spill code generation from local memory assignment. We apply this heuristic to a large number of randomly generated interval graphs reproducing the statistical features of standard local memory allocation benchmarks, comparing with state-of-the-art heuristics. © 2013 ACM.
Open Access
Efficient vectorization of forward/backward substitutions in solving sparse linear equations
(IEEE, 1994) Aykanat, Cevdet; Özgü, Özlem; Güven, N.
Vector processors have promised an enormous increase in computing speed for computationally intensive and time-critical power system problems which require the repeated solution of sparse linear equations. Due to short vectors processed in these applications, standard sparsity-based algorithms need to be restructured for efficient vectorization. This paper presents a novel data storage scheme and an efficient vectorization algorithm that exploits the intrinsic architectural features of vector computers such as sectioning and chaining. As the benchmark, the solution phase of the Fast Decoupled Load Flow algorithm is used in simulations. The relative performances of the proposed and existing vectorization schemes are evaluated, both theoretically and experimentally, on IBM 3090/VF.
Open Access
G-free: Defeating return-oriented programming through gadget-less binaries
(ACM, 2010-12) Onarlıoğlu, Kaan; Bilge, L.; Lanzi, A.; Balzarotti, D.; Kirda, E.
Despite the numerous prevention and protection mechanisms that have been introduced into modern operating systems, the exploitation of memory corruption vulnerabilities still represents a serious threat to the security of software systems and networks. A recent exploitation technique, called Return-Oriented Programming (ROP), has lately attracted a considerable attention from academia. Past research on the topic has mostly focused on refining the original attack technique, or on proposing partial solutions that target only particular variants of the attack. In this paper, we present G-Free, a compiler-based approach that represents the first practical solution against any possible form of ROP. Our solution is able to eliminate all unaligned free-branch instructions inside a binary executable, and to protect the aligned free-branch instructions to prevent them from being misused by an attacker. We developed a prototype based on our approach, and evaluated it by compiling GNU libc and a number of real-world applications. The results of the experiments show that our solution is able to prevent any form of return-oriented programming. © 2010 ACM.
Open Access
Hybrid stacked memory architecture for energy efficient embedded chip-multiprocessors based on compiler directed approach
(IEEE, 2015-12) Onsori, Salman; Asad, A.; Öztürk, Özcan; Fathy, M.
Energy consumption becomes the most critical limitation on the performance of nowadays embedded system designs. On-chip memories due to major contribution in overall system energy consumption are always significant issue for embedded systems. Using conventional memory technologies in future designs in nano-scale era causes a drastic increase in leakage power consumption and temperature-related problems. Emerging non-volatile memory (NVM) technologies are promising replacement for conventional memory structure in embedded systems due to its attractive characteristics such as near-zero leakage power, high density and non-volatility. Recent advantages of NVM technologies can significantly mitigate the issue of memory leakage power. However, they introduce new challenges such as limited write endurance and high write energy consumption which restrict them for adoption in modern memory systems. In this article, we propose a stacked hybrid memory system to minimize energy consumption for 3D embedded chip-multiprocessors (eCMP). For reaching this target, we present a convex optimization-based model to distribute data blocks between SRAM and NVM banks based on data access pattern derived by compiler. Our compiler-assisted hybrid memory architecture can achieve up to 51.28 times improvement in lifetime. In addition, experimental results show that our proposed method reduce energy consumption by 56% on average compared to the traditional memory design where single technology is used. © 2015 IEEE.
Open Access
Integrated scheduling and tool management in flexible manufacturing systems
(Taylor & Francis, 2001) Aktürk, M. S.; Özkan, S.
A multistage algorithm is proposed that will solve the scheduling problem in a flexible manufacturing system by considering the interrelated subproblems of processing time control, tool allocation and machining conditions optimization. The main objective of the proposed algorithm is to minimize total production cost consisting of tooling, operational and tardiness costs. The proposed integrated approach recognizes an important trade-off in automated manufacturing systems that has been largely unrecognized, and which is believed can be effectively exploited to improve production efficiency and lead to substantial cost reductions.
Open Access
Optimizing local memory allocation and assignment through a decoupled approach
(Springer, 2010-10) Diouf, B.; Öztürk, Özcan; Cohen, A.
Software-controlled local memories (LMs) are widely used to provide fast, scalable, power efficient and predictable access to critical data. While many studies addressed LM management, keeping hot data in the LM continues to cause major headache. This paper revisits LM management of arrays in light of recent progresses in register allocation, supporting multiple live-range splitting schemes through a generic integer linear program. These schemes differ in the grain of decision points. The model can also be extended to address fragmentation, assigning live ranges to precise offsets. We show that the links between LM management and register allocation have been underexploited, leaving much fundamental questions open and effective applications to be explored. © 2010 Springer-Verlag.
Open Access
Optimizing shared cache behavior of chip multiprocessors
(ACM, 2009-12) Kandemir, M.; Muralidhara, S. P.; Narayanan, S. H. K.; Zhang, Y.; Öztürk, Özcan
One of the critical problems associated with emerging chip multiprocessors (CMPs) is the management of on-chip shared cache space. Unfortunately, single processor centric data locality optimization schemes may not work well in the CMP case as data accesses from multiple cores can create conflicts in the shared cache space. The main contribution of this paper is a compiler directed code restructuring scheme for enhancing locality of shared data in CMPs. The proposed scheme targets the last level shared cache that exist in many commercial CMPs and has two components, namely, allocation, which determines the set of loop iterations assigned to each core, and scheduling, which determines the order in which the iterations assigned to a core are executed. Our scheme restructures the application code such that the different cores operate on shared data blocks at the same time, to the extent allowed by data dependencies. This helps to reduce reuse distances for the shared data and improves on-chip cache performance. We evaluated our approach using the Splash-2 and Parsec applications through both simulations and experiments on two commercial multi-core machines. Our experimental evaluation indicates that the proposed data locality optimization scheme improves inter-core conflict misses in the shared cache by 67% on average when both allocation and scheduling are used. Also, the execution time improvements we achieve (29% on average) are very close to the optimal savings that could be achieved using a hypothetical scheme. Copyright 2009 ACM.
Open Access
Profiler and compiler assisted adaptive I/O prefetching for shared storage caches
(ACM, 2008-10) Son, S. W.; Kandemir, M.; Kolcu, I.; Muralidhara, S. P.; Öztürk, Öztürk; Karakoy, M.
I/O prefetching has been employed in the past as one of the mech- anisms to hide large disk latencies. However, I/O prefetching in parallel applications is problematic when multiple CPUs share the same set of disks due to the possibility that prefetches from different CPUs can interact on shared memory caches in the I/O nodes in complex and unpredictable ways. In this paper, we (i) quantify the impact of compiler-directed I/O prefetching - developed originally in the context of sequential execution - on shared caches at I/O nodes. The experimental data collected shows that while I/O prefetching brings benefits, its effectiveness reduces significantly as the number of CPUs is increased; (ii) identify inter-CPU misses due to harmful prefetches as one of the main sources for this re- duction in performance with the increased number of CPUs; and (iii) propose and experimentally evaluate a profiler and compiler assisted adaptive I/O prefetching scheme targeting shared storage caches. The proposed scheme obtains inter-thread data sharing information using profiling and, based on the captured data sharing patterns, divides the threads into clusters and assigns a separate (customized) I/O prefetcher thread for each cluster. In our approach, the compiler generates the I/O prefetching threads automatically. We implemented this new I/O prefetching scheme using a compiler and the PVFS file system running on Linux, and the empirical data collected clearly underline the importance of adapting I/O prefetching based on program phases. Specifically, our pro- posed scheme improves performance, on average, by 19.9%, 11.9% and http://dx.doi.org/10.3% over the cases without I/O prefetching, with independent I/O prefetching (each CPU is performing compiler-directed I/O prefetching independently), and with one CPU prefetching (one CPU is reserved for prefetching on behalf of others), respectively, when 8 CPUs are used. Copyright 2008 ACM.
Open Access
Safe data parallelism for general streaming
(Institute of Electrical and Electronics Engineers, 2015) Schneider S.; Hirzel M.; Gedik, B.; Wu, Kun-Lung
Streaming applications process possibly infinite streams of data and often have both high throughput and low latency requirements. They are comprised of operator graphs that produce and consume data tuples. General streaming applications use stateful, selective, and user-defined operators. The stream programming model naturally exposes task and pipeline parallelism, enabling it to exploit parallel systems of all kinds, including large clusters. However, data parallelism must either be manually introduced by programmers, or extracted as an optimization by compilers. Previous data parallel optimizations did not apply to selective, stateful and user-defined operators. This article presents a compiler and runtime system that automatically extracts data parallelism for general stream processing. Data-parallelization is safe if the transformed program has the same semantics as the original sequential version. The compiler forms parallel regions while considering operator selectivity, state, partitioning, and graph dependencies. The distributed runtime system ensures that tuples always exit parallel regions in the same order they would without data parallelism, using the most efficient strategy as identified by the compiler. Our experiments using 100 cores across 14 machines show linear scalability for parallel regions that are computation-bound, and near linear scalability when tuples are shuffled across parallel regions.
Open Access
A scratch-pad memory aware dynamic loop scheduling algorithm
(IEEE, 2008-03) Öztürk, Özcan; Kandemir, M.; Narayanan, S. H. K.
Executing array based applications on a chip multiprocessor requires effective loop parallelization techniques. One of the critical issues that need to be tackled by an optimizing compiler in this context is loop scheduling, which distributes the iterations of a loop to be executed in parallel across the available processors. Most of the existing work in this area targets cache based execution platforms. In comparison, this paper proposes the first dynamic loop scheduler, to our knowledge, that targets scratch-pad memory (SPM) based chip multiprocessors, and presents an experimental evaluation of it. The main idea behind our approach is to identify the set of loop iterations that access the SPM and those that do not. This information is exploited at runtime to balance the loads of the processors involved in executing the loop nest at hand. Therefore, the proposed dynamic scheduler takes advantage of the SPM in performing the loop iteration-to-processor mapping. Our experimental evaluation with eight array/loop intensive applications reveals that the proposed scheduler is very effective in practice and brings between 13.7% and 41.7% performance savings over a static loop scheduling scheme, which is also tested in our experiments. © 2008 IEEE.
Open Access
Slicing based code parallelization for minimizing inter-processor communication
(ACM, 2009-10) Kandemir, M.; Zhang, Y.; Muralidhara, S. P.; Öztürk, Özcan; Narayanan, S. H. K.
One of the critical problems in distributed memory multi-core architectures is scalable parallelization that minimizes inter-processor communication. Using the concept of iteration space slicing, this paper presents a new code parallelization scheme for data-intensive applications. This scheme targets distributed memory multi-core architectures, and formulates the problem of data-computation distribution (partitioning) across parallel processors using slicing such that, starting with the partitioning of the output arrays, it iteratively determines the partitions of other arrays as well as iteration spaces of the loop nests in the application code. The goal is to minimize inter-processor data communications. Based on this iteration space slicing based formulation of the problem, we also propose a solution scheme. The proposed data-computation scheme is evaluated using six data-intensive benchmark programs. In our experimental evaluation, we also compare this scheme against three alternate data-computation distribution schemes. The results obtained are very encouraging, indicating around 10% better speedup, with 16 processors, over the next-best scheme when averaged over all benchmark codes we tested. Copyright 2009 ACM.
Open Access
SPM management using markov chain based data access prediction
(IEEE, 2008-11) Yemliha, T.; Srikantaiah, S.; Kandemir, M.; Öztürk, Özcan
Leveraging the power of scratchpad memories (SPMs) available in most embedded systems today is crucial to extract maximum performance from application programs. While regular accesses like scalar values and array expressions with affine subscript functions have been tractable for compiler analysis (to be prefetched into SPM), irregular accesses like pointer accesses and indexed array accesses have not been easily amenable for compiler analysis. This paper presents an SPM management technique using Markov chain based data access prediction for such irregular accesses. Our approach takes advantage of inherent, but hidden reuse in data accesses made by irregular references. We have implemented our proposed approach using an optimizing compiler. In this paper, we also present a thorough comparison of our different dynamic prediction schemes with other SPM management schemes. SPM management using our approaches produces 12.7% to 28.5% improvements in performance across a range of applications with both regular and irregular access patterns, with an average improvement of 20.8%.
Open Access
Using data compression for increasing memory system utilization
(Institute of Electrical and Electronics Engineers, 2009-06) Ozturk, O.; Kandemir, M.; Irwin, M. J.
The memory system presents one of the critical challenges in embedded system design and optimization. This is mainly due to the ever-increasing code complexity of embedded applications and the exponential increase seen in the amount of data they manipulate. The memory bottleneck is even more important for multiprocessor-system-on-a-chip (MPSoC) architectures due to the high cost of off-chip memory accesses in terms of both energy and performance. As a result, reducing the memory-space occupancy of embedded applications is very important and will be even more important in the next decade. While it is true that the on-chip memory capacity of embedded systems is continuously increasing, the increases in the complexity of embedded applications and the sizes of the data sets they process are far greater. Motivated by this observation, this paper presents and evaluates a compiler-driven approach to data compression for reducing memory-space occupancy. Our goal is to study how automated compiler support can help in deciding the set of data elements to compress/ decompress and the points during execution at which these compressions/decompressions should be performed. We first study this problem in the context of single-core systems and then extend it to MPSoCs where we schedule compressions and decompressions intelligently such that they do not conflict with application execution as much as possible. Particularly, in MPSoCs, one needs to decide which processors should participate in the compression and decompression activities at any given point during the course of execution. We propose both static and dynamic algorithms for this purpose. In the static scheme, the processors are divided into two groups: those performing compression/ decompression and those executing the application, and this grouping is maintained throughout the execution of the application. In the dynamic scheme, on the other hand, the execution starts with some grouping but this grouping can change during the course of execution, depending on the dynamic variations in the data access pattern. Our experimental results show that, in a single-core system, the proposed approach reduces maximum memory occupancy by 47.9% and average memory occupancy by 48.3% when averaged over all the benchmarks. Our results also indicate that, in an MPSoC, the average energy saving is 12.7% when all eight benchmarks are considered. While compressions and decompressions and related bookkeeping activities take extra cycles and memory space and consume additional energy, we found that the improvements they bring from the memory space, execution cycles, and energy perspectives are much higher than these overheads. © 2009 IEEE.