Browsing by Subject "Parallel architectures"

Open Access
Architecture framework for mapping parallel algorithms to parallel computing platforms
(CEUR-WS, 2013) Tekinerdogan, Bedir; Arkin, E.
Mapping parallel algorithms to parallel computing platforms requires several activities, such as analyzing the parallel algorithm, defining the logical configuration of the platform, and mapping the algorithm to that logical configuration. Unfortunately, current parallel computing approaches do not seem to offer precise modeling support for the mapping process. The lack of a clear and precise modeling approach for parallel computing impedes the communication and analysis of the decisions for supporting the mapping of parallel algorithms to parallel computing platforms. In this paper we present an architecture framework for modeling the various views that are related to the mapping process. An architectural framework organizes and structures the proposed architectural viewpoints. We propose a coherent set of five viewpoints for supporting the mapping of parallel algorithms to parallel computing platforms. We illustrate the architecture framework for the mapping of an array increment algorithm to a parallel computing platform. Copyright © 2013 for the individual papers by the papers' authors.
Open Access
Auto-parallelizing stateful distributed streaming applications
(2012) Schneider, S.; Hirzel, M.; Gedik, Buğra; Wu, K. -L.
Streaming applications transform possibly infinite streams of data and often have both high throughput and low latency requirements. They consist of operator graphs that produce and consume data tuples. The streaming programming model naturally exposes task and pipeline parallelism, enabling it to exploit parallel systems of all kinds, including large clusters. However, it does not naturally expose data parallelism, which must instead be extracted from streaming applications. This paper presents a compiler and runtime system that automatically extract data parallelism for distributed stream processing. Our approach guarantees safety, even in the presence of stateful, selective, and user-defined operators. When constructing parallel regions, the compiler ensures safety by considering an operator's selectivity, state, partitioning, and dependencies on other operators in the graph. The distributed runtime system ensures that tuples always exit parallel regions in the same order they would without data parallelism, using the most efficient strategy as identified by the compiler. Our experiments using 100 cores across 14 machines show linear scalability for standard parallel regions, and near linear scalability when tuples are shuffled across parallel regions. Copyright © 2012 by the Association for Computing Machinery, Inc. (ACM).
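The ordering guarantee described in this abstract can be illustrated with a minimal sketch. It assumes each tuple is tagged with a sequence number before it enters the parallel region; this is one generic technique, not necessarily the strategy the paper's compiler selects. The function name `ordered_merge` and the `(seq, payload)` layout are hypothetical.

```python
import heapq
from typing import Any, Iterable, Iterator, List, Tuple


def ordered_merge(results: Iterable[Tuple[int, Any]]) -> Iterator[Any]:
    """Release payloads in sequence-number order.

    Assumption (not from the source): every tuple carries a unique
    sequence number 0, 1, 2, ... assigned before the parallel region.
    Results may arrive out of order; early arrivals are buffered.
    """
    pending: List[Tuple[int, Any]] = []  # min-heap keyed by sequence number
    next_seq = 0                         # next sequence number to release
    for seq, payload in results:
        heapq.heappush(pending, (seq, payload))
        # Drain the buffer while its smallest entry is the one we expect next.
        while pending and pending[0][0] == next_seq:
            yield heapq.heappop(pending)[1]
            next_seq += 1


# Results arriving out of order from hypothetical parallel workers.
arrivals = [(1, "b"), (0, "a"), (3, "d"), (2, "c")]
merged = list(ordered_merge(arrivals))
```

The buffer grows only while a slow worker holds back an earlier tuple, so memory use in this sketch depends on how far workers drift apart.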
Open Access
Balancing energy loads in wireless sensor networks through uniformly quantized energy levels-based clustering
(IEEE, 2010) Ali, Syed Amjad; Sevgi, Cüneyt; Kocyigit, A.
Clustering is considered a common and effective method to prolong the lifetime of a wireless sensor network. This paper provides a new insight into the cluster formation process based on uniformly quantizing the residual energy of the sensor nodes. The unified simulation framework provided herein helps reveal not only an optimum number of clusters but also the number of quantization levels required to maximize the network's lifetime by improving energy load balancing for both homogeneous and heterogeneous sensor networks. The provided simulation results clearly show that the uniformly quantized energy level-based clustering provides improved load balancing and, hence, a longer network lifetime than existing methods. © 2010 IEEE.
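Uniform quantization of residual energy, as mentioned in this abstract, can be sketched as follows. This is only an illustration of the quantization step under stated assumptions: the function names, the `[0, max_energy]` range, and the grouping of nodes by level are hypothetical and do not reproduce the paper's cluster formation procedure.

```python
from collections import defaultdict
from typing import Dict, List


def quantize_energy(residual: float, max_energy: float, levels: int) -> int:
    """Map residual energy in [0, max_energy] to a level in 0..levels-1.

    Assumption: all levels have equal width max_energy / levels.
    """
    if levels < 1 or max_energy <= 0:
        raise ValueError("levels must be >= 1 and max_energy must be > 0")
    clamped = min(max(residual, 0.0), max_energy)
    step = max_energy / levels
    # A node holding exactly max_energy falls into the top level.
    return min(int(clamped / step), levels - 1)


def group_by_level(energies: Dict[str, float], max_energy: float,
                   levels: int) -> Dict[int, List[str]]:
    """Group node ids by quantized energy level (hypothetical helper)."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for node, energy in sorted(energies.items()):
        groups[quantize_energy(energy, max_energy, levels)].append(node)
    return dict(groups)


# Three hypothetical nodes with a 2.0 J initial budget and 4 levels.
groups = group_by_level({"a": 0.9, "b": 1.6, "c": 1.9}, 2.0, 4)
```

With fewer levels, more nodes share a level and are treated as equivalent; with more levels, the grouping tracks residual energy more closely.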
Open Access
A catalog of stream processing optimizations
(Association for Computing Machinery, 2014) Hirzel, M.; Soulé, R.; Schneider, S.; Gedik, B.; Grimm, R.
Various research communities have independently arrived at stream processing as a programming model for efficient and parallel computing. These communities include digital signal processing, databases, operating systems, and complex event processing. Since each community faces applications with challenging performance requirements, each of them has developed some of the same optimizations, but often with conflicting terminology and unstated assumptions. This article presents a survey of optimizations for stream processing. It is aimed both at users who need to understand and guide the system's optimizer and at implementers who need to make engineering tradeoffs. To consolidate terminology, this article is organized as a catalog, in a style similar to catalogs of design patterns or refactorings. To make assumptions explicit and help understand tradeoffs, each optimization is presented with its safety constraints (when does it preserve correctness?) and a profitability experiment (when does it improve performance?). We hope that this survey will help future streaming system builders to stand on the shoulders of giants from not just their own community. © 2014 ACM.
Open Access
Domain specific language for deployment of parallel applications on parallel computing platforms
(Association for Computing Machinery, 2014-08) Arkın, E.; Tekinerdoğan, Bedir
To increase computing performance, the current trend is toward applying parallel computing, in which parallel tasks are executed on multiple nodes. The deployment of tasks on the computing platform usually impacts the overall performance and as such needs to be modelled carefully. In the architecture design community the deployment viewpoint is an important viewpoint to support this mapping process. In general the derived deployment views are visual notations that are not amenable to run-time processing, and do not scale well for deployment of large scale parallel applications. In this paper we propose a domain specific language (DSL) for modeling the deployment of parallel applications and for providing automated support for the deployment process. The DSL is based on a metamodel that is derived after a domain analysis on parallel computing. We illustrate the application of the DSL for a traffic simulation system and provide a set of important scenarios for using the DSL. © 2014 ACM.
Open Access
Energy load balancing for fixed clustering in wireless sensor networks
(IEEE, 2012) Ali, Syed Amjad; Sevgi, C.
Clustering can be used as an effective technique to achieve both energy load balancing and an extended lifetime for a wireless sensor network (WSN). This paper presents a novel approach that first creates energy balanced fixed/static clusters, and then, to attain energy load balancing within each fixed cluster, rotates the role of cluster head through a uniformly quantized energy levels-based approach to prolong the overall network lifetime. The method provided herein not only provides near-dynamic clustering performance but also reduces complexity, because the cluster formation phase is carried out only once. The presented simulation results clearly show the efficacy of the proposed algorithm; thus, it can be used as a practical approach to obtain maximized network lifetime for energy balanced clusters in fixed clustering environments. © 2012 IEEE.
Open Access
Graph analytics accelerators for cognitive systems
(Institute of Electrical and Electronics Engineers, 2017) Ozdal, M. M.; Yesil, S.; Kim, T.; Ayupov, A.; Greth, J.; Burns, S.; Ozturk, O.
Hardware accelerators are known to be performance and power efficient. This article focuses on accelerator design for graph analytics applications, which are commonly used kernels for cognitive systems. The authors propose a templatized architecture that is specifically optimized for vertex-centric graph applications with irregular memory access patterns, asynchronous execution, and asymmetric convergence. The proposed architecture addresses the limitations of existing CPU and GPU systems while providing a customizable template. The authors' experiments show that the generated accelerators can outperform a high-end CPU system with up to 3 times better performance and 65 times better power efficiency. © 1981-2012 IEEE.
Open Access
Model-driven approach for supporting the mapping of parallel algorithms to parallel computing platforms
(Springer, Berlin, Heidelberg, 2013) Arkin, E.; Tekinerdogan, Bedir; Imre, K.M.
The trend from single processor to parallel computer architectures has increased the importance of parallel computing. To support parallel computing it is important to map parallel algorithms to a computing platform that consists of multiple parallel processing nodes. In general different alternative mappings can be defined that perform differently with respect to the quality requirements for power consumption, efficiency and memory usage. The mapping process can be carried out manually for platforms with a limited number of processing nodes. However, for exascale computing in which hundreds of thousands of processing nodes are applied, the mapping process soon becomes intractable. To assist the parallel computing engineer we provide a model-driven approach to analyze, model, and select feasible mappings. We describe the developed toolset that implements the corresponding approach together with the required metamodels and model transformations. We illustrate our approach for the well-known complete exchange algorithm in parallel computing. © 2013 Springer-Verlag.
Open Access
Model-driven transformations for mapping parallel algorithms on parallel computing platforms
(MDHPCL, 2013) Arkin, E.; Tekinerdoğan, Bedir
One of the important problems in parallel computing is the mapping of the parallel algorithm to the parallel computing platform. For each parallel node, the corresponding code must be implemented. For platforms with a limited number of processing nodes this can be done manually. However, when the parallel computing platform consists of hundreds of thousands of processing nodes, the manual coding of the parallel algorithms becomes intractable and error-prone. Moreover, a change of the parallel computing platform requires considerable coding effort and time. In this paper we present a model-driven approach for generating the code of selected parallel algorithms to be mapped on parallel computing platforms. We describe the required platform independent metamodel, and the model-to-model and the model-to-text transformation patterns. We illustrate our approach for the parallel matrix multiplication algorithm. Copyright © 2013 for the individual papers by the papers' authors.
Open Access
New formulations for the hop-constrained minimum spanning tree problem via Sherali and Driscoll's tightened Miller-Tucker-Zemlin constraints
(Elsevier, 2010) Akgün, İbrahim
Given an undirected network with positive edge costs and a natural number p, the hop-constrained minimum spanning tree problem (HMST) is the problem of finding a spanning tree with minimum total cost such that each path starting from a specified root node has no more than p hops (edges). In this paper, new models based on the Miller-Tucker-Zemlin (MTZ) subtour elimination constraints are developed, and computational results together with comparisons against MTZ-based, flow-based, and hop-indexed formulations are reported. The first model is obtained by adapting the MTZ-based Asymmetric Traveling Salesman Problem formulation of Sherali and Driscoll [18] and the other two models are obtained by combining topology-enforcing and MTZ-related constraints offered by Akgün and Tansel (submitted for publication) [20] for HMST with the first model appropriately. Computational studies show that the best LP bounds of the MTZ-based models in the literature are improved by the proposed models. The best solution times of the MTZ-based models are not improved for optimally solved instances. However, the results for the harder, large-size instances imply that the proposed models are likely to produce better solution times. The proposed models do not dominate the flow-based and hop-indexed formulations with respect to LP bounds. However, good feasible solutions can be obtained in a reasonable amount of time for problems for which even the LP relaxations of the flow-based and hop-indexed formulations can be solved in about 2 days. © 2010 Elsevier Ltd. All rights reserved.
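For orientation, the basic (untightened) MTZ-style hop constraints can be written as below. This is a sketch of the standard textbook form, not the Sherali-Driscoll tightening or the models proposed in the paper. The notation is an assumption: $x_{ij} \in \{0,1\}$ selects the directed arc $(i,j)$, $u_i$ is the position (hop count) of node $i$ on its path from the root $r$, $V$ is the node set, and $A$ is the arc set.

```latex
\begin{align}
  \sum_{i : (i,j) \in A} x_{ij} &= 1
      && \forall j \in V \setminus \{r\} \\
  u_i - u_j + p\, x_{ij} &\le p - 1
      && \forall (i,j) \in A,\ i, j \ne r \\
  1 \le u_i &\le p
      && \forall i \in V \setminus \{r\}
\end{align}
```

When $x_{ij} = 1$, the second constraint reduces to $u_j \ge u_i + 1$, so positions increase strictly along every path and no cycle can form. When $x_{ij} = 0$, it reduces to $u_i - u_j \le p - 1$, which the bounds on $u$ already imply. Because $u_i \le p$, no node can lie more than $p$ hops from the root.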
Open Access
Optimizing local memory allocation and assignment through a decoupled approach
(Springer, 2010-10) Diouf, B.; Öztürk, Özcan; Cohen, A.
Software-controlled local memories (LMs) are widely used to provide fast, scalable, power efficient and predictable access to critical data. While many studies have addressed LM management, keeping hot data in the LM continues to cause a major headache. This paper revisits LM management of arrays in light of recent progress in register allocation, supporting multiple live-range splitting schemes through a generic integer linear program. These schemes differ in the grain of decision points. The model can also be extended to address fragmentation, assigning live ranges to precise offsets. We show that the links between LM management and register allocation have been underexploited, leaving many fundamental questions open and effective applications to be explored. © 2010 Springer-Verlag.
Open Access
Parallel pruning for k-means clustering on shared memory architectures
(Springer Verlag, 2001) Gürsoy, Attila; Cengiz, Ilker
We have developed and evaluated two parallelization schemes for a tree-based k-means clustering method on shared memory machines. One scheme is to partition the pattern space across processors. We have determined that spatial decomposition of patterns outperforms random decomposition even though random decomposition has almost no load imbalance problem. The other scheme is the parallel traversal of the search tree. This approach solves the load imbalance problem and performs slightly better than the spatial decomposition, but the efficiency is reduced due to thread synchronizations. In both cases, parallel tree-based k-means clustering is significantly faster than the direct parallel k-means. © Springer-Verlag Berlin Heidelberg 2001.
Open Access
Profiler and compiler assisted adaptive I/O prefetching for shared storage caches
(ACM, 2008-10) Son, S. W.; Kandemir, M.; Kolcu, I.; Muralidhara, S. P.; Öztürk, Özcan; Karakoy, M.
I/O prefetching has been employed in the past as one of the mechanisms to hide large disk latencies. However, I/O prefetching in parallel applications is problematic when multiple CPUs share the same set of disks due to the possibility that prefetches from different CPUs can interact on shared memory caches in the I/O nodes in complex and unpredictable ways. In this paper, we (i) quantify the impact of compiler-directed I/O prefetching - developed originally in the context of sequential execution - on shared caches at I/O nodes. The experimental data collected shows that while I/O prefetching brings benefits, its effectiveness reduces significantly as the number of CPUs is increased; (ii) identify inter-CPU misses due to harmful prefetches as one of the main sources for this reduction in performance with the increased number of CPUs; and (iii) propose and experimentally evaluate a profiler and compiler assisted adaptive I/O prefetching scheme targeting shared storage caches. The proposed scheme obtains inter-thread data sharing information using profiling and, based on the captured data sharing patterns, divides the threads into clusters and assigns a separate (customized) I/O prefetcher thread for each cluster. In our approach, the compiler generates the I/O prefetching threads automatically. We implemented this new I/O prefetching scheme using a compiler and the PVFS file system running on Linux, and the empirical data collected clearly underline the importance of adapting I/O prefetching based on program phases. Specifically, our proposed scheme improves performance, on average, by 19.9%, 11.9% and 10.3% over the cases without I/O prefetching, with independent I/O prefetching (each CPU is performing compiler-directed I/O prefetching independently), and with one CPU prefetching (one CPU is reserved for prefetching on behalf of others), respectively, when 8 CPUs are used. Copyright 2008 ACM.
Open Access
Realistic modeling of spectator behavior for soccer videogames with CUDA
(2011) Yılmaz, E.; Molla, E.; Yıldız, C.; İşler, V.
Soccer has always been one of the most popular videogame genres. When designing a soccer game, designers tend to focus on the game field and game play due to the limited computational resources, and thus the modeling of virtual spectators is paid less attention. In this study we present a novel approach to the modeling of spectator behavior, which treats each spectator as a unique individual. We also propose an independent software layer for sport-based games that obtains the game status from the game engine via a simple messaging protocol and computes the spectator behavior accordingly. The result is returned to the game engine, to be used in the animation and rendering of the spectators. Additionally, we offer a customizable spectator knowledge base with well-structured XML to minimize coding efforts, while generating individualized behavior. The employed AI is based on fuzzy inference. In order to overcome the additional demand for computing realistic spectator behavior, we use GPU parallel computing with CUDA. © 2011 Elsevier Ltd. All rights reserved.
Open Access
Slicing based code parallelization for minimizing inter-processor communication
(ACM, 2009-10) Kandemir, M.; Zhang, Y.; Muralidhara, S. P.; Öztürk, Özcan; Narayanan, S. H. K.
One of the critical problems in distributed memory multi-core architectures is scalable parallelization that minimizes inter-processor communication. Using the concept of iteration space slicing, this paper presents a new code parallelization scheme for data-intensive applications. This scheme targets distributed memory multi-core architectures, and formulates the problem of data-computation distribution (partitioning) across parallel processors using slicing such that, starting with the partitioning of the output arrays, it iteratively determines the partitions of other arrays as well as iteration spaces of the loop nests in the application code. The goal is to minimize inter-processor data communications. Based on this iteration space slicing based formulation of the problem, we also propose a solution scheme. The proposed data-computation scheme is evaluated using six data-intensive benchmark programs. In our experimental evaluation, we also compare this scheme against three alternate data-computation distribution schemes. The results obtained are very encouraging, indicating around 10% better speedup, with 16 processors, over the next-best scheme when averaged over all benchmark codes we tested. Copyright 2009 ACM.