Browsing by Subject "Parallelizations"

Now showing 1 - 12 of 12

Open Access
Accelerating the HyperLogLog cardinality estimation algorithm
(Hindawi Limited, 2017) Bozkus, C.; Fraguela, B. B.
In recent years, vast amounts of data of different kinds, from pictures and videos from our cameras to software logs from sensor networks and Internet routers operating day and night, are being generated. This has led to new big data problems, which require new algorithms to handle these large volumes of data and as a result are very computationally demanding because of the volumes to process. In this paper, we parallelize one of these new algorithms, namely, the HyperLogLog algorithm, which estimates the number of different items in a large data set with minimal memory usage, as it lowers the typical memory usage of this type of calculation from O(n) to O(1). We have implemented parallelizations based on OpenMP and OpenCL and evaluated them in a standard multicore system, an Intel Xeon Phi, and two GPUs from different vendors. The results obtained in our experiments, in which we reach a speedup of 88.6 with respect to an optimized sequential implementation, are very positive, particularly taking into account the need to run this kind of algorithm on large amounts of data. © 2017 Cem Bozkus and Basilio B. Fraguela.
Open Access
Advanced partitioning and communication strategies for the efficient parallelization of the multilevel fast multipole algorithm
(IEEE, 2010) Ergül O.; Gürel, Levent
Large-scale electromagnetics problems can be solved efficiently with the multilevel fast multipole algorithm (MLFMA) [1], which reduces the complexity of matrix-vector multiplications required by iterative solvers from O(N 2) to O(N logN). Parallelization of MLFMA on distributed-memory architectures enables fast and accurate solutions of radiation and scattering problems discretized with millions of unknowns using limited computational resources. Recently, we developed a hierarchical partitioning strategy [2], which provides an efficient parallelization of MLFMA, allowing for the solution of very large problems involving hundreds of millions of unknowns. In this strategy, both clusters (sub-domains) of the multilevel tree structure and their samples are partitioned among processors, which leads to improved load-balancing. We also show that communications between processors are reduced and the communication time is shortened, compared to previous parallelization strategies in the literature. On the other hand, improved partitioning of the tree structure complicates the arrangement of communications between processors. In this paper, we discuss communications in detail when MLFMA is parallelized using the hierarchical partitioning strategy. We present well-organized arrangements of communications in order to maximize the efficiency offered by the improved partitioning. We demonstrate the effectiveness of the resulting parallel implementation on a very large scattering problem involving a conducting sphere discretized with 375 million unknowns. ©2010 IEEE.
Open Access
Code scheduling for optimizing parallelism and data locality
(Springer, 2010-08-09) Yemliha, T.; Kandemir, M.; Öztürk, Özcan; Kultursay, E.; Muralidhara, S. P.
As chip multiprocessors proliferate, programming support for these devices is likely to receive a lot of attention in the near future. Parallelism and data locality are two critical issues in a chip multiprocessor environment. Unfortunately, most of the published work in the literature focuses only on one of these problems, and this can prevent one from achieving the best possible performance. The main goal of this paper is to propose and evaluate a compiler-directed code parallelization scheme, which considers both parallelism and data locality at the same time. Our compiler captures the inherent parallelism and data reuse in the application code being analyzed using a novel representation called the locality-parallelism graph (LPG). Our partitioning/scheduling algorithm assigns the nodes of this graph to the processors in the architecture and schedules them for execution. We implemented this algorithm and evaluated its effectiveness using a set of benchmark codes. The results collected so far indicate that our approach improves overall execution latency significantly. In this paper, we also introduce an ILP (Integer Linear Programming) based formulation of the problem, and implement the schedule obtained by the ILP solver. The results indicate that our approach gets within 4% of the ILP solution. © 2010 Springer-Verlag.
Open Access
Comparison of two image-space subdivision algorithms for direct volume rendering on distributed-memory multicomputers
(Springer, 1995-08) Tanin, Egemen; Kurç, Tahsin M.; Aykanat, Cevdet; Özgüç, Bülent
Direct Volume Rendering (DVR) is a powerful technique for visualizing volumetric data sets. However, it involves intensive computations. In addition, most of the volumetric data sets consist of large number of 3D sampling points. Therefore, visualization of such data sets also requires large computer memory space. Hence, DVR is a good candidate for parallelization on distributed-memory multicomputers. In this work, image-space parallelization of Raycasting based DVR for unstructured grids on distributed-memory multicomputers is presented and discussed. In order to visualize unstructured volumetric datasets where grid points of the dataset are irregularly distributed over the 3D space, the underlying algorithms should resolve the point location and view sort problems of the 3D grid points. In this paper, these problems are solved using a Scanline Z-buffer based algorithm. Two image space subdivision heuristics, namely horizontal and recursive rectangular subdivision heuristics, are utilized to distribute the computations evenly among the processors in the rendering phase. The horizontal subdivision algorithm divides the image space into horizontal bands composed of consecutive scanlines. In the recursive subdivision algorithm, the image space is divided into rectangular subregions recursively. The experimental performance evaluation of the horizontal and recursive subdivision algorithms on an IBM SP2 system are presented and discussed. © Springer-Verlag Berlin Heidelberg 1996.
Open Access
Compiler-directed energy reduction using dynamic voltage scaling and voltage islands for embedded systems
(Institute of Electrical and Electronics Engineers, 2013) Ozturk, O.; Kandemir, M.; Chen G.
Addressing power and energy consumption related issues early in the system design flow ensures good design and minimizes iterations for faster turnaround time. In particular, optimizations at software level, e.g., those supported by compilers, are very important for minimizing energy consumption of embedded applications. Recent research demonstrates that voltage islands provide the flexibility to reduce power by selectively shutting down the different regions of the chip and/or running the select parts of the chip at different voltage/frequency levels. As against most of the prior work on voltage islands that mainly focused on the architecture design and IP placement related issues, this paper studies the necessary software compiler support for voltage islands. Specifically, we focus on an embedded multiprocessor architecture that supports both voltage islands and control domains within these islands, and determine how an optimizing compiler can automatically map an embedded application onto this architecture. Such an automated support is critical since it is unrealistic to expect an application programmer to reach a good mapping correlating multiple factors such as performance and energy at the same time. Our experiments with the proposed compiler support show that our approach is very effective in reducing energy consumption. The experiments also show that the energy savings we achieve are consistent across a wide range of values of our major simulation parameters. © 1968-2012 IEEE.
Open Access
Decomposing linear programs for parallel solution
(Springer, 1995-08) Pınar, Ali; Çatalyürek Ümit V.; Aykanat, Cevdet; Pınar, Mustafa
Coarse grain parallelism inherent in the solution of Linear Programming (LP) problems with block angular constraint matrices has been exploited in recent research works. However, these approaches suffer from unscalability and load imbalance since they exploit only the existing block angular structure of the LP constraint matrix. In this paper, we consider decomposing LP constraint matrices to obtain block angular structures with specified number of blocks for scalable parallelization. We propose hypergraph models to represent LP constraint matrices for decomposition. In these models, the decomposition problem reduces to the well-known hypergraph partitioning problem. A Kernighan-Lin based multiway hypergraph partitioning heuristic is implemented for experimenting with the performance of the proposed hypergraph models on the decomposition of the LP problems selected from NETLIB suite. Initial results are promising and justify further research on other hypergraph partitioning heuristics for decomposing large LP problems. © Springer-Verlag Berlin Heidelberg 1996.
Open Access
Elastic scaling for data stream processing
(IEEE Computer Society, 2014) Gedik, B.; Schneider S.; Hirzel M.; Wu, Kun-Lung
This article addresses the profitability problem associated with auto-parallelization of general-purpose distributed data stream processing applications. Auto-parallelization involves locating regions in the application's data flow graph that can be replicated at run-time to apply data partitioning, in order to achieve scale. In order to make auto-parallelization effective in practice, the profitability question needs to be answered: How many parallel channels provide the best throughput? The answer to this question changes depending on the workload dynamics and resource availability at run-time. In this article, we propose an elastic auto-parallelization solution that can dynamically adjust the number of channels used to achieve high throughput without unnecessarily wasting resources. Most importantly, our solution can handle partitioned stateful operators via run-time state migration, which is fully transparent to the application developers. We provide an implementation and evaluation of the system on an industrial-strength data stream processing platform to validate our solution. © 1990-2012 IEEE.
Open Access
Hierarchical parallelization of the multilevel fast multipole algorithm (MLFMA)
(IEEE, 2013) Gürel, Levent; Ergül, Özgür
Due to its O(N log N) complexity, the multilevel fast multipole algorithm (MLFMA) is one of the most prized algorithms of computational electromagnetics and certain other disciplines. Various implementations of this algorithm have been used for rigorous solutions of large-scale scattering, radiation, and miscellaneous other electromagnetics problems involving 3-D objects with arbitrary geometries. Parallelization of MLFMA is crucial for solving real-life problems discretized with hundreds of millions of unknowns. This paper presents the hierarchical partitioning strategy, which provides a very efficient parallelization of MLFMA on distributed-memory architectures. We discuss the advantages of the hierarchical strategy over previous approaches and demonstrate the improved efficiency on scattering problems discretized with millions of unknowns. © 1963-2012 IEEE.
Open Access
Parallel pruning for k-means clustering on shared memory architectures
(Springer Verlag, 2001) Gürsoy, Attila; Cengiz, Ilker
We have developed and evaluated two parallelization schemes for a tree-based k-means clustering method on shared memory machines. One scheme is to partition the pattern space across processors. We have determined that spatial decomposition of patterns outperforms random decomposition even though random decomposition has almost no load imbalance problem. The other scheme is the parallel traverse of the search tree. This approach solves the load imbalance problem and performs slightly better than the spatial decomposition, but the efficiency is reduced due to thread synchronizations. In both cases, parallel treebased k-means clustering is significantly faster than the direct parallel k-means. © Springer-Verlag Berlin Heidelberg 2001.
Open Access
Reducing MLFMA memory with out-of-core implementation and data-structure parallelization
(IEEE, 2013) Hidayetoğlu, Mert; Karaosmanoğlu, Barışcan; Gürel, Levent
We present two memory-reduction methods for the parallel multilevel fast multipole algorithm (MLFMA). The first method implements out-of-core techniques and the second method parallelizes the pre-processing data structures. Together, these methods decrease parallel MLFMA memory bottlenecks, and hence fast and accurate solutions can be achieved for largescale electromagnetics problems.
Open Access
Scalable linkage across location enhanced services
(CEUR-WS, 2017-08) Basık, Fuat
In this work, we investigate methods for merging spatio-temporal usage and entity records across two location-enhanced services, even when the datasets are semantically different. To address both effectiveness and efficiency, we study this linkage problem in two parts: model and framework. First we discuss models, including κ-l diversity- a concept we developed to capture both spatial and temporal diversity aspects of the linkage, and probabilistic linkage. Second, we aim to develop a framework that brings efficient computation and parallelization support for both models of linkage.
Open Access
Slicing based code parallelization for minimizing inter-processor communication
(ACM, 2009-10) Kandemir, M.; Zhang, Y.; Muralidhara, S. P.; Öztürk, Özcan; Narayanan, S. H. K.
One of the critical problems in distributed memory multi-core architectures is scalable parallelization that minimizes inter-processor communication. Using the concept of iteration space slicing, this paper presents a new code parallelization scheme for data-intensive applications. This scheme targets distributed memory multi-core architectures, and formulates the problem of data-computation distribution (partitioning) across parallel processors using slicing such that, starting with the partitioning of the output arrays, it iteratively determines the partitions of other arrays as well as iteration spaces of the loop nests in the application code. The goal is to minimize inter-processor data communications. Based on this iteration space slicing based formulation of the problem, we also propose a solution scheme. The proposed data-computation scheme is evaluated using six data-intensive benchmark programs. In our experimental evaluation, we also compare this scheme against three alternate data-computation distribution schemes. The results obtained are very encouraging, indicating around 10% better speedup, with 16 processors, over the next-best scheme when averaged over all benchmark codes we tested. Copyright 2009 ACM.