Browsing by Subject "Parallelization"
Now showing 1 - 20 of 22
Item (Open Access): Accurate and efficient solutions of electromagnetic problems with the multilevel fast multipole algorithm (Bilkent University, 2009). Ergül, Özgür Salih.
The multilevel fast multipole algorithm (MLFMA) is a powerful method for the fast and efficient solution of electromagnetics problems discretized with large numbers of unknowns. This method reduces the complexity of the matrix-vector multiplications required by iterative solvers and enables the solution of large-scale problems that cannot be investigated using traditional methods. On the other hand, the efficiency and accuracy of solutions via MLFMA depend on many parameters, such as the integral-equation formulation, discretization, iterative solver, preconditioning, computing platform, parallelization, and many other details of the numerical implementation. This dissertation is based on our efforts to develop sophisticated implementations of MLFMA for the solution of real-life scattering and radiation problems involving three-dimensional complicated objects with arbitrary geometries.

Item (Open Access): Accurate solutions of extremely large integral-equation problems in computational electromagnetics (IEEE, 2013-02). Ergül, Ö.; Gürel, Levent.
Accurate simulations of real-life electromagnetics problems with integral equations require the solution of dense matrix equations involving millions of unknowns. Solutions of these extremely large problems cannot be achieved easily, even when using the most powerful computers with state-of-the-art technology. However, with the multilevel fast multipole algorithm (MLFMA) and parallel MLFMA, we have been able to obtain full-wave solutions of scattering problems discretized with hundreds of millions of unknowns. Some of the complicated real-life problems (such as scattering from a realistic aircraft) involve geometries that are larger than 1000 wavelengths.
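The role MLFMA plays in these solutions can be illustrated with a matrix-free iterative solver: the solver only ever touches the system matrix through a matrix-vector product callback, and MLFMA's contribution is to make that callback fast without storing the dense matrix. A minimal sketch, using an ordinary conjugate-gradient loop and a small dense stand-in operator rather than an actual MLFMA kernel:

```python
import numpy as np

def cg_matrix_free(matvec, b, tol=1e-10, max_iter=500):
    """Conjugate gradient that accesses the matrix only via matvec(x).

    A fast solver such as MLFMA plugs in here: its approximate
    matrix-vector product replaces the dense one, so the full matrix
    never needs to be formed or stored.
    """
    x = np.zeros_like(b)
    r = b - matvec(x)          # initial residual
    p = r.copy()               # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Dense stand-in for the integral-equation operator (made SPD for CG).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
A = A @ A.T + 50.0 * np.eye(50)
b = rng.standard_normal(50)
x = cg_matrix_free(lambda v: A @ v, b)
print(np.allclose(A @ x, b, atol=1e-6))
```

Actual integral-equation systems of the kind described here are complex and non-symmetric, so solvers such as GMRES are used in practice; CG with an SPD stand-in is chosen only to keep the sketch short and self-contained.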
Accurate solutions of such problems can be used as benchmarking data for many purposes and even as reference data for high-frequency techniques. Solutions of extremely large canonical benchmark problems involving sphere and National Aeronautics and Space Administration (NASA) Almond geometries are presented, in addition to the solution of complicated objects, such as the Flamme. The parallel implementation is also extended to solve very large dielectric problems, such as dielectric lenses and photonic crystals.

Item (Open Access): Auto-tuning similarity search algorithms on multi-core architectures (2013). Gedik, B.
In recent times, large high-dimensional datasets have become ubiquitous. Video and image repositories, financial, and sensor data are just a few examples of such datasets in practice. Many applications that use such datasets require the retrieval of data items similar to a given query item, or the nearest neighbors (NN or k-NN) of a given item. Another common query is the retrieval of multiple sets of nearest neighbors, i.e., multi-k-NN, for different query items on the same data. With commodity multi-core CPUs becoming more and more widespread at lower costs, developing parallel algorithms for these search problems has become increasingly important. While the core nearest-neighbor search problem is relatively easy to parallelize, it is challenging to tune it for optimality. This is due to the fact that the various performance-specific algorithmic parameters, or "tuning knobs", are inter-related and also depend on the data and query workloads. In this paper, we present (1) a detailed study of the various tuning knobs and their contributions to increasing the query throughput for parallelized versions of the two most common classes of high-dimensional multi-NN search algorithms, linear scan and tree traversal, and (2) an offline auto-tuner for setting these knobs by iteratively measuring actual query execution times for a given workload and dataset.
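The linear-scan class of multi-k-NN search described above can be sketched as a brute-force distance scan parallelized over queries; the worker count below stands in for one of the "tuning knobs". All names and the query-level parallelization choice are our illustration, not the paper's implementation:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def knn_linear_scan(data, query, k):
    """Brute-force k-NN: squared Euclidean distances, then a partial sort."""
    d = ((data - query) ** 2).sum(axis=1)
    nearest = np.argpartition(d, k)[:k]       # k smallest, unordered
    return nearest[np.argsort(d[nearest])]    # ordered by distance

def multi_knn(data, queries, k, n_workers=4):
    """Answer many k-NN queries in parallel; n_workers is one tuning knob."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda q: knn_linear_scan(data, q, k), queries))

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 16))
queries = [data[i] + 0.001 for i in range(5)]  # tiny perturbations of points
results = multi_knn(data, queries, k=3)
print([int(r[0]) for r in results])  # nearest neighbor of query i is point i
```

A real auto-tuner of the kind the paper describes would vary knobs such as worker count, chunking, and data layout while measuring actual query throughput, rather than fixing them as done here.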
We show experimentally that our auto-tuner reaches near-optimal performance and significantly outperforms un-tuned versions of parallel multi-NN algorithms for real video repository data on a variety of multi-core platforms. © 2013 Springer Science+Business Media New York.

Item (Open Access): Autopipelining for data stream processing (Institute of Electrical and Electronics Engineers, 2013). Tang, Y.; Gedik, B.
Stream processing applications use online analytics to ingest high-rate data sources, process them on-the-fly, and generate live results in a timely manner. The data flow graph representation of these applications facilitates the specification of stream computing tasks with ease, and also lends itself to possible runtime exploitation of parallelization on multicore processors. While data flow graphs naturally contain a rich set of parallelization opportunities, exploiting them is challenging due to the combinatorial number of possible configurations. Furthermore, the best configuration is dynamic in nature; it can differ across multiple runs of the application, and even during different phases of the same run. In this paper, we propose an autopipelining solution that can take advantage of multicore processors to improve the throughput of streaming applications in an effective and transparent way. The solution is effective in the sense that it provides good utilization of resources by dynamically finding and exploiting sources of pipeline parallelism in streaming applications. It is transparent in the sense that it does not require any hints from the application developers. As part of our solution, we describe a lightweight runtime profiling scheme to learn the resource usage of the operators comprising the application, an optimization algorithm to locate the best places in the data flow graph to explore additional parallelism, and an adaptive control scheme to find the right level of parallelism.
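The pipeline parallelism that autopipelining exploits can be sketched as operators running on separate threads connected by queues, so that adjacent operators of the flow graph process different tuples concurrently. A minimal illustration; the two operators and the poison-pill shutdown are our own simplification, whereas the paper's system discovers and adjusts such pipeline stages automatically at run-time:

```python
import threading
import queue

def stage(fn, inq, outq):
    """One pipeline stage: consume a tuple, apply the operator, forward it."""
    while True:
        item = inq.get()
        if item is None:       # poison pill: propagate shutdown downstream
            outq.put(None)
            return
        outq.put(fn(item))

# Hypothetical two-operator chain running as a two-stage pipeline.
q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
t1 = threading.Thread(target=stage, args=(lambda x: x * 2, q0, q1))
t2 = threading.Thread(target=stage, args=(lambda x: x + 1, q1, q2))
t1.start()
t2.start()
for i in range(5):             # feed the source queue
    q0.put(i)
q0.put(None)
out = []
while (item := q2.get()) is not None:
    out.append(item)
t1.join()
t2.join()
print(out)  # [1, 3, 5, 7, 9]
```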
We have implemented our solution in an industrial-strength stream processing system. Our experimental evaluation based on microbenchmarks, synthetic workloads, as well as real-world applications confirms that our design is effective in optimizing the throughput of stream processing applications without requiring any changes to the application code. © 1990-2012 IEEE.

Item (Open Access): Efficient solution of the combined-field integral equation with the parallel multilevel fast multipole algorithm (IEEE, 2007-08). Gürel, Levent; Ergül, Özgür.
We present fast and accurate solutions of large-scale scattering problems formulated with the combined-field integral equation. Using the multilevel fast multipole algorithm (MLFMA) parallelized on a cluster of computers, we easily solve scattering problems that are discretized with tens of millions of unknowns. For the efficient parallelization of MLFMA, we propose a hierarchical partitioning scheme based on distributing the multilevel tree among the processors with improved load balancing. The accuracy of the solutions is demonstrated on scattering problems involving spheres of various radii from 80λ to 110λ. In addition to canonical problems, we also present the solution of real-life problems involving complicated targets with large dimensions. © 2007 IEEE.

Item (Open Access): Elastic scaling for data stream processing (IEEE Computer Society, 2014). Gedik, B.; Schneider, S.; Hirzel, M.; Wu, Kun-Lung.
This article addresses the profitability problem associated with auto-parallelization of general-purpose distributed data stream processing applications. Auto-parallelization involves locating regions in the application's data flow graph that can be replicated at run-time to apply data partitioning, in order to achieve scale. In order to make auto-parallelization effective in practice, the profitability question needs to be answered: how many parallel channels provide the best throughput?
The answer to this question changes depending on the workload dynamics and resource availability at run-time. In this article, we propose an elastic auto-parallelization solution that can dynamically adjust the number of channels used to achieve high throughput without unnecessarily wasting resources. Most importantly, our solution can handle partitioned stateful operators via run-time state migration, which is fully transparent to the application developers. We provide an implementation and evaluation of the system on an industrial-strength data stream processing platform to validate our solution. © 1990-2012 IEEE.

Item (Open Access): Fast and accurate solutions of extremely large integral-equation problems discretised with tens of millions of unknowns (The Institution of Engineering and Technology, 2007). Gürel, Levent; Ergül, Özgür.
The solution of extremely large scattering problems that are formulated by integral equations and discretised with tens of millions of unknowns is reported. Accurate and efficient solutions are performed by employing a parallel implementation of the multilevel fast multipole algorithm. The effectiveness of the implementation is demonstrated on a sphere problem containing more than 33 million unknowns, which is the largest integral-equation problem ever solved, to our knowledge.

Item (Open Access): A hierarchical partitioning strategy for an efficient parallelization of the multilevel fast multipole algorithm (IEEE, 2009). Ergül, Özgür; Gürel, Levent.
We present a novel hierarchical partitioning strategy for the efficient parallelization of the multilevel fast multipole algorithm (MLFMA) on distributed-memory architectures to solve large-scale problems in electromagnetics. Unlike previous parallelization techniques, the tree structure of MLFMA is distributed among processors by partitioning both the clusters and the samples of fields at each level.
Due to the improved load balancing, the hierarchical strategy offers a higher parallelization efficiency than previous approaches, especially when the number of processors is large. We demonstrate the improved efficiency on scattering problems discretized with millions of unknowns. In addition, we present the effectiveness of our algorithm by solving very large scattering problems involving a conducting sphere of radius 210 wavelengths and a complicated real-life target with a maximum dimension of 880 wavelengths. Both objects are discretized with more than 200 million unknowns.

Item (Open Access): Joker: elastic stream processing with organic adaptation (Elsevier, 2020). Kahveci, Basri; Gedik, Buğra.
This paper addresses the problem of auto-parallelization of streaming applications. We propose an online parallelization optimization algorithm that adjusts the degree of pipeline and data parallelism in a joint manner. We define an operator development API and a flexible parallel execution model to form a basis for the optimization algorithm. The operator interface unifies the development of different types of operators and makes operator properties visible in order to enable safe optimizations. The parallel execution model splits a data flow graph into regions. A region contains the longest sequence of compatible operators that are amenable to data parallelism as a whole and can be further parallelized with pipeline parallelism. We also develop a stream processing run-time, named Joker, to scale the execution of streaming applications in a safe, transparent, dynamic, and automatic manner. This ability is called organic adaptation. Joker implements the runtime machinery to execute a data flow graph with any parallelization configuration and, most importantly, to change this configuration at run-time with low cost in the presence of partitioned stateful operators, in a way that is transparent to the application developers.
Joker continuously monitors the run-time performance and runs the optimization algorithm to resolve bottlenecks and scale the application by adjusting the degree of pipeline and data parallelism. The experimental evaluation, based on micro-benchmarks and real-world applications, shows that our solution accomplishes elasticity by finding an effective parallelization configuration.

Item (Open Access): Memory-efficient boundary-preserving tetrahedralization of large three-dimensional meshes (Springer Science and Business Media Deutschland GmbH, 2023-05-09). Erkoç, Ziya; Güdükbay, Uğur; Si, H.
We propose a divide-and-conquer algorithm to tetrahedralize three-dimensional meshes in a boundary-preserving fashion. It consists of three stages: Input Partitioning, Surface Closure, and Merge. We first partition the input into several pieces to reduce the problem size. We apply 2D triangulation to close the open boundaries and make the new pieces watertight. Each piece is then sent to TetGen, a Delaunay-based tetrahedral mesh generator tool that forms the basis for our implementation. We finally merge the tetrahedral meshes to calculate the final solution. In addition, we apply post-processing to remove the vertices we introduced during the input partitioning stage, in order to preserve the input triangles. The benefit of our approach is that it can reduce peak memory usage or increase the speed of the process. It can even tetrahedralize meshes that TetGen alone cannot handle due to its peak memory requirement.

Item (Open Access): Memory-efficient constrained Delaunay tetrahedralization of large three-dimensional triangular meshes (Bilkent University, 2022-07). Erkoç, Ziya.
We propose a divide-and-conquer algorithm that can solve the Constrained Delaunay Tetrahedralization (CDT) problem. It consists of three stages: Input Partitioning, Surface Closure, and Merge. We first partition the input into several pieces to reduce the problem size.
We apply 2D triangulation to close the open boundaries and make the new pieces watertight. Each piece is then sent to TetGen [Hang Si, “TetGen, a Delaunay-Based Quality Tetrahedral Mesh Generator”, ACM Transactions on Mathematical Software, Vol. 41, No. 2, Article No. 11, 36 pages, January 2015] for processing. We finally merge the tetrahedral meshes to calculate the final solution. In addition, we apply post-processing to remove the vertices we introduced during the input partitioning stage, in order to preserve the input triangles. An alternative approach that does not insert new vertices and eliminates the need for post-processing is also possible, but it is not robust. The benefit of our method is that it can reduce memory usage or increase the speed of the process. It can even tetrahedralize meshes that TetGen alone cannot handle due to insufficient memory. We also observe that this method can increase the overall tetrahedral mesh quality.

Item (Open Access): Parallel hardware and software implementations for electromagnetic computations (Bilkent University, 2005). Bozbulut, Ali Rıza.
The multilevel fast multipole algorithm (MLFMA) is an accurate frequency-domain electromagnetics solver that significantly reduces computational complexity and memory requirements. Despite the advantages of the MLFMA, the maximum size of an electromagnetic problem that can be solved on a single-processor computer is still limited by the hardware resources of the system, i.e., memory and processor speed. To go beyond the hardware limitations of single-processor systems, parallelization of the MLFMA, which is not a trivial task, is suggested. This process requires parallel implementations of both hardware and software. For this purpose, we constructed our own parallel computer clusters and parallelized our MLFMA program using the message-passing paradigm to solve electromagnetics problems.
To balance the workload and memory requirements across the processors of multiprocessor systems, efficient load-balancing techniques and algorithms are included in this parallel code. As a result, we can solve large-scale electromagnetics problems accurately and rapidly with the parallel MLFMA solver on parallel clusters.

Item (Open Access): Parallel sparse matrix vector multiplication techniques for shared memory architectures (Bilkent University, 2014). Başaran, Mehmet.
SpMxV (sparse matrix-vector multiplication) is a kernel operation in linear solvers, in which a sparse matrix is repeatedly multiplied with a dense vector. Due to the random memory access patterns exhibited by the SpMxV operation, hardware components such as prefetchers, CPU caches, and built-in SIMD units are underutilized, limiting parallelization efficiency. In this study we developed:
• adaptive runtime scheduling and load-balancing algorithms for shared-memory systems,
• a hybrid storage format to help effectively vectorize sub-matrices, and
• an algorithm to extract the proposed hybrid sub-matrix storage format.
The implemented techniques are designed to be used by both hypergraph-partitioning-powered and spontaneous SpMxV operations. Tests are carried out on the Knights Corner (KNC) coprocessor, an x86-based many-core architecture employing a NoC (network-on-chip) communication subsystem. However, the proposed techniques can also be implemented for GPUs (graphics processing units).

Item (Open Access): Parallel-MLFMA solution of CFIE discretized with tens of millions of unknowns (Institution of Engineering and Technology, 2007). Ergül, Özgür; Gürel, Levent.
We consider the solution of large scattering problems in electromagnetics involving three-dimensional arbitrary geometries with closed surfaces. The problems are formulated accurately with the combined-field integral equation, and the resulting dense matrix equations are solved iteratively by employing the multilevel fast multipole algorithm (MLFMA).
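The SpMxV kernel at the center of the shared-memory study above is commonly expressed over the CSR (compressed sparse row) layout, where the per-row loop is the natural unit that runtime schedulers distribute across threads. A minimal sequential sketch; it shows plain CSR only, not the thesis's hybrid storage format:

```python
import numpy as np

def spmv_csr(vals, col_idx, row_ptr, x):
    """y = A @ x with A in CSR form: row i's nonzeros occupy
    vals[row_ptr[i]:row_ptr[i+1]], with column indices in col_idx."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):                     # rows are independent:
        lo, hi = row_ptr[i], row_ptr[i + 1]     # this loop is what a
        y[i] = vals[lo:hi] @ x[col_idx[lo:hi]]  # scheduler splits across threads
    return y

# 3x3 example matrix: [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
vals    = np.array([4.0, 1.0, 3.0, 2.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 2.0, 3.0])
print(spmv_csr(vals, col_idx, row_ptr, x))  # [ 7.  6. 17.]
```

The random access into x through col_idx is the source of the cache and prefetcher underutilization the thesis targets.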
With an efficient parallelization of MLFMA on relatively inexpensive computing platforms using distributed-memory architectures, we easily solve large-scale problems that are discretized with tens of millions of unknowns. The accuracy of the solutions is demonstrated on scattering problems involving spheres of various sizes, including a sphere of radius 110λ discretized with 41,883,638 unknowns, which is the largest integral-equation problem ever solved, to the best of our knowledge. In addition to canonical problems, we also present the solution of real-life problems involving complicated targets with large dimensions.

Item (Open Access): Pipelined fission for stream programs with dynamic selectivity and partitioned state (Bilkent University, 2014-12). Özsema, Habibe Güldamla.
There is an ever-increasing amount of digital information available in the form of online data streams. In many application domains, high-throughput processing of such data is a critical requirement for keeping up with the soaring input rates. Data stream processing is a computational paradigm that aims at addressing this challenge by processing data streams in an on-the-fly manner. In this thesis, we study the problem of automatically parallelizing data stream processing applications to improve throughput. The parallelization is automatic in the sense that stream programs are written sequentially by the application developers and are parallelized by the system. We adopt the asynchronous data flow model for our work, where operators often have dynamic selectivity and are stateful. We solve the problem of pipelined fission, in which the original sequential program is parallelized by taking advantage of both pipeline and data parallelism at the same time. Our solution supports partitioned stateful data parallelism with dynamic selectivity and is designed for shared-memory multi-core machines. We first develop a cost-based formulation to express pipelined fission as an optimization problem.
The brute-force solution of this problem takes a very long time even for moderately sized stream programs. Accordingly, we develop a heuristic algorithm that can solve this problem quickly, albeit approximately. We provide an extensive evaluation of the performance of our solution, including simulations and experiments with an industrial-strength data stream processing system (DSPS). Our results show good scalability for applications that contain sufficient parallelism, and performance close to optimal for the heuristic algorithm.

Item (Open Access): A scratch-pad memory aware dynamic loop scheduling algorithm (IEEE, 2008-03). Öztürk, Özcan; Kandemir, M.; Narayanan, S. H. K.
Executing array-based applications on a chip multiprocessor requires effective loop parallelization techniques. One of the critical issues that needs to be tackled by an optimizing compiler in this context is loop scheduling, which distributes the iterations of a loop to be executed in parallel across the available processors. Most of the existing work in this area targets cache-based execution platforms. In comparison, this paper proposes the first dynamic loop scheduler, to our knowledge, that targets scratch-pad memory (SPM) based chip multiprocessors, and presents an experimental evaluation of it. The main idea behind our approach is to identify the set of loop iterations that access the SPM and those that do not. This information is exploited at runtime to balance the loads of the processors involved in executing the loop nest at hand. Therefore, the proposed dynamic scheduler takes advantage of the SPM in performing the loop iteration-to-processor mapping. Our experimental evaluation with eight array/loop-intensive applications reveals that the proposed scheduler is very effective in practice and brings between 13.7% and 41.7% performance savings over a static loop scheduling scheme, which is also tested in our experiments.
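Dynamic loop scheduling of the kind proposed above can be sketched with a shared iteration counter from which workers claim fixed-size chunks, so load imbalance is absorbed at run-time: faster workers simply claim more chunks. A minimal sketch; the SPM-awareness of the paper's scheduler (steering iterations by whether they hit scratch-pad memory) is not modeled, and the chunk size and worker count are illustrative:

```python
import threading

def dynamic_schedule(n_iters, n_workers, chunk, body):
    """Self-scheduling loop: each worker repeatedly claims the next
    `chunk` iterations from a shared counter until none remain."""
    next_iter = [0]
    lock = threading.Lock()

    def worker():
        while True:
            with lock:                       # claim the next chunk
                start = next_iter[0]
                next_iter[0] += chunk
            if start >= n_iters:
                return
            for i in range(start, min(start + chunk, n_iters)):
                body(i)                      # execute the loop body

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

results = [0] * 100
dynamic_schedule(100, n_workers=4, chunk=8,
                 body=lambda i: results.__setitem__(i, i * i))
print(results[:5])  # [0, 1, 4, 9, 16]
```

A static scheme would instead assign each worker a fixed contiguous block of iterations up front, which is the baseline the paper compares against.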
© 2008 IEEE.

Item (Open Access): Site-based partitioning and repartitioning techniques for parallel PageRank computation (Institute of Electrical and Electronics Engineers, 2011-05). Cevahir, A.; Aykanat, Cevdet; Turk, A.; Cambazoglu, B. B.
The PageRank algorithm is an important component of effective web search. At the core of this algorithm are repeated sparse matrix-vector multiplications, where the involved web matrices grow in parallel with the growth of the web and are stored in a distributed manner due to space limitations. Hence, the PageRank computation, which is frequently repeated, must be performed in parallel with high efficiency and low preprocessing overhead, while considering the initially distributed nature of the web matrices. Our contributions in this work are twofold. We first investigate the application of state-of-the-art sparse matrix partitioning models to attain high efficiency in parallel PageRank computations, with a particular focus on reducing the preprocessing overhead they introduce. For this purpose, we evaluate two different compression schemes on the web matrix using the site information inherently available in links. Second, we consider the more realistic scenario of starting with initially distributed data and extend our algorithms to cover the repartitioning of such data for efficient PageRank computation. We report performance results using our parallelization of a state-of-the-art PageRank algorithm on two different PC clusters with 40 and 64 processors. Experiments show that the proposed techniques achieve considerably high speedups while incurring a preprocessing overhead of only a few iterations (for some instances, even less than a single iteration) of the underlying sequential PageRank algorithm.
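The repeated sparse matrix-vector multiplication at the heart of PageRank is the power iteration sketched below; in the parallel setting, it is this matrix and its vectors that the partitioning models distribute. A self-contained dense sketch on a toy four-page web (the damping factor and graph are illustrative; real web matrices are sparse and distributed):

```python
import numpy as np

def pagerank(out_links, d=0.85, tol=1e-10):
    """Power iteration: r <- (1-d)/n + d * (M @ r), where M is the
    column-stochastic link matrix. The M @ r product is the repeated
    (in practice sparse) matrix-vector multiplication."""
    n = len(out_links)
    M = np.zeros((n, n))
    for src, outs in enumerate(out_links):
        for dst in outs:
            M[dst, src] = 1.0 / len(outs)   # src spreads rank evenly
    r = np.full(n, 1.0 / n)                 # uniform starting vector
    while True:
        r_new = (1.0 - d) / n + d * (M @ r)
        if np.abs(r_new - r).sum() < tol:   # converged in L1 norm
            return r_new
        r = r_new

# Toy web: page 0 links to 1 and 2; 1 -> 2; 2 -> 0; 3 -> 2.
ranks = pagerank([[1, 2], [2], [0], [2]])
print(ranks.argmax())  # page 2 collects the most rank
```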
© 2011 IEEE.

Item (Open Access): Solution of extremely large integral-equation problems (IEEE, 2007). Ergül, Özgür; Malas, Tahir; Gürel, Levent.
We report the solution of extremely large integral-equation problems involving electromagnetic scattering from conducting bodies. By orchestrating diverse activities, such as the multilevel fast multipole algorithm, iterative methods, preconditioning techniques, and parallelization, we are able to solve scattering problems that are discretized with tens of millions of unknowns. Specifically, we report the solution of a closed geometry containing 42 million unknowns and an open geometry containing 20 million unknowns, which are the largest problems of their classes, to the best of our knowledge.

Item (Open Access): Solution of large-scale scattering problems with the multilevel fast multipole algorithm parallelized on distributed-memory architectures (IEEE, 2007). Ergül, Özgür; Gürel, Levent.
We present the solution of large-scale scattering problems involving three-dimensional closed conducting objects with arbitrary shapes. With an efficient parallelization of the multilevel fast multipole algorithm on relatively inexpensive computational platforms using distributed-memory architectures, we perform the iterative solution of integral-equation formulations that are discretized with tens of millions of unknowns. In addition to canonical problems, we also present the solution of real-life problems involving complicated targets with large dimensions.

Item (Open Access): Solutions of large integral-equation problems with preconditioned MLFMA (IEEE, 2007). Ergül, Özgür; Malas, Tahir; Ünal, Alper; Gürel, Levent.
We report the solution of the largest integral-equation problems in computational electromagnetics. We consider matrix equations obtained from the discretization of integral-equation formulations, which are solved iteratively by employing the parallel multilevel fast multipole algorithm (MLFMA).
With the efficient parallelization of MLFMA, scattering and radiation problems with millions of unknowns are easily solved on relatively inexpensive computational platforms. For the iterative solutions of the matrix equations, we are able to obtain accelerated convergence, even for ill-conditioned matrix equations, using advanced preconditioning schemes, such as nested preconditioners based on an approximate MLFMA. By orchestrating these diverse activities, we have been able to solve a closed geometry formulated with the CFIE containing 33 million unknowns and an open geometry formulated with the EFIE containing 12 million unknowns, which are the largest problems of their classes, to the best of our knowledge.