Browsing by Subject "Parallel computing"
Now showing 1 - 20 of 23
Item Open Access: Coloring for distributed-memory-parallel Gauss-Seidel algorithm (2019-09), Koçak, Onur
Gauss-Seidel is a well-known iterative method for solving linear systems of equations. The computations performed in Gauss-Seidel sweeps are sequential in nature, since each component of a new iterate depends on previously computed results. Graph coloring is widely used for extracting parallelism in Gauss-Seidel by eliminating the data dependencies caused by precedence in the calculations. In this thesis, we present a method that provides a better coloring for the distributed-memory-parallel Gauss-Seidel algorithm. Our method utilizes combinatorial approaches, including graph partitioning and balanced graph coloring, in order to decrease the number of colors while maintaining a computational load balance among the color classes. Experiments performed on irregular sparse problems arising from various scientific applications show that our model effectively reduces the required number of colors, and thus the number of parallel sweeps, in the Gauss-Seidel algorithm.

Item Open Access: Domain specific language for deployment of parallel applications on parallel computing platforms (Association for Computing Machinery, 2014-08), Arkın, E.; Tekinerdoğan, Bedir
To increase computing performance, the current trend is towards parallel computing, in which parallel tasks are executed on multiple nodes. The deployment of tasks on the computing platform usually impacts the overall performance and as such needs to be modelled carefully. In the architecture design community, the deployment viewpoint is an important viewpoint to support this mapping process. In general, the derived deployment views are visual notations that are not amenable to run-time processing and do not scale well for the deployment of large-scale parallel applications. In this paper we propose a domain-specific language (DSL) for modeling the deployment of parallel applications and for providing automated support for the deployment process. The DSL is based on a metamodel that is derived after a domain analysis on parallel computing. We illustrate the application of the DSL for a traffic simulation system and provide a set of important scenarios for using the DSL. © 2014 ACM.

Item Open Access: Efficient overlapped FFT algorithms for hypercube-connected multicomputers (1994), Aykanat, Cevdet; Dervis, A.
In this work, we propose parallel FFT algorithms for medium-to-coarse-grain hypercube-connected multicomputers which are more elegant and efficient than the existing ones. The proposed algorithms achieve perfect load balance for the efficient simplified-butterfly scheme and minimize the communication overhead by decreasing both the number and the volume of concurrent communications. Communication and computation cannot be overlapped easily due to the strong data dependencies in the FFT algorithm. In this paper, we propose a restructuring of the FFT algorithm which enables overlapping each communication with one fifth of the local computations involved in a stage. Two of the proposed parallel FFT algorithms achieve overlapping by exploiting this restructuring while using the efficient table-lookup scheme for complex coefficients. The proposed algorithms are implemented on Intel's 32-node iPSC/2 hypercube multicomputer. High efficiency values are obtained even for small FFT problem sizes. © 1994, Taylor & Francis Group, LLC. All rights reserved.
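For readers new to the butterfly structure that the FFT item above restructures for communication overlap, here is a minimal sequential sketch of an iterative radix-2 FFT; it is our own illustration (assuming a power-of-two length), not code from the paper, whose parallel schemes distribute the per-stage butterflies and twiddle-factor tables across hypercube nodes.

```python
import numpy as np

def fft_radix2(x):
    """Iterative radix-2 FFT (sequential sketch).

    Each of the log2(n) stages below is a set of independent
    butterflies; parallel hypercube schemes distribute these
    butterflies (and tables of twiddle factors) across nodes.
    """
    n = len(x)
    assert n and (n & (n - 1)) == 0, "length must be a power of two"
    # Bit-reversal permutation puts the input in butterfly order.
    bits = n.bit_length() - 1
    j = [int(bin(i)[2:].zfill(bits)[::-1], 2) for i in range(n)]
    a = np.asarray(x, dtype=complex)[j]
    span = 1
    while span < n:
        # Precomputed twiddle factors, as in the table-lookup scheme.
        w = np.exp(-2j * np.pi * np.arange(span) / (2 * span))
        for start in range(0, n, 2 * span):
            top = a[start:start + span].copy()
            bot = a[start + span:start + 2 * span] * w
            a[start:start + span] = top + bot
            a[start + span:start + 2 * span] = top - bot
        span *= 2
    return a

# Example: agrees with NumPy's FFT on a random input.
x = np.random.rand(16)
print(np.allclose(fft_radix2(x), np.fft.fft(x)))
```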
Item Open Access: Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies (SIAM, 2004), Uçar, B.; Aykanat, Cevdet
This paper addresses the problem of one-dimensional partitioning of structurally unsymmetric square and rectangular sparse matrices for parallel matrix-vector and matrix-transpose-vector multiplies. The objective is to minimize the communication cost while maintaining the balance on the computational loads of the processors. Most of the existing partitioning models consider only the total message volume, hoping that minimizing this communication-cost metric is likely to reduce the other metrics. However, the total message latency (start-up time) may be more important than the total message volume. Furthermore, the maximum message volume and latency handled by a single processor are also important metrics. We propose a two-phase approach that encapsulates all four of these communication-cost metrics. The objective in the first phase is to minimize the total message volume while maintaining the computational-load balance. The objective in the second phase is to encapsulate the remaining three communication-cost metrics. We propose communication-hypergraph and partitioning models for the second phase. We then present several methods for partitioning communication hypergraphs. Experiments on a wide range of test matrices show that the proposed approach yields very effective partitioning results. A parallel implementation on a PC cluster verifies that the theoretical improvements shown by the partitioning results hold in practice.

Item Open Access: Hypergraph models for parallel sparse matrix-matrix multiplication (2015-09), Akbudak, Kadir
Multiplication of two sparse matrices (i.e., sparse matrix-matrix multiplication, abbreviated as SpGEMM) is a widely used kernel in many applications such as molecular dynamics simulations, graph operations, and linear programming. We identify parallel formulations of the SpGEMM operation in the form C = AB for distributed-memory architectures. Using these formulations, we propose parallel SpGEMM algorithms that have multiplication and communication phases: the multiplication phase consists of local SpGEMM computations without any communication, and the communication phase consists of transferring the required input/output matrices. For these algorithms, three hypergraph models are proposed. These models are used to partition the input and output matrices simultaneously. The input matrices A and B are partitioned in one dimension in all of these hypergraph models. The output matrix C is partitioned in two dimensions, in a nonzero-based manner, in the first hypergraph model, and in one dimension in the second and third models. In partitioning these hypergraph models, the constraint on vertex weights corresponds to computational load balancing among processors for the multiplication phase of the proposed SpGEMM algorithms, and the objective of minimizing the cutsize, defined in terms of the costs of the cut hyperedges, corresponds to minimizing the communication volume due to transferring the required matrix entries in the communication phase of the SpGEMM algorithms. We also propose models for reducing the total number of messages while maintaining balance on the communication volumes handled by processors during the communication phase of the SpGEMM algorithms.
An SpGEMM library for distributed-memory architectures is developed in order to verify the empirical validity of our models. The library uses MPI (Message Passing Interface) for performing communication in the parallel setting. The developed SpGEMM library is run on SpGEMM instances from various realistic applications, and the experiments are carried out on a large parallel IBM BlueGene/Q system, named JUQUEEN. In the experiments with the proposed hypergraph models, high speedup values are observed.
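To make the multiplication phase of these SpGEMM algorithms concrete, here is a minimal Gustavson-style row-by-row kernel on a dictionary-of-rows layout; the layout, and the assumption that all needed rows of B are already local (i.e., the communication phase has completed), are ours, not code from the thesis.

```python
from collections import defaultdict

def local_spgemm(A_rows, B_rows):
    """Row-by-row sparse C = A * B on dictionary-of-rows storage.

    A_rows[i] maps column k to a_ik; B_rows[k] maps column j to b_kj.
    In the parallel algorithms, each processor runs this kernel on its
    own rows of A after the communication phase has fetched the needed
    rows of B.
    """
    C_rows = {}
    for i, a_row in A_rows.items():
        acc = defaultdict(float)
        for k, a_ik in a_row.items():
            for j, b_kj in B_rows[k].items():
                acc[j] += a_ik * b_kj   # accumulate partial products
        C_rows[i] = dict(acc)
    return C_rows

# Tiny example: a 2x2 sparse multiply.
A = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
B = {0: {0: 4.0}, 1: {0: 1.0, 1: 5.0}}
print(local_spgemm(A, B))  # {0: {0: 6.0, 1: 10.0}, 1: {0: 3.0, 1: 15.0}}
```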
Item Open Access: Large-scale solutions of electromagnetics problems using the multilevel fast multipole algorithm and physical optics (2015-04), Hidayetoğlu, Mert
Integral equations provide full-wave (accurate) solutions of Helmholtz-type electromagnetics problems. The multilevel fast multipole algorithm (MLFMA) discretizes the equations and solves them numerically with O(N log N) complexity, where N is the number of unknowns. For solving large-scale problems, MLFMA is parallelized on distributed-memory architectures. Despite the low complexity and the parallelization, the computational requirements of MLFMA solutions grow immensely, in terms of CPU time and memory, when extremely large geometries (in wavelengths) are involved. This thesis provides computational and theoretical techniques for solving large-scale electromagnetics problems with lower computational requirements. One technique is an out-of-core implementation that reduces the required memory by employing disk space for storing large data. Additionally, a pre-processing parallelization strategy, which eliminates memory bottlenecks, is presented. Another technique, MPI+OpenMP parallelization, uses distributed-memory and shared-memory schemes together in order to maintain parallelization efficiency with a high number of processes/threads. The thesis also combines the out-of-core implementation with the MPI+OpenMP parallelization. With the applied techniques, full-wave solutions involving up to 1.3 billion unknowns are achieved with 2 TB of memory. Physical optics is a high-frequency approximation which provides fast solutions of scattering problems with O(N) complexity. A parallel physical optics algorithm is presented in order to achieve fast approximate solutions. Finally, a hybrid integral-equation and physical-optics solution methodology is introduced.

Item Open Access: Latency-centric models and methods for scaling sparse operations (2016-07), Selvitopi, Oğuz
Parallelization of sparse kernels and operations on large-scale distributed-memory systems remains a major challenge, due to the ever-increasing scale of modern high-performance computing systems and multiple conflicting factors that affect parallel performance. The low computational density and high memory footprint of sparse operations add to these challenges by implying more stressed communication bottlenecks, and they make fast and efficient parallelization models and methods imperative for scalable performance. Sparse operations are usually performed with structures related to sparse matrices, and the matrices are partitioned prior to execution in order to distribute the computations among processors. Although the literature is rich in this respect, it still lacks techniques that embrace the multiple factors affecting communication performance in a complete and fair manner. In this thesis, we investigate models and methods for intelligent partitioning of sparse matrices that strive for a more accurate approximation of communication performance.
To improve the communication performance of parallel sparse operations, we mainly focus on reducing latency bottlenecks, which stand as a major component of the overall communication cost. Besides these, our approaches also consider the communication cost metrics already adopted in the literature, and they aim to address as many cost metrics as possible. We propose one-phase and two-phase partitioning models to reduce the latency cost in one-dimensional (1D) and two-dimensional (2D) sparse matrix partitioning, respectively. The model for 1D partitioning relies on the commonly adopted recursive bipartitioning framework and uses novel structures to capture the relations that incur latency. The models for 2D partitioning aim to improve the performance of solvers for nonsymmetric linear systems by using different partitions for the vectors in the solver, and they use that flexibility to reduce the latency cost. Our findings indicate that latency costs should definitely be considered in order to achieve scalable performance on distributed-memory systems.
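As a concrete reading of the bandwidth (volume) and latency (message-count) metrics discussed in the last two items, the sketch below tallies both for row-parallel SpMV under a given row partition; the CSR traversal and the conformal vector partition are our own assumptions for the illustration.

```python
import numpy as np
from scipy.sparse import random as sprandom

def spmv_comm_metrics(A_csr, row_part):
    """Latency and volume metrics for row-parallel y = A x.

    row_part[i] is the processor owning row i of A (and x[i], y[i],
    assuming a conformal vector partition). Processor p must receive
    x[j] from its owner q != p whenever p holds a nonzero in column j.
    """
    n = A_csr.shape[0]
    K = row_part.max() + 1
    needs = [set() for _ in range(K)]   # remote x-entries per receiver
    msgs = set()                        # distinct (sender, receiver) pairs
    indptr, indices = A_csr.indptr, A_csr.indices
    for i in range(n):
        p = row_part[i]
        for j in indices[indptr[i]:indptr[i + 1]]:
            q = row_part[j]
            if q != p:
                needs[p].add(j)
                msgs.add((q, p))
    total_volume = sum(len(s) for s in needs)   # words communicated
    total_messages = len(msgs)                  # start-up (latency) count
    return total_volume, total_messages

A = sprandom(100, 100, density=0.05, format="csr", random_state=0)
part = np.arange(100) % 4                       # a naive cyclic partition
print(spmv_comm_metrics(A, part))
```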
Item Unknown: Matrix factorization with stochastic gradient descent for recommender systems (2019-02), Aktulum, Ömer Faruk
Matrix factorization is an efficient technique used for disclosing the latent features of real-world data. It finds application in areas such as text mining, image analysis, social networks and, more recently and popularly, recommendation systems. Alternating Least Squares (ALS), Stochastic Gradient Descent (SGD) and Coordinate Descent (CD) are among the methods commonly used for factorizing large matrices. SGD-based factorization has proven to be the most successful among these methods after the Netflix and KDD Cup competitions, where the winners' algorithms relied on SGD-based methods. Parallelization of SGD then became a hot topic and has been studied extensively in the literature in recent years. We focus on parallel SGD algorithms developed for shared-memory and distributed-memory systems. Shared-memory parallelizations include works such as HogWild, FPSGD and MLGF-MF, and distributed-memory parallelizations include works such as DSGD, GASGD and NOMAD. We present a survey containing an exhaustive analysis of these studies, and then focus in particular on DSGD, implementing it through the message-passing paradigm and testing its performance in terms of convergence and speedup. In contrast to existing works, many real-world datasets, which we produce from published raw data, are used in the experiments. We show that DSGD is a robust algorithm for large-scale datasets and achieves near-linear speedup with fast convergence rates.

Item Unknown: Model-driven transformations for mapping parallel algorithms on parallel computing platforms (MDHPCL, 2013), Arkin, E.; Tekinerdoğan, Bedir
One of the important problems in parallel computing is the mapping of a parallel algorithm to a parallel computing platform. Hereby, the corresponding code must be implemented for each parallel node. For platforms with a limited number of processing nodes this can be done manually. However, when the parallel computing platform consists of hundreds of thousands of processing nodes, manual coding of the parallel algorithms becomes intractable and error-prone. Moreover, a change of the parallel computing platform requires considerable coding effort and time. In this paper we present a model-driven approach for generating the code of selected parallel algorithms to be mapped on parallel computing platforms. We describe the required platform-independent metamodel, and the model-to-model and model-to-text transformation patterns. We illustrate our approach for the parallel matrix multiplication algorithm. Copyright © 2013 for the individual papers by the papers' authors.
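Returning to the matrix-factorization item above: the core SGD update on rating triples is compact enough to sketch. The learning rate, regularization, and data layout here are our own illustrative choices; parallel variants such as DSGD and HogWild differ mainly in how these updates are distributed and synchronized.

```python
import numpy as np

def sgd_mf(triples, n_users, n_items, rank=8, lr=0.05, reg=0.01, epochs=200):
    """Matrix factorization R ~= U V^T by stochastic gradient descent.

    For each observed rating (u, i, r), take a gradient step on the
    regularized squared error of the prediction U[u] . V[i].
    """
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    for _ in range(epochs):
        rng.shuffle(triples)                  # visit ratings in random order
        for u, i, r in triples:
            err = r - U[u] @ V[i]
            u_row = U[u].copy()               # use old U[u] in V's update
            U[u] += lr * (err * V[i] - reg * u_row)
            V[i] += lr * (err * u_row - reg * V[i])
    return U, V

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]
U, V = sgd_mf(ratings, n_users=2, n_items=3)
print(round(float(U[0] @ V[0]), 2))  # prediction for (0, 0); approaches 5.0
```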
Item Unknown: One-dimensional partitioning for heterogeneous systems: theory and practice (Academic Press, 2008-11), Pınar, A.; Tabak, E. K.; Aykanat, Cevdet
We study the problem of one-dimensional partitioning of nonuniform workload arrays, with optimal load balancing, for heterogeneous systems. We look at two cases: chain-on-chain partitioning, where the order of the processors is specified, and chain partitioning, where processor permutation is allowed. We present polynomial-time algorithms that solve the chain-on-chain partitioning problem optimally, and we prove that the chain partitioning problem is NP-complete. Our empirical studies show that the proposed exact algorithms produce substantially better results than heuristics, while solution times remain comparable. © 2008 Elsevier Inc. All rights reserved.

Item Unknown: A parallel boundary element formulation for tracking multiple particle trajectories in Stokes flow for microfluidic applications (Tech Science Press, 2015), Karakaya, Z.; Baranoğlu, B.; Çetin, B.; Yazici, A.
A new formulation for tracking multiple particles in slow viscous flow for microfluidic applications is presented. The method manipulates the boundary element matrices so that, finally, a system of equations is obtained that relates the rigid-body velocities of a particle to the forces applied on the particle. The formulation is specially designed for particle trajectory tracking and involves successive matrix multiplications, for which SMP (symmetric multiprocessing) parallelization is applied. It is observed that the present formulation offers an efficient numerical model for particle tracking and can easily be extended to multiphysics simulations in which several physical phenomena are involved. Copyright © 2015 Tech Science Press.

Item Unknown: Parallel image restoration using surrogate constraint methods (Academic Press, 2007), Uçar, B.; Aykanat, Cevdet; Pınar, M. Ç.; Malas, T.
When formulated as a system of linear inequalities, the image restoration problem yields huge, unstructured, sparse matrices even for images of small size. To solve the image restoration problem, we use surrogate constraint methods, which can work efficiently for large problems. Among the variants of the surrogate constraint method, we consider a basic method that performs a single block projection in each step, and a coarse-grain parallel version that makes simultaneous block projections. Using several state-of-the-art partitioning strategies and adopting different communication models, we develop competing parallel implementations of the two methods. The implementations are evaluated based on per-iteration performance and on overall performance. The experimental results on a PC cluster reveal that the proposed parallelization schemes are quite beneficial.

Item Open Access: Parallel stochastic gradient descent with sub-iterations on distributed memory systems (2022-02), Çağlayan, Orhun
We investigate parallelization of the stochastic gradient descent (SGD) algorithm for solving the matrix completion problem. Applications in the literature show that stale data usage and communication costs are important concerns that affect the performance of parallel SGD applications. We first briefly visit the stochastic gradient descent algorithm and matrix partitioning for parallel SGD. Then we define the stale data problem and the communication costs. In order to improve the performance of parallel SGD, we propose a new algorithm with intra-iteration synchronization (referred to as sub-iterations) to decrease communication costs and stale data usage. Experimental results show that using sub-iterations can decrease staleness by up to 95% and communication volume by up to 47%. Furthermore, using sub-iterations can improve test error by up to 60% when compared to a conventional parallel SGD implementation that does not use sub-iterations.
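To make the chain-on-chain case of the one-dimensional partitioning item above concrete, here is a small probe-and-bisection sketch for heterogeneous processors. Bisection over the bottleneck value is a generic simplification of ours, not the paper's exact algorithms.

```python
import numpy as np

def probe(w, speeds, B):
    """Can the ordered tasks w be split into len(speeds) consecutive
    chunks so that chunk p finishes within time B on a processor of the
    given speed? Greedy: give each processor the longest prefix that fits."""
    i, n = 0, len(w)
    for s in speeds:
        cap = B * s                      # work this processor can do in time B
        total = 0.0
        while i < n and total + w[i] <= cap:
            total += w[i]
            i += 1
    return i == n

def ccp_bisect(w, speeds, tol=1e-6):
    """Approximate the optimal bottleneck for chain-on-chain partitioning
    by bisection on the bottleneck value B."""
    lo = max(w) / max(speeds)            # the largest task bounds B from below
    hi = sum(w) / min(speeds)            # one processor taking everything
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if probe(w, speeds, mid) else (mid, hi)
    return hi

w = np.array([4.0, 1.0, 3.0, 2.0, 5.0, 2.0])
print(round(ccp_bisect(w, speeds=[1.0, 2.0, 1.0]), 3))  # optimal bottleneck 5.0
```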
Item Open Access: Parallelization of sparse matrix kernels for big data applications (Springer, 2016), Selvitopu, Oğuz; Akbudak, Kadir; Aykanat, Cevdet; Pop, F.; Kołodziej, J.; Di Martino, B.
Analysis of big data on large-scale distributed systems often necessitates efficient parallel graph algorithms that are used to explore the relationships between individual components. Graph algorithms use the basic adjacency-list representation for graphs, which can also be viewed as a sparse matrix. This correspondence between the representation of graphs and sparse matrices makes it possible to express many important graph algorithms in terms of basic sparse matrix operations, where the literature for optimization is more mature. For example, graph analytics libraries such as Pegasus and Combinatorial BLAS use sparse matrix kernels for a wide variety of operations on graphs. In this work, we focus on two such important sparse matrix kernels: sparse matrix-sparse matrix multiplication (SpGEMM) and sparse matrix-dense matrix multiplication (SpMM). We propose partitioning models for efficient parallelization of these kernels on large-scale distributed systems. Our models aim at reducing communication volume while balancing computational load, which are two vital performance metrics on distributed systems. We show that by exploiting the sparsity patterns of the matrices through our models, the parallel performance of SpGEMM and SpMM operations can be significantly improved.

Item Open Access: ParPaToH: A 2D-parallel hypergraph partitioning tool (2006), Karaca, Evren
Hypergraph partitioning is a process used to find solutions for optimization problems in various areas, including parallel volume rendering, parallel information retrieval and VLSI circuit design. While current partitioning methods are adequate for hypergraphs up to a certain size, these methods start to fail once the problem size exceeds this threshold. In this thesis we introduce ParPaToH, a parallel p-way hypergraph partitioning tool that makes use of a 2D decomposition to reduce the communication overhead and implements a parallel-computing-friendly version of the accepted multi-level partitioning paradigm to generate its partitionings. We present new concepts in hypergraph partitioning that lead to a coarse-grain parallel solution. Finally, we discuss the implementation of the tool in detail and present experimental results to demonstrate its effectiveness.

Item Open Access: Partitioning sparse matrices for parallel preconditioned iterative methods (Society for Industrial and Applied Mathematics, 2007), Uçar, B.; Aykanat, Cevdet
This paper addresses the parallelization of preconditioned iterative methods that use explicit preconditioners such as approximate inverses. Parallelizing a full step of these methods requires the coefficient and preconditioner matrices to be well partitioned. We first show that different methods impose different partitioning requirements on the matrices. Then we develop hypergraph models to meet those requirements. In particular, we develop models that enable us to obtain partitionings of the coefficient and preconditioner matrices simultaneously. Experiments on a set of unsymmetric sparse matrices show that the proposed models yield effective partitioning results. A parallel implementation of the right-preconditioned BiCGStab method on a PC cluster verifies that the theoretical gains obtained by the models hold in practice. © 2007 Society for Industrial and Applied Mathematics.
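As a tiny, generic illustration of the preconditioned BiCGStab solves that the last item parallelizes, the sketch below uses SciPy with an incomplete-LU factorization standing in for an explicit approximate-inverse preconditioner; the system and all parameters are made up for the example, and this is a sequential sketch, not the paper's parallel implementation.

```python
import numpy as np
from scipy.sparse import random as sprandom, identity
from scipy.sparse.linalg import bicgstab, spilu, LinearOperator

# A random nonsymmetric sparse system, diagonally shifted so it is
# comfortably nonsingular for this illustration.
n = 200
A = (sprandom(n, n, density=0.02, random_state=1) + 4 * identity(n)).tocsc()
b = np.ones(n)

# Incomplete-LU factors act as an approximate inverse M ~ A^{-1},
# applied through a LinearOperator.
ilu = spilu(A)
M = LinearOperator((n, n), matvec=ilu.solve)

x, info = bicgstab(A, b, M=M)
print(info, np.linalg.norm(A @ x - b))  # info == 0 means convergence
```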
Item Open Access: Progressive refinement radiosity on ring-connected multicomputers (ACM, 1993-10), Çapın, Tolga K.; Aykanat, Cevdet; Özgüç, Bülent
The progressive refinement method is investigated for parallelization on ring-connected multicomputers. A synchronous scheme, based on static task assignment, is proposed in order to achieve better coherence during the parallel light distribution computations. An efficient global circulation scheme is proposed for the parallel light distribution computations, which reduces the total volume of concurrent communication by an asymptotic factor. The proposed parallel algorithm is implemented on a ring-embedded Intel iPSC/2 hypercube multicomputer. The load-balance quality of the proposed static assignment schemes is evaluated experimentally. The effect of coherence in the parallel light distribution computations on the shooting-patch selection sequence is also investigated.

Item Open Access: A recursive graph bipartitioning algorithm by vertex separators with fixed vertices for permuting sparse matrices into block diagonal form with overlap (2011), Acer, Seher
Solving a sparse system of linear equations Ax = b using preconditioners can be efficiently parallelized using graph partitioning tools. In this thesis, we investigate the problem of permuting a sparse matrix into a block diagonal form with overlap, which is to be used in the parallelization of the multiplicative Schwarz preconditioner. A matrix is said to be in block diagonal form with overlap if the diagonal blocks may overlap. In order to formulate this permutation problem as a graph-theoretical problem, we introduce a restricted version of the graph partitioning by vertex separator (GPVS) problem, where the objective is to find a vertex partition whose parts are only connected by a vertex separator. The modified problem, which we refer to as the ordered GPVS (oGPVS) problem, is restricted such that the parts should exhibit an ordered form in which consecutive parts can only be connected by a separator. Existing graph partitioning tools are unable to solve the oGPVS problem. Thus, we present a recursive graph bipartitioning algorithm by vertex separators, together with a novel vertex fixation scheme, so that a GPVS tool supporting fixed vertices can be utilized effectively and efficiently. We also theoretically verify the correctness of the proposed approach by devising a necessary and sufficient condition for the feasibility of an oGPVS solution. Experimental results on a wide range of matrices confirm the validity of the proposed approach.
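The recursive-bipartitioning scheme behind the last item can be sketched generically: bisect the vertex set, then recurse on each side, so the parts come out in left-to-right order, matching the ordered structure the oGPVS formulation requires. The `bipartition` callback below is a hypothetical stand-in for a real GPVS tool with fixed-vertex support, and the driver is our own conceptual illustration.

```python
def recursive_bipartition(vertices, k, bipartition):
    """Split `vertices` into k ordered parts by recursive bisection.

    `bipartition(vertices, k_left, k_right)` stands in for a real
    partitioner and must return two vertex lists. Parts come back in
    left-to-right order.
    """
    if k == 1:
        return [vertices]
    k_left = k // 2
    k_right = k - k_left
    left, right = bipartition(vertices, k_left, k_right)
    return (recursive_bipartition(left, k_left, bipartition)
            + recursive_bipartition(right, k_right, bipartition))

# Toy stand-in partitioner: split proportionally to the target part counts.
def naive_split(vs, kl, kr):
    cut = len(vs) * kl // (kl + kr)
    return vs[:cut], vs[cut:]

print(recursive_bipartition(list(range(10)), 4, naive_split))
# [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
```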
Item Open Access: Reducing communication overhead in sparse matrix and tensor computations (2020-08), Karsavuran, Mustafa Ozan
Encapsulating multiple communication cost metrics, i.e., bandwidth and latency, is proven to be important in reducing communication overhead in the parallelization of sparse and irregular applications. The communication hypergraph model was proposed in a two-phase setting for encapsulating multiple communication cost metrics. The reduce-communication hypergraph model suffers from failing to correctly encapsulate send-volume balancing. We propose a novel vertex weighting scheme that enables part weights to correctly encode the send-volume loads of processors for send-volume balancing. The model also suffers from increasing the total communication volume during partitioning. To alleviate this, we propose a method that utilizes the recursive bipartitioning (RB) paradigm and refines each bipartition by vertex swaps. For performance evaluation, we consider column-parallel SpMV, which is one of the most widely known applications in which the reduce-task assignment problem arises. Extensive experiments on 313 matrices show that, compared to the existing model, the proposed models achieve considerable improvements in all communication cost metrics. These improvements lead to an average decrease of 30 percent in parallel SpMV time on 512 processors for 70 matrices with high irregularity.
We further enhance the reduce-communication hypergraph model so that it also encapsulates the minimization of the maximum number of messages sent by a processor. For this purpose, we propose a novel cutsize metric, which we realize using the RB paradigm while partitioning the reduce-communication hypergraph. We also introduce a new type of net for the communication hypergraph, which models decreasing the increase in the total communication volume directly within the partitioning objective. Experiments on 300 matrices show that the proposed models achieve considerable improvements in communication cost metrics, which lead to better column-parallel SpMM times on 1024 processors.
Finally, we propose a hypergraph model for general medium-grain sparse tensor partitioning which does not enforce any topological constraint on the partitioning. The proposed model is based on splitting the given tensor into nonzero-disjoint component tensors. Then a mode-dependent coarse-grain hypergraph is constructed for each component tensor. A net amalgamation operation is proposed to form a composite medium-grain hypergraph from these mode-dependent coarse-grain hypergraphs in order to correctly encapsulate the minimization of the communication volume. We propose a heuristic which splits the nonzeros of dense slices to obtain sparse slices in the component tensors, and we utilize the well-known RB paradigm to improve the quality of the splitting heuristic. We also propose a medium-grain tripartite graph model, with the aim of faster partitioning at the expense of an increase in the total communication volume. Parallel experiments conducted on 10 real-world tensors on up to 1024 processors confirm the validity of the proposed hypergraph and graph models.
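A minimal way to see the reduce-task assignment problem mentioned above: each reduce task (e.g., a y-entry in column-parallel SpMV) has a set of producer processors, and the task's owner receives one word from every other producer. The greedy rule below is our own illustrative baseline for balancing send volumes, not the hypergraph-partitioning approach of the thesis.

```python
from collections import defaultdict

def assign_reduce_tasks(producers):
    """Greedy reduce-task assignment balancing send volumes.

    producers[t] is the set of processors holding partial results for
    reduce task t. Assigning t to processor p makes every other
    producer send one word to p. Greedy rule: give each task to its
    currently most-loaded producer, sparing that processor one send.
    """
    send = defaultdict(int)    # current send volume per processor
    owner = {}
    # Handle tasks with many producers first; they contribute most volume.
    for t in sorted(producers, key=lambda t: -len(producers[t])):
        p = max(sorted(producers[t]), key=lambda q: send[q])
        owner[t] = p
        for q in producers[t]:
            if q != p:
                send[q] += 1   # q sends its partial result to the owner
    return owner, dict(send)

tasks = {0: {0, 1, 2}, 1: {1, 2}, 2: {0, 2}, 3: {0, 1}}
owner, send = assign_reduce_tasks(tasks)
print(owner, send)
```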
Item Open Access: Reducing communication volume overhead in large-scale parallel SpGEMM (2016-12), Ünsal, Başak
Sparse matrix-matrix multiplication of the forms C = A x B, C = A x A, and C = A x A^T is a key operation in various domains and is characterized by high complexity and runtime overhead. There exist models for parallelizing this operation on distributed-memory architectures, such as outer product (OP), inner product (IP), row-by-row product (RRP) and column-by-column product (CCP). We focus on the row-by-row product due to its convincing performance, low preprocessing overhead, and the absence of a symbolic multiplication requirement. Parallelization via the row-by-row-product model can be achieved using bipartite graphs or hypergraphs. For an efficient parallelization, we can consider multiple volume-based metrics to be reduced, such as total volume and maximum volume. Existing approaches for the RRP model do not encapsulate multiple volume-based metrics. In this thesis, we propose a two-phase approach to reduce multiple volume-based cost metrics. In the first phase, the total volume is reduced with a bipartite graph model. In the second phase, we reduce the maximum volume while trying to keep the increase in total volume as small as possible. Our experiments show that the proposed approach is effective at reducing multiple volume-based metrics for different forms of the SpGEMM operation.
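To illustrate the volume-based metrics that this two-phase approach targets, the sketch below computes total and maximum send volumes for row-by-row-product SpGEMM under a given row partition: the owner of row i of A needs row k of B for every nonzero a_ik, so the owner of B's row k sends it once per requesting processor. The storage format, the conformal partition encoding, and the random instance are our own assumptions.

```python
import numpy as np
from scipy.sparse import random as sprandom

def rrp_volume_metrics(A_csr, B_nnz_per_row, part):
    """Total and maximum send volume for row-by-row-product C = A B.

    part[i] is the processor owning row i of both A and B (a conformal
    partition). If processor p owns a nonzero a_ik and q = part[k] != p,
    then q must send row k of B (B_nnz_per_row[k] words) to p, once.
    """
    K = part.max() + 1
    sends = [set() for _ in range(K)]   # (row of B, destination) pairs
    indptr, indices = A_csr.indptr, A_csr.indices
    for i in range(A_csr.shape[0]):
        p = part[i]
        for k in indices[indptr[i]:indptr[i + 1]]:
            q = part[k]
            if q != p:
                sends[q].add((k, p))    # q ships B's row k to p once
    vols = [sum(B_nnz_per_row[k] for k, _ in s) for s in sends]
    return sum(vols), max(vols)

n = 100
A = sprandom(n, n, density=0.05, format="csr", random_state=2)
B = sprandom(n, n, density=0.05, format="csr", random_state=3)
part = np.arange(n) % 4                 # a naive cyclic partition
B_rownnz = np.diff(B.indptr)            # nonzeros per row of B
print(rrp_volume_metrics(A, B_rownnz, part))
```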