Browsing by Subject "Partitioning"
Now showing 1 - 7 of 7
Item Open Access: Balance preserving min-cut replication set for a K-way hypergraph partitioning (2010). Yazıcı, Volkan.

Replication is a widely used technique in information retrieval and database systems for providing fault tolerance and reducing parallelization and processing costs. Combinatorial models based on hypergraph partitioning have been proposed for various problems arising in information retrieval and database systems. We consider the possibility of using vertex replication to improve the quality of hypergraph partitioning. In this study, we focus on the Balance Preserving Min-Cut Replication Set (BPMCRS) problem, where we are initially given a maximum replication capacity and a K-way hypergraph partition with an initial imbalance ratio. The objective in the BPMCRS problem is to find, for each part of the given partition, a vertex replication set such that the initial cutsize of the partition is improved as much as possible and the initial imbalance is either preserved or reduced under the given replication capacity constraint. To address the BPMCRS problem, we propose a model based on a unique blend of coarsening and integer linear programming (ILP) schemes, where the coarsening algorithm is based on the Dulmage-Mendelsohn decomposition. Experiments show that the ILP formulation coupled with the Dulmage-Mendelsohn decomposition-based coarsening provides high-quality results in feasible execution times for reducing the cost of a given K-way hypergraph partition.

(A toy sketch of the vertex-replication idea follows the next entry.)

Item Open Access: Comparison of partitioning techniques for two-level iterative solvers on large, sparse Markov chains (SIAM, 2000). Dayar, T.; Stewart, W. J.

Experimental results for large, sparse Markov chains, especially the ill-conditioned nearly completely decomposable (NCD) ones, are few. We believe there is need for further research in this area, specifically to aid in the understanding of the effects of the degree of coupling of NCD Markov chains and their nonzero structure on the convergence characteristics and space requirements of iterative solvers. The work of several researchers has raised the following questions that led to research in a related direction: How must one go about partitioning the global coefficient matrix into blocks when the system is NCD and a two-level iterative solver (such as block SOR) is to be employed? Are block partitionings dictated by the NCD form of the stochastic one-step transition probability matrix necessarily superior to others? Is it worth investigating alternative partitionings? Better yet, for a fixed labeling and partitioning of the states, how does the performance of block SOR (or even that of point SOR) compare to the performance of the iterative aggregation-disaggregation (IAD) algorithm? Finally, is there any merit in using two-level iterative solvers when preconditioned Krylov subspace methods are available? We seek answers to these questions on a test suite of 13 Markov chains arising in 7 applications.

(A generic block SOR sweep is sketched below.)
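The vertex-replication idea behind the BPMCRS entry above can be illustrated with a toy sketch. This is not the thesis's ILP-plus-coarsening model; the tiny hypergraph, the part assignments, and the plain cut-net cutsize metric are all illustrative assumptions:

```python
# Toy 2-way hypergraph partition with replication: a replicated vertex
# belongs to several parts at once, and a net is uncut iff some single
# part covers all of its pins.

def cutsize(nets, parts):
    cut = 0
    for net in nets:
        covered = set.intersection(*(parts[v] for v in net))
        if not covered:          # no single part sees every pin of the net
            cut += 1
    return cut

nets = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
parts = {"a": {0}, "b": {0}, "c": {1}, "d": {1}, "e": {1}}
print(cutsize(nets, parts))      # 1: net ("b", "c") straddles the cut

parts["b"].add(1)                # replicate b into part 1
print(cutsize(nets, parts))      # 0: replication removed the cut net
```

The BPMCRS problem adds exactly what the toy ignores: a replication capacity and the requirement that part weights stay balanced.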
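For the Markov-chain entry above, one of the two-level solvers being compared, block SOR, can be sketched generically. This is a minimal dense-matrix version on a made-up diagonally dominant system; the paper's setting involves large sparse chains and steady-state equations, so the matrix, blocks, and relaxation parameter here are purely illustrative:

```python
import numpy as np

def block_sor(A, b, blocks, omega=1.0, sweeps=200):
    """One possible block SOR solver for A x = b. `blocks` is a list of
    index arrays partitioning the unknowns; using the current x when
    forming each block residual gives the Gauss-Seidel sweep order."""
    x = np.zeros_like(b, dtype=float)
    for _ in range(sweeps):
        for idx in blocks:
            r = b[idx] - A[idx, :] @ x                       # block residual
            x[idx] += omega * np.linalg.solve(A[np.ix_(idx, idx)], r)
    return x

# two strongly coupled 2x2 blocks with a weak off-block coupling,
# mimicking (very loosely) a nearly decomposable structure
A = np.array([[ 4.0, -1.0, -0.01,  0.0 ],
              [-1.0,  4.0,  0.0,  -0.01],
              [-0.01, 0.0,  4.0,  -1.0 ],
              [ 0.0, -0.01, -1.0,  4.0 ]])
b = np.ones(4)
blocks = [np.array([0, 1]), np.array([2, 3])]
print(block_sor(A, b, blocks, omega=1.1))
```

How the blocks are chosen, which is the paper's central question, matters far more on ill-conditioned NCD chains than this toy suggests.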
Item Open Access: Parafac-spark: parallel tensor decompositions on spark (2019-08). Bekçe, Selim Eren.

Tensors are higher-order generalizations of matrices, widely used in many data science applications and scientific disciplines. The Canonical Polyadic Decomposition (also known as CPD/PARAFAC) is a widely adopted tensor factorization for discovering and extracting latent features of tensors, usually computed via the alternating least squares (ALS) method. Developing efficient parallelization methods for PARAFAC on commodity clusters is important because, as common tensor sizes reach billions of nonzeros, a naive implementation would require infeasibly large intermediate memory. Implementations of PARAFAC-ALS on shared- and distributed-memory systems are available, but these systems require expensive cluster setups, are too low level, are not compatible with modern tooling, and are not fault tolerant by design. Many companies and data science communities prefer Apache Spark, a modern distributed computing framework with in-memory caching, and the Hadoop ecosystem of tools for their ease of use, compatibility, ability to run on commodity hardware, and fault tolerance. We developed PARAFAC-SPARK, an efficient, parallel, open-source implementation of PARAFAC on Spark, written in Scala. It can decompose 3D tensors stored in the common coordinate format in parallel with a low memory footprint by partitioning them as grids and utilizing the compressed sparse rows (CSR) format for efficient traversals. We followed and combined many of the algorithmic and methodological improvements of its predecessor implementations on Hadoop and distributed memory, and adapted them for Spark. In the kernel MTTKRP operation, by applying a multi-way dynamic partitioning scheme, we were also able to increase the number of reducers to be on par with the number of cores, achieving better utilization and a reduced memory footprint. We ran PARAFAC-SPARK on real-world tensors and evaluated the effectiveness of each improvement as a series of variants compared with each other, as well as on synthetically generated tensors of up to billions of rows to measure its scalability. Our fastest variant (PS-CSRSX) is up to 67% faster than our baseline Spark implementation (PS-COO) and up to 10 times faster than the state-of-the-art Hadoop implementations.

(A serial sketch of the MTTKRP kernel follows the next entry.)

Item Open Access: Parallel hardware and software implementations for electromagnetic computations (2005). Bozbulut, Ali Rıza.

The multilevel fast multipole algorithm (MLFMA) is an accurate frequency-domain electromagnetics solver that significantly reduces computational complexity and memory requirements. Despite the advantages of the MLFMA, the maximum size of an electromagnetic problem that can be solved on a single-processor computer is still limited by the hardware resources of the system, i.e., memory and processor speed. To go beyond the hardware limitations of single-processor systems, the MLFMA must be parallelized, which is not a trivial task and requires parallel implementations of both hardware and software. For this purpose, we constructed our own parallel computer clusters and parallelized our MLFMA program using the message-passing paradigm to solve electromagnetics problems. To balance the work load and memory requirements over the processors of multiprocessor systems, efficient load-balancing techniques and algorithms are included in this parallel code. As a result, we can solve large-scale electromagnetics problems accurately and rapidly with the parallel MLFMA solver on parallel clusters.

(A generic load-balancing heuristic is sketched below.)
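The kernel MTTKRP operation named in the PARAFAC-SPARK entry above reduces, per nonzero, to one rank-sized row update. A serial coordinate-format version is sketched below; the thesis's contribution lies in how these updates are partitioned into grids, stored in CSR, and reduced across Spark workers, none of which appears here:

```python
import numpy as np

def mttkrp_mode0(coords, vals, B, C, n_rows, rank):
    """Mode-0 MTTKRP of a sparse 3D tensor in coordinate format:
    each nonzero X[i,j,k] = v adds v * (B[j,:] * C[k,:]) to row i."""
    M = np.zeros((n_rows, rank))
    for (i, j, k), v in zip(coords, vals):
        M[i] += v * B[j] * C[k]        # elementwise (Hadamard) row product
    return M

rank = 2
coords = [(0, 0, 0), (1, 2, 1), (0, 1, 1)]   # nonzeros of a 2x3x2 tensor
vals = [1.0, 2.0, 0.5]
B = np.random.rand(3, rank)                   # mode-1 factor matrix
C = np.random.rand(2, rank)                   # mode-2 factor matrix
print(mttkrp_mode0(coords, vals, B, C, n_rows=2, rank=rank))
```

In ALS, this product is the dominant cost of each factor update, which is why the thesis's partitioning effort concentrates on it.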
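The load balancing mentioned in the MLFMA entry above is, in its simplest generic form, an assignment of weighted work items to processors. The longest-processing-time heuristic below only fixes the idea; the thesis's schemes are tailored to MLFMA's multilevel cluster tree and are not reproduced here:

```python
import heapq

def lpt_assign(task_loads, n_procs):
    """Greedy longest-processing-time balancing: give each task,
    heaviest first, to the currently least loaded processor."""
    heap = [(0.0, p, []) for p in range(n_procs)]   # (load, proc id, tasks)
    heapq.heapify(heap)
    for load in sorted(task_loads, reverse=True):
        total, p, tasks = heapq.heappop(heap)
        tasks.append(load)
        heapq.heappush(heap, (total + load, p, tasks))
    return sorted(heap)

print(lpt_assign([5, 3, 3, 2, 2, 1], 2))   # both processors end up with load 8
```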
Item Open Access: A Recursive Hypergraph Bipartitioning Framework for Reducing Bandwidth and Latency Costs Simultaneously (IEEE Computer Society, 2017). Selvitopi, O.; Acer, S.; Aykanat, Cevdet.

Intelligent partitioning models are commonly used for efficient parallelization of irregular applications on distributed systems. These models usually aim to minimize a single communication cost metric, related to either communication volume or message count. However, both volume- and message-related metrics should be taken into account during partitioning for a more efficient parallelization. There are only a few works that consider both, and they usually address each metric in a separate phase of a two-phase approach. In this work, we propose a recursive hypergraph bipartitioning framework that reduces the total volume and total message count in a single phase. In this framework, the standard hypergraph models, whose nets already capture the bandwidth cost, are augmented with message nets. The message nets encode the message count so that minimizing the conventional cutsize captures the minimization of bandwidth and latency costs together. Our model provides a more accurate representation of the overall communication cost by incorporating both the bandwidth and the latency components into the partitioning objective. The use of the widely adopted and successful recursive bipartitioning framework provides the flexibility of using any existing hypergraph partitioner. Experiments on instances from different domains show that, on average, our model achieves up to 52 percent reduction in total message count and hence a 29 percent reduction in parallel running time compared to the model that considers only the total volume. © 2016 IEEE.

(A skeleton of the recursive bipartitioning driver follows the next entry.)

Item Open Access: Reducing communication volume overhead in large-scale parallel SpGEMM (2016-12). Ünsal, Başak.

Sparse matrix-matrix multiplication (SpGEMM) of the forms C = A x B, C = A x A, and C = A x A^T is a key operation in various domains and is characterized by high complexity and runtime overhead. There exist models for parallelizing this operation on distributed-memory architectures, such as outer product (OP), inner product (IP), row-by-row product (RRP), and column-by-column product (CCP). We focus on the row-by-row product due to its convincing performance, its low row-preprocessing overhead, and the absence of a symbolic multiplication requirement. Parallelization via the row-by-row-product model can be achieved using bipartite graphs or hypergraphs. For an efficient parallelization, multiple volume-based metrics can be considered for reduction, such as total volume and maximum volume. Existing approaches for the RRP model do not encapsulate multiple volume-based metrics. In this thesis, we propose a two-phase approach to reduce multiple volume-based cost metrics. In the first phase, total volume is reduced with a bipartite graph model. In the second phase, we reduce the maximum volume while trying to keep the increase in total volume as small as possible. Our experiments show that the proposed approach is effective at reducing multiple volume-based metrics for different forms of the SpGEMM operation.

(A sketch of these volume metrics is given below.)
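The recursive bipartitioning (RB) framework named in the entry above follows a simple divide-and-conquer skeleton, sketched below with a placeholder two-way partitioner and standard cut-net splitting. The paper's actual contribution, adding message nets to the hypergraph before each bipartitioning step, would plug in where the nets are formed; nothing of that model is shown here:

```python
def recursive_bipartition(vertices, nets, k, bipartition):
    """Split into k parts by repeated two-way partitioning. Cut nets
    are split, each side keeping only the pins that fall in it."""
    if k == 1:
        return [sorted(vertices)]

    left, right = bipartition(vertices, nets)

    def restrict(side):
        sub = []
        for net in nets:
            pins = tuple(v for v in net if v in side)
            if len(pins) > 1:          # single-pin nets can no longer be cut
                sub.append(pins)
        return sub

    return (recursive_bipartition(left, restrict(left), k // 2, bipartition)
            + recursive_bipartition(right, restrict(right), k - k // 2, bipartition))

def naive_split(vertices, nets):       # placeholder: any 2-way partitioner fits
    vs = sorted(vertices)
    return set(vs[:len(vs) // 2]), set(vs[len(vs) // 2:])

print(recursive_bipartition({0, 1, 2, 3, 4, 5, 6, 7},
                            [(0, 1, 2), (2, 3), (4, 5), (6, 7)], 4, naive_split))
```

Because each step is an ordinary hypergraph bipartitioning, any existing partitioner can be dropped in for `naive_split`, which is the flexibility the paper highlights.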
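The volume metrics targeted in the SpGEMM entry above can be computed directly from a candidate row partition. The sketch below assumes the usual conformable setup, in which the owner of row i of A also owns rows i of B and C; it only measures communication, it does not perform the partitioning:

```python
from collections import defaultdict

def send_volumes(a_rows, owner, n_procs):
    """Total and maximum per-processor send volume for row-by-row-parallel
    SpGEMM: the owner of row j of B must send it to every other processor
    whose A-rows have a nonzero in column j. a_rows[i] is the set of
    column indices of row i of A."""
    needers = defaultdict(set)
    for i, cols in enumerate(a_rows):
        for j in cols:
            needers[j].add(owner[i])
    send = [0] * n_procs
    for j, procs in needers.items():
        send[owner[j]] += len(procs - {owner[j]})   # exclude the owner itself
    return sum(send), max(send)

a_rows = [{0, 1}, {1, 2}, {2, 3}, {0, 3}]            # a tiny 4x4 sparsity pattern
print(send_volumes(a_rows, owner=[0, 0, 1, 1], n_procs=2))   # (total, maximum)
```

A two-phase approach in the spirit of the thesis would first choose `owner` to shrink the first number, then refine the assignment to shrink the second without letting the first grow much.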
Item Open Access: VLSI circuit partitioning for simulation and placement (1993-01). Tahboub, Radwan.

The simulation time of Very Large Scale Integrated (VLSI) circuits may be improved substantially by partitioning the circuit into several smaller sub-circuits. Node Splitting (NS) is the underlying basis for partitioning large integrated circuits into several more manageable, and sometimes similar, sub-circuits to enhance computer simulation efficiency. In this thesis, a partitioning scheme based on NS is used to partition VLSI circuits efficiently. The proposed algorithms will be used as a preprocessing step to increase the efficiency of a VLSI analog circuit simulator designed by the EE Department at Bilkent University. With small modifications, the same algorithms are used to form clusters of transistors based on their interconnections. The clustered circuit will then be partitioned using well-known heuristics such as Simulated Annealing and Kernighan-Lin, to be used in VLSI placement. The results with this method have been superior to those with the conventional implementations: we have observed a factor of 3-4 speed-up in CPU time, together with a 5-10% improvement in the cut size. Experimental results show that the proposed algorithms can be efficiently used in VLSI circuit partitioning for simulation and placement.
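The Kernighan-Lin heuristic mentioned in the entry above moves vertices between parts by gain, i.e., external minus internal edge weight. The textbook gain computation is sketched below on a made-up weighted graph; the thesis's node-splitting preprocessing and the full KL pass structure are not shown:

```python
def kl_gain(v, part_of, adj):
    """Gain of moving v to the other side: cut edge weight removed
    (external) minus cut edge weight created (internal).
    adj[v] lists (neighbor, weight) pairs."""
    external = sum(w for u, w in adj[v] if part_of[u] != part_of[v])
    internal = sum(w for u, w in adj[v] if part_of[u] == part_of[v])
    return external - internal

adj = {0: [(1, 1), (2, 1)], 1: [(0, 1), (3, 1)],
       2: [(0, 1), (3, 1)], 3: [(1, 1), (2, 1)]}
part_of = {0: 0, 1: 1, 2: 1, 3: 1}
print({v: kl_gain(v, part_of, adj) for v in adj})   # vertex 0 has gain 2
```

Clustering transistors first, as the thesis does, shrinks the graph such heuristics run on, which is where the reported CPU-time speed-up comes from.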