Browsing by Subject "Parallel processing systems"


Open Access
Adaptive decomposition and remapping algorithms for object-space-parallel direct volume rendering of unstructured grids
(Academic Press, 2007-01) Aykanat, Cevdet; Cambazoglu, B. B.; Findik, F.; Kurc, T.
Object space (OS) parallelization of an efficient direct volume rendering algorithm for unstructured grids on distributed-memory architectures is investigated. The adaptive OS decomposition problem is modeled as a graph partitioning (GP) problem using an efficient and highly accurate estimation scheme for view-dependent node and edge weighting. In the proposed model, minimizing the cutsize corresponds to minimizing the parallelization overhead due to the data communication and redundant computation/storage while maintaining the GP balance constraint corresponds to maintaining the computational load balance in parallel rendering. A GP-based, view-independent cell clustering scheme is introduced to induce more tractable view-dependent computational graphs for successive visualizations. As another contribution, a graph-theoretical remapping model is proposed as a solution to the general remapping problem and is used in minimization of the cell-data migration overhead. The remapping tool RM-MeTiS is developed by modifying the GP tool MeTiS and is used in partitioning the remapping graphs. Experiments are conducted using benchmark datasets on a 28-node PC cluster to evaluate the performance of the proposed models. © 2006 Elsevier Inc. All rights reserved.
Open Access
Comparison of local and global computation and its implications for the role of optical interconnections in future nanoelectronic systems
(Elsevier, 1993) Özaktaş, Haldun M.; Goodman, J. W.
Various methods of simulating diffusion phenomena with parallel hardware are discussed. In particular, methods requiring local and global communication among the processors are compared in terms of total computation time. Systolic convolution on a locally connected array is seen to exhibit an asymptotic advantage over Fourier methods on a globally connected array. Whereas this may translate into a numerical advantage for extremely large numbers of ultrafast devices for two-dimensional systems, this is unlikely for three-dimensional systems. Thus global Fourier methods will be advantageous for three-dimensional systems for foreseeable device speeds and system sizes. The fact that optical interconnections are potentially advantageous for implementing the longer connections of such globally connected systems suggests that they can be beneficially employed in future nanoelectronic computers. Heat removal considerations play an important role in our conclusions.
Open Access
A comparison of logical and physical parallel I/O patterns
(SAGE Publications Inc., 1998) Simitci, H.; Reed, D. A.
Although there are several extant studies of parallel scientific application request patterns, there is little experimental data on the correlation of physical I/O patterns with application I/O stimuli. To understand these correlations, the authors have instrumented the SCSI device drivers of the Intel Paragon OSF/1 operating system to record key physical I/O activities, and have correlated this data with the I/O patterns of scientific applications captured via the Pablo analysis toolkit. This analysis shows that disk hardware features profoundly affect the distribution of request delays and that current parallel file systems respond to parallel application I/O patterns in nonscalable ways.
Open Access
A data-level parallel linear-quadratic penalty algorithm for multicommodity network flows
(Association for Computing Machinery, 1994) Pinar, M. C.; Zenios, S. A.
We describe the development of a data-level, massively parallel software system for the solution of multicommodity network flow problems. Using a smooth linear-quadratic penalty (LQP) algorithm we transform the multicommodity network flow problem into a sequence of independent min-cost network flow subproblems. The solution of these problems is coordinated via a simple, dense, nonlinear master program to obtain a solution that is feasible within some user-specified tolerance to the original multicommodity network flow problem. Particular emphasis is placed on the mapping of both the subproblem and master problem data to the processing elements of a massively parallel computer, the Connection Machine CM-2. As a result of this design we can solve large and sparse optimization problems on current SIMD massively parallel architectures. Details of the implementation are reported, together with summary computational results with a set of test problems drawn from a Military Airlift Command application.
Open Access
Data-parallel web crawling models
(Springer, 2004) Cambazoglu, B. B.; Turk, A.; Aykanat, Cevdet
The need to quickly locate, gather, and store the vast amount of material in the Web necessitates parallel computing. In this paper, we propose two models, based on multi-constraint graph-partitioning, for efficient data-parallel Web crawling. The models aim to balance the amount of data downloaded and stored by each processor as well as balancing the number of page requests made by the processors. The models also minimize the total volume of communication during the link exchange between the processors. To evaluate the performance of the models, experimental results are presented on a sample Web repository containing around 915,000 pages. © Springer-Verlag 2004.
Open Access
Domain specific language for deployment of parallel applications on parallel computing platforms
(Association for Computing Machinery, 2014-08) Arkın, E.; Tekinerdoğan, Bedir
To increase computing performance, the current trend is towards applying parallel computing, in which parallel tasks are executed on multiple nodes. The deployment of tasks on the computing platform usually impacts the overall performance and as such needs to be modelled carefully. In the architecture design community, the deployment viewpoint is an important viewpoint to support this mapping process. In general, the derived deployment views are visual notations that are not amenable to run-time processing and do not scale well for the deployment of large-scale parallel applications. In this paper we propose a domain specific language (DSL) for modeling the deployment of parallel applications and for providing automated support for the deployment process. The DSL is based on a metamodel that is derived after a domain analysis on parallel computing. We illustrate the application of the DSL for a traffic simulation system and provide a set of important scenarios for using the DSL. © 2014 ACM.
Open Access
Effect of inverted index partitioning schemes on performance of query processing in parallel text retrieval systems
(Springer, 2006-11) Cambazoğlu, B. Barla; Çatal, A.; Aykanat, Cevdet
Shared-nothing, parallel text retrieval systems require an inverted index, representing a document collection, to be partitioned among a number of processors. In general, the index can be partitioned based on either the terms or the documents in the collection, and the way the partitioning is done greatly affects the query processing performance of the parallel system. In this work, we investigate the effect of these two index partitioning schemes on query processing. We conduct experiments on a 32-node PC cluster, considering the case where the index is stored entirely on disk. Performance results are reported for a large (30 GB) document collection using an MPI-based parallel query processing implementation. © Springer-Verlag Berlin Heidelberg 2006.
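The two partitioning schemes named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy documents, the round-robin assignment rule, and all function names are assumptions made only for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def partition_by_term(index, num_procs):
    """Term-based: a processor stores the whole posting list of each term it owns."""
    parts = [dict() for _ in range(num_procs)]
    for i, term in enumerate(sorted(index)):
        parts[i % num_procs][term] = index[term]  # round-robin is an assumption
    return parts

def partition_by_document(index, num_procs):
    """Document-based: a processor stores a local index over the documents it owns."""
    parts = [dict() for _ in range(num_procs)]
    for term, postings in index.items():
        for doc_id in postings:
            parts[doc_id % num_procs].setdefault(term, []).append(doc_id)
    return parts

def processors_for_query(parts, query_terms):
    """Processors that hold at least one posting for the query terms."""
    return [p for p, part in enumerate(parts)
            if any(term in part for term in query_terms)]

# Hypothetical four-document collection.
docs = {
    0: "parallel text retrieval",
    1: "parallel query processing",
    2: "inverted index",
    3: "text index",
}
index = build_index(docs)
term_parts = partition_by_term(index, 2)
doc_parts = partition_by_document(index, 2)
```

In this toy setup a query on the single term `text` touches one processor under term-based partitioning but both processors under document-based partitioning, which is the kind of difference in query processing behavior the abstract refers to.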
Open Access
Efficient fast Hartley transform algorithms for hypercube-connected multicomputers
(IEEE, 1995) Aykanat, Cevdet; Derviş, A.
Although the fast Hartley transform (FHT) provides efficient spectral analysis of real discrete signals, the literature that addresses the parallelization of FHT is extremely scarce. FHT is a real transformation and does not necessitate any complex arithmetic. On the other hand, the FHT algorithm has an irregular computational structure which makes efficient parallelization harder. In this paper, we propose an efficient restructuring for the sequential FHT algorithm which brings regularity and symmetry to the computational structure of the FHT. Then, we propose an efficient parallel FHT algorithm for medium-to-coarse grain hypercube multicomputers by introducing a dynamic mapping scheme for the restructured FHT. The proposed parallel algorithm achieves perfect load-balance, minimizes both the number and volume of concurrent communications, allows only nearest-neighbor communications and achieves in-place computation and communication. The proposed algorithm is implemented on a 32-node iPSC/2 hypercube multicomputer. High-efficiency values are obtained even for small size FHT problems. © 1995 IEEE.
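For readers unfamiliar with the transform, the following Python sketch evaluates the discrete Hartley transform directly from its definition, using the real kernel cas(t) = cos(t) + sin(t). It is a quadratic-time illustration of why no complex arithmetic is needed. It is not a fast algorithm and not the restructured or parallel FHT proposed in the paper. The function name is an assumption.

```python
import math

def dht(x):
    """Discrete Hartley transform by direct summation (O(N^2), real arithmetic only)."""
    n = len(x)
    result = []
    for k in range(n):
        total = 0.0
        for j in range(n):
            angle = 2.0 * math.pi * j * k / n
            total += x[j] * (math.cos(angle) + math.sin(angle))  # cas kernel
        result.append(total)
    return result

# The transform is its own inverse up to a factor of N.
signal = [1.0, 2.0, 0.5, -3.0]
recovered = [value / len(signal) for value in dht(dht(signal))]
```

Applying `dht` twice and dividing by the length returns the original signal up to rounding error, so the same routine serves as both the forward and the inverse transform.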
Open Access
Efficient parallel spatial subdivision algorithm for object-based parallel ray tracing
(Pergamon Press, 1994) Aykanat, Cevdet; İşler, V.; Özgüç, B.
Parallel ray tracing of complex scenes on multicomputers requires the distribution of both computation and scene data to the processors. This is carried out during preprocessing and usually consumes too much time and memory. The paper presents an efficient parallel subdivision algorithm that decomposes a given scene into rectangular regions adaptively and maps the resultant regions to the node processors of a multicomputer. The proposed algorithm uses efficient data structures to identify the splitting planes quickly. Furthermore the mapping of the regions and the objects to the node processors is performed while parallel spatial subdivision proceeds. The proposed algorithm is implemented on an Intel iPSC/2 hypercube multicomputer and promising results have been obtained. © 1994.
Open Access
An efficient parallelization technique for high throughput FFT-ASIPs
(IEEE, 2006) Ishebabi, H.; Ascheid, G.; Meyr, H.; Atak, Oğuzhan; Atalar, Abdullah; Arıkan, Erdal
The Fast Fourier Transform (FFT) and its inverse (IFFT) are used in Orthogonal Frequency Division Multiplexing (OFDM) systems for data (de)modulation. The transformations are the kernel tasks in an OFDM implementation, and are the most processing-intensive ones. Recent trends in the electronic consumer market require OFDM implementations to be flexible, making a trade-off between area, energy-efficiency, flexibility and timing a necessity. This has spurred the development of Application-Specific Instruction-Set Processors (ASIPs) for FFT processing. Parallelization is an architectural parameter that significantly influences design goals. This paper presents an analysis of the efficiency of parallelization techniques for an FFT-ASIP. It is shown that existing techniques are inefficient for high throughput applications such as Ultra Wideband (UWB), because of memory bottlenecks. Therefore, an interleaved execution technique which exploits temporal parallelism is proposed. With this technique, it is possible to meet the throughput requirement of UWB (409.6 Msamples/s) with only 4 non-trivial butterfly units for an ASIP that runs at 400 MHz. © 2006 IEEE.
Open Access
Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies
(SIAM, 2004) Uçar, B.; Aykanat, Cevdet
This paper addresses the problem of one-dimensional partitioning of structurally unsymmetric square and rectangular sparse matrices for parallel matrix-vector and matrix-transpose-vector multiplies. The objective is to minimize the communication cost while maintaining the balance on computational loads of processors. Most of the existing partitioning models consider only the total message volume hoping that minimizing this communication-cost metric is likely to reduce other metrics. However, the total message latency (start-up time) may be more important than the total message volume. Furthermore, the maximum message volume and latency handled by a single processor are also important metrics. We propose a two-phase approach that encapsulates all these four communication-cost metrics. The objective in the first phase is to minimize the total message volume while maintaining the computational-load balance. The objective in the second phase is to encapsulate the remaining three communication-cost metrics. We propose communication-hypergraph and partitioning models for the second phase. We then present several methods for partitioning communication hypergraphs. Experiments on a wide range of test matrices show that the proposed approach yields very effective partitioning results. A parallel implementation on a PC cluster verifies that the theoretical improvements shown by partitioning results hold in practice.
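The four communication-cost metrics mentioned in this abstract can be made concrete with a short Python sketch for a row-wise partitioned multiply y = Ax. The counting convention below (one word per x entry needed by a remote processor, one message per sender and receiver pair, maxima taken over senders) and all names are assumptions for illustration; the paper's own definitions may differ in detail.

```python
from collections import defaultdict

def comm_metrics(nonzeros, row_owner, x_owner):
    """Count words and messages needed to deliver x entries to the owners of rows."""
    needed = defaultdict(set)  # (sender, receiver) -> x indices that must be sent
    for i, j in nonzeros:
        sender, receiver = x_owner[j], row_owner[i]
        if sender != receiver:
            needed[(sender, receiver)].add(j)
    send_volume = defaultdict(int)
    send_messages = defaultdict(int)
    for (sender, _receiver), columns in needed.items():
        send_volume[sender] += len(columns)
        send_messages[sender] += 1
    return {
        "total_volume": sum(send_volume.values()),
        "total_messages": len(needed),
        "max_volume": max(send_volume.values(), default=0),
        "max_messages": max(send_messages.values(), default=0),
    }

# Hypothetical 4x4 sparsity pattern split across two processors.
nonzeros = [(0, 0), (0, 2), (1, 1), (1, 3), (2, 0), (2, 2), (3, 1), (3, 2), (3, 3)]
metrics = comm_metrics(nonzeros, row_owner=[0, 0, 1, 1], x_owner=[0, 0, 1, 1])
```

For this pattern the sketch reports a total volume of 4 words in 2 messages, with each processor sending at most 2 words in 1 message. Two partitions with the same total volume can differ in the other three values, which is the motivation the abstract gives for treating them separately.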
Open Access
Fast optimal load balancing algorithms for 1D partitioning
(Academic Press, 2004) Pınar, A.; Aykanat, Cevdet
The one-dimensional decomposition of nonuniform workload arrays with optimal load balancing is investigated. The problem has been studied in the literature as the "chains-on-chains partitioning" problem. Despite the rich literature on exact algorithms, heuristics are still used in the parallel computing community with the "hope" of good decompositions and the "myth" of exact algorithms being hard to implement and not runtime efficient. We show that exact algorithms yield significant improvements in load balance over heuristics with negligible overhead. Detailed pseudocodes of the proposed algorithms are provided for reproducibility. We start with a literature review and propose improvements and efficient implementation tips for these algorithms. We also introduce novel algorithms that are asymptotically and runtime efficient. Our experiments on sparse matrix and direct volume rendering datasets verify that balance can be significantly improved by using exact algorithms. The proposed exact algorithms are 100 times faster than a single sparse-matrix vector multiplication for 64-way decompositions on average. We conclude that exact algorithms with the proposed efficient implementations can effectively replace heuristics. © 2004 Elsevier Inc. All rights reserved.
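As a rough illustration of what an exact method for chains-on-chains partitioning looks like, the Python sketch below combines a greedy feasibility test with bisection over the bottleneck value. It assumes non-negative integer task weights and a non-empty task list, and it is a textbook-style sketch rather than any of the specific algorithms proposed in the paper. The function names are assumptions.

```python
def probe(weights, num_parts, bottleneck):
    """Greedy test: can consecutive tasks be packed into num_parts parts of load <= bottleneck?"""
    parts_used, load = 1, 0
    for w in weights:
        if w > bottleneck:
            return False
        if load + w > bottleneck:
            parts_used += 1  # close the current part and start a new one
            load = w
        else:
            load += w
    return parts_used <= num_parts

def optimal_bottleneck(weights, num_parts):
    """Smallest achievable maximum part load, found by bisection over integer values."""
    total = sum(weights)
    low = max(max(weights), -(-total // num_parts))  # heaviest task or ceil(average)
    high = total
    while low < high:
        mid = (low + high) // 2
        if probe(weights, num_parts, mid):
            high = mid
        else:
            low = mid + 1
    return low

best = optimal_bottleneck([4, 1, 3, 2, 6, 2, 2], 3)
```

For the weights shown and three parts, the optimal bottleneck is 8, achieved by the split [4, 1, 3], [2, 6], [2, 2]. Each probe is a single pass over the array, which gives a sense of why exact methods can be cheap relative to the computation being balanced.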
Open Access
Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication
(IEEE, 1999) Catalyurek, U.V.; Aykanat, Cevdet
In this work, we show that the standard graph-partitioning-based decomposition of sparse matrices does not reflect the actual communication volume requirement for parallel matrix-vector multiplication. We propose two computational hypergraph models which avoid this crucial deficiency of the graph model. The proposed models reduce the decomposition problem to the well-known hypergraph partitioning problem. The recently proposed successful multilevel framework is exploited to develop a multilevel hypergraph partitioning tool PaToH for the experimental verification of our proposed hypergraph models. Experimental results on a wide range of realistic sparse test matrices confirm the validity of the proposed hypergraph models. In the decomposition of the test matrices, the hypergraph models using PaToH and hMeTiS result in up to 63 percent less communication volume (30 to 38 percent less on the average) than the graph model using MeTiS, while PaToH is only 1.3-2.3 times slower than MeTiS on the average.
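The mismatch between the graph model and the actual communication volume can be seen on a tiny example. The Python sketch below assumes a row-wise partition of a structurally symmetric matrix with a nonzero diagonal, where each x entry is owned by the processor that owns the matching row. It compares a connectivity-minus-one count over columns (in the spirit of a hypergraph model with one net per column) against twice the number of cut edges of the standard graph model. Function names and the example pattern are assumptions.

```python
def hypergraph_volume(nonzeros, row_owner):
    """Sum over columns of (number of distinct parts touching the column - 1)."""
    parts_per_column = {}
    for i, j in nonzeros:
        parts_per_column.setdefault(j, set()).add(row_owner[i])
    return sum(len(parts) - 1 for parts in parts_per_column.values())

def graph_model_estimate(nonzeros, row_owner):
    """Two words per cut edge, one for each direction of the symmetric nonzero pair."""
    cut_edges = {
        frozenset((i, j))
        for i, j in nonzeros
        if i != j and row_owner[i] != row_owner[j]
    }
    return 2 * len(cut_edges)

# Hypothetical 3x3 symmetric pattern: row 0 is coupled to rows 1 and 2.
nonzeros = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0), (2, 2)]
row_owner = [0, 1, 1]
actual = hypergraph_volume(nonzeros, row_owner)
estimate = graph_model_estimate(nonzeros, row_owner)
```

Here x[0] is needed by rows 1 and 2, which live on the same processor, so it is sent only once. The column-wise count gives 3 words, while the graph model charges each of the two cut edges separately and reports 4.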
Open Access
Hypergraph-theoretic partitioning models for parallel web crawling
(Springer, London, 2012) Türk, Ata; Cambazoğlu, B. Barla; Aykanat, Cevdet
Parallel web crawling is an important technique employed by large-scale search engines for content acquisition. A commonly used inter-processor coordination scheme in parallel crawling systems is the link exchange scheme, where discovered links are communicated between processors. This scheme can attain the coverage and quality level of a serial crawler while avoiding redundant crawling of pages by different processors. The main problem in the exchange scheme is the high inter-processor communication overhead. In this work, we propose a hypergraph model that reduces the communication overhead associated with link exchange operations in parallel web crawling systems by intelligent assignment of sites to processors. Our hypergraph model can correctly capture and minimize the number of network messages exchanged between crawlers. We evaluate the performance of our models on four benchmark datasets. Compared to the traditional hash-based assignment approach, significant performance improvements are observed in reducing the inter-processor communication overhead. © 2012 Springer-Verlag London Limited.
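The effect of site-to-processor assignment on link exchange can be sketched in a few lines of Python. The example below is not the hypergraph model of the paper. It only shows a hash-based assignment (using CRC32 as a stand-in hash) and a simple cost count for a given assignment. The site names, the cost definition, and all function names are hypothetical.

```python
import zlib

def hash_assign(sites, num_procs):
    """Hash-based assignment: the owner of a site depends only on its name."""
    return {site: zlib.crc32(site.encode("utf-8")) % num_procs for site in sites}

def link_exchange_cost(links, owner):
    """Count links whose source and target sites are crawled by different processors."""
    crossing = [(owner[src], owner[dst]) for src, dst in links if owner[src] != owner[dst]]
    return {"links_exchanged": len(crossing), "processor_pairs": len(set(crossing))}

# Hypothetical site-level link graph.
links = [
    ("a.example", "b.example"),
    ("a.example", "c.example"),
    ("b.example", "a.example"),
    ("c.example", "d.example"),
]
grouped = {"a.example": 0, "b.example": 0, "c.example": 1, "d.example": 1}
scattered = {"a.example": 0, "b.example": 1, "c.example": 1, "d.example": 0}
grouped_cost = link_exchange_cost(links, grouped)
scattered_cost = link_exchange_cost(links, scattered)
hashed = hash_assign(grouped.keys(), 2)
```

Keeping tightly linked sites on the same processor reduces the count from 4 exchanged links to 1 in this toy graph. A hash-based assignment ignores link structure, so it has no mechanism for finding such groupings.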
Open Access
Mars: A tool-based modeling, animation, and parallel rendering system
(Springer, 1994) Aktıhanoğlu, M.; Özgüç, B.; Aykanat, Cevdet
This paper describes a system for modeling, animating, previewing and rendering articulated objects. The system has a modeler of objects that consists of joints and segments. The animator interactively positions the articulated object in its stick, control vertex, or rectangular prism representation and previews the motion in real time. Then the data representing the motion and the models is sent to a multicomputer [iPSC/2 Hypercube (Intel)]. The frames are rendered in parallel, exploiting the coherence between successive frames, thus cutting down the rendering time significantly. Our main aim is to make a detailed study on rendering of a sequence of 3D scenes. The results show that due to an inherent correlation between the 3D scenes, an efficient rendering can be achieved. © 1994 Springer-Verlag.
Open Access
Model-driven approach for supporting the mapping of parallel algorithms to parallel computing platforms
(Springer, Berlin, Heidelberg, 2013) Arkin, E.; Tekinerdogan, Bedir; Imre, K.M.
The trend from single processor to parallel computer architectures has increased the importance of parallel computing. To support parallel computing it is important to map parallel algorithms to a computing platform that consists of multiple parallel processing nodes. In general different alternative mappings can be defined that perform differently with respect to the quality requirements for power consumption, efficiency and memory usage. The mapping process can be carried out manually for platforms with a limited number of processing nodes. However, for exascale computing in which hundreds of thousands of processing nodes are applied, the mapping process soon becomes intractable. To assist the parallel computing engineer we provide a model-driven approach to analyze, model, and select feasible mappings. We describe the developed toolset that implements the corresponding approach together with the required metamodels and model transformations. We illustrate our approach for the well-known complete exchange algorithm in parallel computing. © 2013 Springer-Verlag.
Open Access
Model-driven transformations for mapping parallel algorithms on parallel computing platforms
(MDHPCL, 2013) Arkin, E.; Tekinerdoğan, Bedir
One of the important problems in parallel computing is the mapping of the parallel algorithm to the parallel computing platform. For each parallel node, the corresponding code must be implemented. For platforms with a limited number of processing nodes this can be done manually. However, when the parallel computing platform consists of hundreds of thousands of processing nodes, the manual coding of the parallel algorithms becomes intractable and error-prone. Moreover, a change of the parallel computing platform requires considerable coding effort and time. In this paper we present a model-driven approach for generating the code of selected parallel algorithms to be mapped on parallel computing platforms. We describe the required platform independent metamodel, and the model-to-model and the model-to-text transformation patterns. We illustrate our approach for the parallel matrix multiplication algorithm. Copyright © 2013 for the individual papers by the papers' authors.
Open Access
A novel method for scaling iterative solvers: avoiding latency overhead of parallel sparse-matrix vector multiplies
(Institute of Electrical and Electronics Engineers, 2015) Selvitopi, R. O.; Ozdal, M. M.; Aykanat, Cevdet
In parallel linear iterative solvers, sparse matrix vector multiplication (SpMxV) incurs irregular point-to-point (P2P) communications, whereas inner product computations incur regular collective communications. These P2P communications cause an additional synchronization point with relatively high message latency costs due to small message sizes. In these solvers, each SpMxV is usually followed by an inner product computation that involves the output vector of SpMxV. Here, we exploit this property to propose a novel parallelization method that avoids the latency costs and synchronization overhead of P2P communications. Our method involves a computational and a communication rearrangement scheme. The computational rearrangement provides an alternative method for forming the input vector of SpMxV and allows P2P and collective communications to be performed in a single phase. The communication rearrangement realizes this opportunity by embedding P2P communications into global collective communication operations. The proposed method guarantees a fixed bound on the maximum number of messages communicated regardless of the sparsity pattern of the matrix. The downside, however, is an increased message volume and a negligible amount of redundant computation. We favor reducing the message latency costs at the expense of increasing message volume. Yet, we propose two iterative-improvement-based heuristics to alleviate the increase in the volume through one-to-one task-to-processor mapping. Our experiments on two supercomputers, Cray XE6 and IBM BlueGene/Q, on up to 2,048 processors show that the proposed parallelization method exhibits superior scalable performance compared to the conventional parallelization method.
Open Access
One-dimensional partitioning for heterogeneous systems: theory and practice
(Academic Press, 2008-11) Pınar, A.; Tabak, E. K.; Aykanat, Cevdet
We study the problem of one-dimensional partitioning of nonuniform workload arrays, with optimal load balancing for heterogeneous systems. We look at two cases: chain-on-chain partitioning, where the order of the processors is specified, and chain partitioning, where processor permutation is allowed. We present polynomial time algorithms to solve the chain-on-chain partitioning problem optimally, while we prove that the chain partitioning problem is NP-complete. Our empirical studies show that our proposed exact algorithms produce substantially better results than heuristics, while solution times remain comparable. © 2008 Elsevier Inc. All rights reserved.
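The chain-on-chain case, where the processor order is fixed, can be sketched in Python as follows. The sketch assumes positive integer weights and speeds, measures the cost of a processor as its load divided by its speed, and searches over all candidate values of the form (subchain sum) / speed using exact rational arithmetic. It enumerates candidates by brute force, so it is suitable only for small inputs and is not one of the algorithms from the paper. All names are assumptions.

```python
from fractions import Fraction
from itertools import accumulate

def probe_hetero(weights, speeds, bottleneck):
    """Greedy test: each processor, in order, takes as many consecutive tasks as fit."""
    i = 0
    for speed in speeds:
        capacity = bottleneck * speed
        load = 0
        while i < len(weights) and load + weights[i] <= capacity:
            load += weights[i]
            i += 1
    return i == len(weights)

def optimal_bottleneck_hetero(weights, speeds):
    """Smallest feasible bottleneck among all (subchain sum) / speed candidates."""
    prefix = [0] + list(accumulate(weights))
    n = len(weights)
    candidates = sorted({
        Fraction(prefix[j] - prefix[i], speed)
        for i in range(n)
        for j in range(i + 1, n + 1)
        for speed in speeds
    })
    low, high = 0, len(candidates) - 1
    while low < high:
        mid = (low + high) // 2
        if probe_hetero(weights, speeds, candidates[mid]):
            high = mid
        else:
            low = mid + 1
    return candidates[low]

best_time = optimal_bottleneck_hetero([4, 1, 3, 2, 6, 2, 2], [1, 2, 1])
```

With speeds 1, 2, 1 the best bottleneck is 11/2: the processors receive [4, 1], [3, 2, 6], and [2, 2], so the middle processor carries load 11 at twice the speed. Allowing the processor order to be permuted is the chain partitioning variant, which the abstract states is NP-complete and which this sketch does not attempt.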
Open Access
Online balancing two independent criteria
(Springer, 2008-10) Tse, Savio S.H.
We study the online bicriteria load balancing problem in this paper. We choose a system of distributed homogeneous file servers located in a cluster as the scenario and propose two online approximate algorithms for balancing their loads and required storage spaces. We first revisit the best existing solution for document placement, and rewrite it as our first algorithm by introducing some flexibility. The second algorithm bounds the load and storage space of each server by less than three times their trivial lower bounds, respectively; and more importantly, for each server, the value of at least one parameter is far from its worst case. The time complexity of both algorithms is O(log M). © 2008 Springer Berlin Heidelberg.