BUIR Repository :: Browsing by Subject "Load balancing"

Now showing 1 - 20 of 23

Open Access
Balancing computation load and communication overhead with multilevel self organizing maps
(Bilkent University, 2001-07) Bıkmaz, Erdoğan
Open Access
Combined use of prioritized AIMD and flow-based traffic splitting for robust TCP load balancing
(Springer, 2004) Alparslan, O.; Akar, N.; Karasan, E.
In this paper, we propose an AIMD-based TCP load balancing architecture in a backbone network where TCP flows are split between two explicitly routed paths, namely the primary and the secondary paths. We propose that primary paths have strict priority over the secondary paths with respect to packet forwarding and both paths are rate-controlled using ECN marking in the core and AIMD rate adjustment at the ingress nodes. We call this technique "prioritized AIMD". The buffers maintained at the ingress nodes for the two alternative paths help us predict the delay difference between the two paths which forms the basis for deciding on which path to forward a new-coming flow. We provide a simulation study for a large mesh network to demonstrate the efficiency of the proposed approach in terms of the average per-flow goodput and byte blocking rates. © Springer-Verlag Berlin Heidelberg 2004.
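The two mechanisms named in this abstract, AIMD rate adjustment at the ingress and delay-based path choice for new flows, can be illustrated with a minimal sketch. The function names, the increase and decrease constants, the rate limits, and the threshold rule are all illustrative assumptions; the abstract does not give the authors' parameters or their exact decision rule.

```python
def aimd_update(rate, congestion_marked, increase=1.0, decrease=0.5,
                min_rate=1.0, max_rate=100.0):
    """Return the next allowed sending rate for one path.

    All constants are hypothetical; only the AIMD shape comes from the abstract.
    """
    if congestion_marked:
        rate = rate * decrease   # multiplicative decrease when ECN marks are seen
    else:
        rate = rate + increase   # additive increase otherwise
    # Keep the rate inside the assumed operating range.
    return max(min_rate, min(max_rate, rate))


def pick_path(primary_delay, secondary_delay, threshold=0.0):
    """Choose a path for a newly arriving flow from the estimated delays.

    Assumed rule: use the secondary path only when the primary path's
    estimated delay exceeds the secondary's by more than `threshold`.
    """
    if primary_delay - secondary_delay > threshold:
        return "secondary"
    return "primary"
```

A larger `threshold` would bias new flows toward the primary path, which is one plausible way to tune how aggressively traffic is split.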
Open Access
Data decomposition techniques for parallel tree-based k-means clustering
(Bilkent University, 2002) Şen, Cenk
The main computation in k-means clustering is the calculation of distances between cluster centroids and patterns. As the number of patterns and the number of centroids increase, the time needed to complete these computations increases. This computational load requires high performance computers and/or algorithmic improvements. The parallel tree-based k-means algorithm on distributed memory machines combines algorithmic improvements with the high computation capacity of parallel computers to deal with huge datasets. Its performance is affected by the data decomposition technique used. In this thesis, we present novel data decomposition techniques to improve the performance of the parallel tree-based k-means algorithm on distributed memory machines. The proposed tree-based decomposition techniques try to decrease the total number of distance calculations by assigning compact subspaces to processors. A compact subspace improves the performance of the pruning function of the tree-based k-means algorithm. We have implemented the algorithm and conducted experiments on a PC cluster. Our experimental results demonstrate that the tree-based decomposition technique outperforms the random decomposition and stripwise decomposition techniques.
Open Access
A hypergraph-partitioning based remapping model for image-space parallel volume rendering
(Bilkent University, 2000) Cambazoğlu, Berkant Barla
Ray-casting is a popular direct volume rendering technique used to explore the content of 3D data. Although this technique is capable of producing high quality visualizations, its slowness prevents interactive use. The major method to overcome this speed limitation is parallelization. In this work, we investigate the image-space parallelization of ray-casting for distributed memory architectures. The most important issues in image-space parallelization are load balancing and minimization of the data redistribution overhead introduced at successive visualization instances. Load balancing in volume rendering requires a correct estimate of the screen workload. For this purpose, we tested three different load assignment schemes. Since the data used in this work is made up of unstructured tetrahedral grids, clusters of data were used instead of cells for efficiency purposes. Two different cluster-processor distribution schemes are employed to see the effects of the initial data distribution. The major contribution of the thesis is the hypergraph partitioning model proposed as a solution to the remapping problem. For this purpose, the existing hypergraph partitioning tool PaToH is modified and used as a one-phase remapping tool. The model is tested on a Parsytec CC system and satisfactory results are obtained. Compared to the two-phase jagged partitioning model, our work incurs less preprocessing overhead. At comparable load imbalance values, our hypergraph partitioning model requires 25% less total volume of communication than jagged partitioning on average.
Open Access
Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids
(Bilkent University, 1997) Kutluca, Hüseyin
In this thesis, image-space decomposition algorithms are proposed and utilized for the parallel implementation of a direct volume rendering algorithm. The screen-space bounding box of a primitive is used to approximate the coverage of the primitive on the screen. The number of bounding boxes in a region is used as the workload of the region. The exact model is proposed as a new workload array scheme to find the exact number of bounding boxes in a rectangular region in O(1) time. Chains-on-chains partitioning algorithms are exploited for load balancing in some of the proposed decomposition schemes. A summed area table scheme is utilized to achieve more efficient optimal jagged decomposition and iterative rectilinear decomposition algorithms. These two 2D decomposition algorithms are utilized for image-space decomposition using the exact model. Also, new algorithms that use the inverse area heuristic are implemented for image-space decomposition. The orthogonal recursive bisection algorithm with the medians-of-medians scheme is applied on a regular mesh and a quadtree superimposed on the screen. The Hilbert space-filling curve is also exploited for image-space decomposition. Twelve image-space decomposition algorithms are experimentally evaluated on a common framework with respect to load balance performance, the number of shared primitives, and the execution time of the decomposition algorithms.
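The summed area table mentioned in this abstract is what makes constant-time region workload queries possible. The sketch below shows the generic technique on a per-cell workload grid; it is not the thesis's exact model for counting bounding boxes, and the names `build_sat` and `region_load` are illustrative assumptions.

```python
def build_sat(grid):
    """Build a summed area table with one extra zero row and column.

    sat[r][c] holds the sum of grid[0:r][0:c], so sat has size (rows+1) x (cols+1).
    """
    rows, cols = len(grid), len(grid[0])
    sat = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            sat[r + 1][c + 1] = (grid[r][c] + sat[r][c + 1]
                                 + sat[r + 1][c] - sat[r][c])
    return sat


def region_load(sat, r0, c0, r1, c1):
    """Workload of the half-open rectangle rows r0..r1-1, cols c0..c1-1, in O(1)."""
    return sat[r1][c1] - sat[r0][c1] - sat[r1][c0] + sat[r0][c0]
```

Building the table costs one pass over the grid, after which a decomposition algorithm can evaluate any candidate rectangular region with four lookups.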
Open Access
Improving medium-grain partitioning for scalable sparse tensor decomposition
(Institute of Electrical and Electronics Engineers, 2018) Acer, S.; Torun, T.; Aykanat, Cevdet
Tensor decomposition is widely used in the analysis of multi-dimensional data. The canonical polyadic decomposition (CPD) is one of the most popular decomposition methods and commonly found by the CPD-ALS algorithm. High computational and memory costs of CPD-ALS necessitate the use of a distributed-memory-parallel algorithm for efficiency. The medium-grain CPD-ALS algorithm, which adopts multi-dimensional cartesian tensor partitioning, is one of the most successful distributed CPD-ALS algorithms for sparse tensors. This is because cartesian partitioning imposes nice upper bounds on communication overheads. However, this model does not utilize the sparsity pattern of the tensor to reduce the total communication volume. The objective of this work is to fill this literature gap. We propose a novel hypergraph-partitioning model, CartHP, whose partitioning objective correctly encapsulates the minimization of total communication volume of multi-dimensional cartesian tensor partitioning. Experiments on twelve real-world tensors using up to 1024 processors validate the effectiveness of the proposed CartHP model. Compared to the baseline medium-grain model, CartHP achieves average reductions of 52, 43 and 24 percent in total communication volume, communication time and overall runtime of CPD-ALS, respectively.
Open Access
Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems
(Elsevier BV, 2016) Acer, S.; Selvitopi, O.; Aykanat, Cevdet
We propose a comprehensive and generic framework to minimize multiple and different volume-based communication cost metrics for sparse matrix dense matrix multiplication (SpMM). SpMM is an important kernel that finds application in computational linear algebra and big data analytics. On distributed memory systems, this kernel is usually characterized with its high communication volume requirements. Our approach targets irregularly sparse matrices and is based on both graph and hypergraph partitioning models that rely on the widely adopted recursive bipartitioning paradigm. The proposed models are lightweight, portable (can be realized using any graph and hypergraph partitioning tool) and can simultaneously optimize different cost metrics besides total volume, such as maximum send/receive volume, maximum sum of send and receive volumes, etc., in a single partitioning phase. They allow one to define and optimize as many custom volume-based metrics as desired through a flexible formulation. The experiments on a wide range of about a thousand matrices show that the proposed models drastically reduce the maximum communication volume compared to the standard partitioning models that only address the minimization of total volume. The improvements obtained on volume-based partition quality metrics using our models are validated with parallel SpMM as well as parallel multi-source BFS experiments on two large-scale systems. For parallel SpMM, compared to the standard partitioning models, our graph and hypergraph partitioning models respectively achieve reductions of 14% and 22% in runtime, on average. Compared to the state-of-the-art partitioner UMPa, our graph model is overall 14.5× faster and achieves an average improvement of 19% in the partition quality on instances that are bounded by maximum volume. For parallel BFS, we show on graphs with more than a billion edges that the scalability can significantly be improved with our models compared to a recently proposed two-dimensional partitioning model.
Open Access
Improving the performance of independent task assignment heuristics Minmin, Maxmin and Sufferage
(Institute of Electrical and Electronics Engineers, 2014) Tabak, E. K.; Cambazoglu, B. B.; Aykanat, Cevdet
MinMin, MaxMin, and Sufferage are constructive heuristics that are widely and successfully used in assigning independent tasks to processors in heterogeneous computing systems. All three heuristics are known to run in O(KN^2) time in assigning N tasks to K processors. In this paper, we propose an algorithmic improvement that asymptotically decreases the running time complexity of MinMin to O(KN log N) without affecting its solution quality. Furthermore, we combine the newly proposed MinMin algorithm with MaxMin as well as Sufferage, obtaining two hybrid algorithms. The motivation behind the former hybrid algorithm is to address the drawback of MaxMin in solving problem instances with highly skewed cost distributions while also improving the running time performance of MaxMin. The latter hybrid algorithm improves the running time performance of Sufferage without degrading its solution quality. The proposed algorithms are easy to implement and we illustrate them through detailed pseudocodes. The experimental results over a large number of real-life data sets show that the proposed fast MinMin algorithm and the proposed hybrid algorithms perform significantly better than their traditional counterparts as well as more recent state-of-the-art assignment heuristics. For the large data sets used in the experiments, MinMin, MaxMin, and Sufferage, as well as recent state-of-the-art heuristics, require days, weeks, or even months to produce a solution, whereas all of the proposed algorithms produce solutions within only two or three minutes. © 2013 IEEE.
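For orientation, the classical MinMin heuristic that this paper speeds up can be sketched as follows. This is the straightforward quadratic baseline, not the faster algorithm proposed by the authors; the function name `min_min` and the `cost[t][p]` layout are illustrative assumptions.

```python
def min_min(cost):
    """Baseline MinMin for independent tasks on heterogeneous processors.

    cost[t][p] is the execution time of task t on processor p.
    Returns (assignment, makespan), where assignment[t] is the chosen processor.
    """
    n_tasks, n_procs = len(cost), len(cost[0])
    ready = [0.0] * n_procs            # time at which each processor becomes free
    assignment = [None] * n_tasks
    unassigned = set(range(n_tasks))
    while unassigned:
        best = None                    # (completion_time, task, processor)
        # Scan every remaining task on every processor and keep the pair with
        # the smallest completion time; tuple comparison breaks ties by index.
        for t in unassigned:
            for p in range(n_procs):
                candidate = (ready[p] + cost[t][p], t, p)
                if best is None or candidate < best:
                    best = candidate
        completion, t, p = best
        assignment[t] = p
        ready[p] = completion
        unassigned.remove(t)
    return assignment, max(ready)
```

Each of the N rounds rescans all remaining task and processor pairs, which is where the quadratic dependence on N comes from.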
Open Access
Latency-centric models and methods for scaling sparse operations
(Bilkent University, 2016-07) Selvitopi, Oğuz
Parallelization of sparse kernels and operations on large-scale distributed memory systems remains a major challenge due to the ever-increasing scale of modern high performance computing systems and multiple conflicting factors that affect the parallel performance. The low computational density and high memory footprint of sparse operations add to these challenges by implying more stressed communication bottlenecks and make fast and efficient parallelization models and methods imperative for scalable performance. Sparse operations are usually performed with structures related to sparse matrices and matrices are partitioned prior to the execution for distributing computations among processors. Although the literature is rich in this aspect, it still lacks the techniques that embrace multiple factors affecting communication performance in a complete and just manner. In this thesis, we investigate models and methods for intelligent partitioning of sparse matrices that strive for achieving a more correct approximation of the communication performance. To improve the communication performance of parallel sparse operations, we mainly focus on reducing the latency bottlenecks, which stand as a major component in the overall communication cost. Besides these, our approaches consider already adopted communication cost metrics in the literature as well and aim to address as many cost metrics as possible. We propose one-phase and two-phase partitioning models to reduce the latency cost in one-dimensional (1D) and two-dimensional (2D) sparse matrix partitioning, respectively. The model for 1D partitioning relies on the commonly adopted recursive bipartitioning framework and it uses novel structures to capture the relations that incur latency. The models for 2D partitioning aim to improve the performance of solvers for nonsymmetric linear systems by using different partitions for the vectors in the solver and use that flexibility to exploit the latency cost. Our findings indicate that the latency costs should definitely be considered in order to achieve scalable performance on distributed memory systems.
Open Access
Load balanced locality-aware parallel SGD on multicore architectures for latent factor based collaborative filtering
(Elsevier BV * North-Holland, 2023-04-20) Gülcan, Selçuk; Özdal, Muhammet Mustafa; Aykanat, Cevdet
We investigate the parallelization of Stochastic Gradient Descent (SGD) for matrix completion on multicore architectures. We provide an experimental analysis of current SGD algorithms to find out their bottlenecks and limitations. Grid-based methods suffer from load imbalance among 2D blocks of the rating matrix, especially when datasets are skewed and sparse. Asynchronous methods, on the other hand, can face cache issues due to their memory access pattern. We propose bin-packing-based block balancing methods that are alternative to the recently proposed BaPa method. We then introduce Locality Aware SGD (LASGD), a grid-based asynchronous parallel SGD algorithm that efficiently utilizes cache by changing nonzero update sequence without affecting factor update order and carefully arranging latent factor matrices in the memory. Combined with our proposed load balancing methods, our experiments show that LASGD performs significantly better than alternative approaches in parallel shared-memory systems.
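The general idea of balancing load by bin packing can be sketched with a greedy longest-processing-time rule. This is a generic heuristic shown for illustration, not the block balancing methods proposed in the paper; treating `weights` as per-row (or per-column) nonzero counts and bins as the stripes of the 2D grid is an assumption, as is the name `lpt_pack`.

```python
import heapq


def lpt_pack(weights, n_bins):
    """Greedy bin packing: place the heaviest item into the lightest bin.

    weights[i] is the load of item i (for example, a nonzero count).
    Returns (bins, loads): bins[b] lists item indices, loads[b] is their total.
    """
    heap = [(0, b) for b in range(n_bins)]   # (current load, bin index)
    heapq.heapify(heap)
    bins = [[] for _ in range(n_bins)]
    # Visit items from heaviest to lightest; sorted() is stable, so ties keep index order.
    for item in sorted(range(len(weights)), key=lambda i: -weights[i]):
        load, b = heapq.heappop(heap)        # least loaded bin so far
        bins[b].append(item)
        heapq.heappush(heap, (load + weights[item], b))
    loads = [sum(weights[i] for i in members) for members in bins]
    return bins, loads
```

On skewed inputs, where a few items carry most of the weight, placing heavy items first tends to leave the light items available to even out the remaining gaps.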
Open Access
Multipath based traffic engineering in MPLS networks
(Bilkent University, 2002-09) Hökelek, İbrahim
Open Access
A neural network based approach for call admission control in heterogeneous networks
(Zhengzhou University, 2014) Naeem, B.; Ngah, R.; Hashim, S.Z.M.; Maqbool W.; Ali, M.B.
The next generation of wireless networks will be based on infrastructure that supports heterogeneous networks. In such a scenario, users will move between different networks; therefore the number of handovers that a user has to make will become greater. Thus, at a given instant, there will be a great chance that a certain cell does not have the capacity to sustain the needs of its users. This may result in a great loss of calls and lead to poor quality of service. Moreover, in the future generation of wireless networks, end users will be able to connect to any suitable network among the available set of heterogeneous networks. This ability of an end user to connect to the network of their choice may also affect the network load of various base stations. This necessitates a suitable call admission control scheme for the implementation of heterogeneous networks in the future. Since the behavior of users arriving at any cell in a heterogeneous network is unpredictable, we utilize a neural network to model how our heterogeneous network admits network load, so that the trained neural network is able to estimate when a call should be admitted in a new situation. The results obtained indicate that the neural network approach solves the problem of call admission control in unforeseen real-time scenarios. The neural network shows reduced error for increased values of the learning rate and momentum constant.
Open Access
A new load balancing heuristic using self-organizing maps
(Bilkent University, 1999) Atun, Murat
In order to have optimal performance during the execution of a parallel program, the tasks of the parallel computation must be mapped to processors such that the computational load is distributed as evenly as possible while highly communicating tasks are placed close together. We describe a new algorithm for the static load balancing problem based on Kohonen Self-Organizing Maps (SOM) which preserves the neighborhood relationship of tasks. We define the input space of the SOM algorithm to be a unit square and divide it into "number of processors" regions. The tasks are represented by the neurons, which are mapped to the regions randomly. We enforce load balancing by selecting training input from the region of the least loaded processor. We examine the impact of various input selection strategies and neighborhood functions on the accuracy of the mapping. The results show that our algorithm outperforms the other task mapping algorithms with SOMs.
Open Access
One-dimensional partitioning for heterogeneous systems: theory and practice
(Academic Press, 2008-11) Pınar, A.; Tabak, E. K.; Aykanat, Cevdet
We study the problem of one-dimensional partitioning of nonuniform workload arrays, with optimal load balancing for heterogeneous systems. We look at two cases: chain-on-chain partitioning, where the order of the processors is specified, and chain partitioning, where processor permutation is allowed. We present polynomial time algorithms to solve the chain-on-chain partitioning problem optimally, while we prove that the chain partitioning problem is NP-complete. Our empirical studies show that our proposed exact algorithms produce substantially better results than heuristics, while solution times remain comparable. © 2008 Elsevier Inc. All rights reserved.
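The chain-on-chain problem described here, with the processor order fixed, can be solved exactly on small inputs with a greedy feasibility probe plus a search over candidate bottleneck values. The sketch below is a simple brute-force illustration under stated assumptions (positive integer weights and speeds, a non-empty chain); it is not the authors' algorithm, and the names `probe` and `chain_on_chain` are hypothetical.

```python
from fractions import Fraction
from itertools import accumulate


def probe(prefix, speeds, bottleneck):
    """Greedy test: can the chain be cut, in processor order, so that every
    processor's load divided by its speed is at most `bottleneck`?"""
    n = len(prefix) - 1
    start = 0
    for s in speeds:
        end = start
        # Give this processor the longest run of tasks that fits its capacity.
        while end < n and prefix[end + 1] - prefix[start] <= bottleneck * s:
            end += 1
        start = end
    return start == n


def chain_on_chain(weights, speeds):
    """Return the optimal bottleneck (max load/speed) as an exact Fraction."""
    prefix = [0] + list(accumulate(weights))
    n = len(weights)
    # The optimum equals some contiguous interval sum divided by some speed.
    candidates = sorted({Fraction(prefix[j] - prefix[i], s)
                         for i in range(n)
                         for j in range(i + 1, n + 1)
                         for s in speeds})
    lo, hi = 0, len(candidates) - 1      # the largest candidate is always feasible
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(prefix, speeds, candidates[mid]):
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo]
```

For example, with weights `[1, 2, 3, 4]` the call `chain_on_chain(weights, [1, 2])` evaluates to `Fraction(7, 2)`, while reversing the speeds to `[2, 1]` gives `4`, which shows how the fixed processor order changes the optimum.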
Open Access
Online balancing two independent criteria
(Springer, 2008-10) Tse, Savio S.H.
We study the online bicriteria load balancing problem in this paper. We choose a system of distributed homogeneous file servers located in a cluster as the scenario and propose two online approximate algorithms for balancing their loads and required storage spaces. We first revisit the best existing solution for document placement, and rewrite it in our first algorithm by adding some flexibility. The second algorithm bounds the load and storage space of each server by less than three times their trivial lower bounds, respectively; and more importantly, for each server, the value of at least one parameter is far from its worst case. The time complexity of both algorithms is O(log M). © 2008 Springer Berlin Heidelberg.
Open Access
Online bicriteria load balancing for distributed file servers
(IEEE, 2008-08) Tse, Savio
We study the online bicriteria load balancing problem in a system of M distributed homogeneous file servers located in a cluster. The load and storage space are assumed to be independent. We propose two online approximate algorithms for balancing the load and required storage space of each server during document placement. Our first algorithm combines the first result in [10] and the upper bound result in [1]. By applying document reallocation, we obtain a further improvement and give a smoother tradeoff curve of the upper bounds of load and storage space. This result improves the best existing solutions. The second algorithm is for theoretical purposes. Its existence proves that the bounds for the load and the required storage space of each server, respectively, are strictly better when document reallocation is allowed. It enhances the research in applying document reallocation. The time complexities of both algorithms are O(log M); and the cost of document reallocation should be taken into account.
Open Access
Optimal access point selection in multi-channel IEEE 802.11 networks
(Bilkent University, 2008) Aydınlı, Mustafa
A wireless access point (WAP or AP) is a device that allows wireless communication devices to connect to a wireless local area network (WLAN). An AP usually connects to a wired network, and can relay data between the wireless devices (such as computers or printers) and wired devices on the network. Optimal access point selection is a crucial problem in IEEE 802.11 WLAN networks. Access points (APs) cover a certain area and provide adequate bandwidth to the users around them. When the area to be covered is large, several APs are necessary. Furthermore, in order to mitigate the adverse effects of interference between APs, multiple channels are used. In this thesis, a service area is divided into demand clusters (DCs) in which the number of users per DC and the average traffic rates are known. Next, we calculate the congestion of each AP by using the average traffic load. With our Optimal Access Point Selection Algorithm, we balance the traffic loads in APs using a mixed integer linear programming formulation. This algorithm guarantees that each DC is assigned an AP and there is sufficient received power. Furthermore, the interference between the adjacent APs is controlled so that the received signal to interference and noise ratio at each AP satisfies a minimum level. Interference control is accomplished by using a multi-channel WLAN. In this thesis, both orthogonal (non-overlapping) and non-orthogonal (overlapping) channel assignment schemes are considered. The total interference is computed taking into account both co-channel and inter-channel interferences. The developed AP selection methodology is applied to WLAN designs for several buildings. It is observed from the designed networks that a DC need not connect to the closest AP; it may instead be connected to an AP that is farther away but less congested. DCs are assigned to APs such that all DCs are covered. The effects of parameters such as traffic load, receiver sensitivity, and the number of APs are also studied.
Open Access
Optimizing nonzero-based sparse matrix partitioning models via reducing latency
(Academic Press, 2018) Acer, S.; Selvitopi, O.; Aykanat, Cevdet
For the parallelization of sparse matrix-vector multiplication (SpMV) on distributed memory systems, nonzero-based fine-grain and medium-grain partitioning models attain the lowest communication volume and computational imbalance among all partitioning models. This usually comes, however, at the expense of high message count, i.e., high latency overhead. This work addresses this shortcoming by proposing new fine-grain and medium-grain models that are able to minimize communication volume and message count in a single partitioning phase. The new models utilize message nets in order to encapsulate the minimization of total message count. We further fine-tune these models by proposing delayed addition and thresholding for message nets in order to establish a trade-off between the conflicting objectives of minimizing communication volume and message count. The experiments on an extensive dataset of nearly one thousand matrices show that the proposed models improve the total message count of the original nonzero-based models by up to 27% on the average, which is reflected on the parallel runtime of SpMV as an average reduction of 15% on 512 processors.
Open Access
Parallel algorithms for the solution of large sparse inequality systems on distributed memory architectures
(Bilkent University, 1998) Turna, Esma
In this thesis, several parallel algorithms are proposed and utilized for the solution of large sparse linear inequality systems. The parallelization schemes are developed from the coarse-grain parallel formulation of the surrogate constraint method, based on the partitioning strategy: 1D partitioning and 2D partitioning. Furthermore, a third parallelization scheme is developed for the explicit minimization of the communication overhead in 1D partitioning, by using hypergraph partitioning. Utilizing the hypergraph model, the communication overhead is maintained via a global communication scheme and a local communication scheme. In addition, new algorithms that use the bin packing heuristic are investigated for efficient load balancing in uniform rowwise striped and checkerboard partitioning. A general class of image recovery problems is formulated as a linear inequality system. The restoration of images blurred by so-called point spread functions arising from effects such as misfocus of the photographic device, atmospheric turbulence, etc. is successfully provided with the developed parallel algorithms.
Open Access
Parallel stochastic gradient descent on multicore architectures
(Bilkent University, 2020-09) Gülcan, Selçuk
The focus of the thesis is the efficient parallelization of the Stochastic Gradient Descent (SGD) algorithm for matrix completion problems on multicore architectures. Asynchronous methods and block-based methods utilizing 2D grid partitioning for task-to-thread assignment are commonly used approaches for shared-memory parallelization. However, asynchronous methods can have performance issues due to their memory access patterns, whereas grid-based methods can suffer from load imbalance, especially when data sets are skewed and sparse. In this thesis, we first analyze parallel performance bottlenecks of the existing SGD algorithms in detail. Then, we propose new algorithms to alleviate these performance bottlenecks. Specifically, we propose bin-packing-based algorithms to balance thread loads under 2D partitioning. We also propose a grid-based asynchronous parallel SGD algorithm that improves cache utilization by changing the entry update order without affecting the factor update order and rearranging the memory layouts of the latent factor matrices. Our experiments show that the proposed methods perform significantly better than the existing approaches on shared-memory multi-core systems.