Browsing by Subject "Program processors"

Now showing 1 - 12 of 12

Open Access
Accelerating the HyperLogLog cardinality estimation algorithm
(Hindawi Limited, 2017) Bozkus, C.; Fraguela, B. B.
In recent years, vast amounts of data of different kinds, from pictures and videos from our cameras to software logs from sensor networks and Internet routers operating day and night, are being generated. This has led to new big data problems, which require new algorithms to handle these large volumes of data and as a result are very computationally demanding because of the volumes to process. In this paper, we parallelize one of these new algorithms, namely, the HyperLogLog algorithm, which estimates the number of different items in a large data set with minimal memory usage, as it lowers the typical memory usage of this type of calculation from O(n) to O(1). We have implemented parallelizations based on OpenMP and OpenCL and evaluated them in a standard multicore system, an Intel Xeon Phi, and two GPUs from different vendors. The results obtained in our experiments, in which we reach a speedup of 88.6 with respect to an optimized sequential implementation, are very positive, particularly taking into account the need to run this kind of algorithm on large amounts of data. © 2017 Cem Bozkus and Basilio B. Fraguela.
Open Access
Auto-tuning similarity search algorithms on multi-core architectures
(2013) Gedik, B.
In recent times, large high-dimensional datasets have become ubiquitous. Video and image repositories, financial, and sensor data are just a few examples of such datasets in practice. Many applications that use such datasets require the retrieval of data items similar to a given query item, or the nearest neighbors (NN or k -NN) of a given item. Another common query is the retrieval of multiple sets of nearest neighbors, i.e., multi k -NN, for different query items on the same data. With commodity multi-core CPUs becoming more and more widespread at lower costs, developing parallel algorithms for these search problems has become increasingly important. While the core nearest neighbor search problem is relatively easy to parallelize, it is challenging to tune it for optimality. This is due to the fact that the various performance-specific algorithmic parameters, or "tuning knobs", are inter-related and also depend on the data and query workloads. In this paper, we present (1) a detailed study of the various tuning knobs and their contributions on increasing the query throughput for parallelized versions of the two most common classes of high-dimensional multi-NN search algorithms: linear scan and tree traversal, and (2) an offline auto-tuner for setting these knobs by iteratively measuring actual query execution times for a given workload and dataset. We show experimentally that our auto-tuner reaches near-optimal performance and significantly outperforms un-tuned versions of parallel multi-NN algorithms for real video repository data on a variety of multi-core platforms. © 2013 Springer Science+Business Media New York.
Open Access
Effect of inverted index partitioning schemes on performance of query processing in parallel text retrieval systems
(Springer, 2006-11) Cambazoğlu, B. Barla; Çatal, A.; Aykanat, Cevdet
Shared-nothing, parallel text retrieval systems require an inverted index, representing a document collection, to be partitioned among a number of processors. In general, the index can be partitioned based on either the terms or documents in the collection, and the way the partitioning is done greatly affects the query processing performance of the parallel system. In this work, we investigate the effect of these two index partitioning schemes on query processing. We conduct experiments on a 32-node PC cluster, considering the case where index is completely stored in disk. Performance results are reported for a large (30 GB) document collection using an MPI-based parallel query processing implementation. © Springer-Verlag Berlin Heidelberg 2006.
Open Access
Effective kernel mapping for OpenCL applications in heterogeneous platforms
(Institute of Electrical and Electronics Engineers, 2012-09) Albayrak, Ömer Erdil; Aktürk, İsmail; Öztürk, Özcan
Many core accelerators are being deployed in many systems to improve the processing capabilities. In such systems, application mapping need to be enhanced to maximize the utilization of the underlying architecture. Especially in GPUs mapping becomes critical for multi-kernel applications as kernels may exhibit different characteristics. While some of the kernels run faster on GPU, others may refer to stay in CPU due to the high data transfer overhead. Thus, heterogeneous execution may yield to improved performance compared to executing the application only on CPU or only on GPU. In this paper, we propose a novel profiling-based kernel mapping algorithm to assign each kernel of an application to the proper device to improve the overall performance of an application. We use profiling information of kernels on different devices and generate a map that identifies which kernel should run on where to improve the overall performance of an application. Initial experiments show that our approach can effectively map kernels on CPU and GPU, and outperforms to a CPU-only and GPU-only approach. © 2012 IEEE.
Open Access
An efficient parallelization technique for high throughput FFT-ASIPs
(IEEE, 2006) Ishebabi H.; Ascheid G.; Meyr H.; Atak, Oğuzhan; Atalar, Abdullah; Arıkan, Erdal
Fast Fourier Transformation (FFT) and it's inverse (IFFT) are used in Orthogonal Frequency Division Multiplexing (OFDM) systems for data (de)modulation. The transformations are the kernel tasks in an OFDM implementation, and are the most processing-intensive ones. Recent trends in the electronic consumer market require OFDM implementations to be flexible, making a trade-off between area, energy-efficiency, flexibility and timing a necessity. This has spurred the development of Application-Specific Instruction-Set Processors (ASIPs) for FFT processing. Parallelization is an architectural parameter that significantly influence design goals. This paper presents an analysis of the efficiency of parallelization techniques for an FFT-ASIP. It is shown that existing techniques are inefficient for high throughput applications such as Ultra Wideband (UWB), because of memory bottlenecks. Therefore, an interleaved execution technique which exploits temporal parallelism is proposed. With this technique, it is possible to meet the throughput requirement of UWB (409.6 Msamples/s) with only 4 non-trivial butterfly units for an ASIP that runs at 400MHz. © 2006 IEEE.
Open Access
Efficient processing of category-restricted queries for web directories
(Springer, 2008-03-04) Altıngövde, İsmail Şengör; Can, Fazlı; Ulusoy, Özgür
We show that a cluster-skipping inverted index (CS-IIS) is a practical and efficient file structure to support category-restricted queries for searching Web directories. The query processing strategy with CS-IIS improves CPU time efficiency without imposing any limitations on the directory size. © 2008 Springer-Verlag Berlin Heidelberg.
Open Access
Emerging accelerator platforms for data centers
(IEEE, 2017-12-04) Özdal, Muhammet Mustafa
CPU and GPU platforms may not be the best options for many emerging compute patterns, which led to a new breed of emerging accelerator platforms. This article gives a comprehensive overview with a focus on commercial platforms.
Open Access
Heuristics for scheduling file-sharing tasks on heterogeneous systems with distributed repositories
(Academic Press, 2007) Kaya, K.; Uçar, B.; Aykanat, Cevdet
We consider the problem of scheduling an application on a computing system consisting of heterogeneous processors and data repositories. The application consists of a large number of file-sharing otherwise independent tasks. The files initially reside on the repositories. The processors and the repositories are connected through a heterogeneous interconnection network. Our aim is to assign the tasks to the processors, to schedule the file transfers from the repositories, and to schedule the executions of tasks on each processor in such a way that the turnaround time is minimized. We propose a heuristic composed of three phases: initial task assignment, task assignment refinement, and execution ordering. We experimentally compare the proposed heuristics with three well-known heuristics on a large number of problem instances. The proposed heuristic runs considerably faster than the existing heuristics and obtains 10-14% better turnaround times than the best of the three existing heuristics. © 2006 Elsevier Inc. All rights reserved.
Open Access
Improving application behavior on heterogeneous manycore systems through kernel mapping
(2013) Albayrak, O. E.; Akturk, I.; Ozturk, O.
Many-core accelerators are being more frequently deployed to improve the system processing capabilities. In such systems, application mapping must be enhanced to maximize utilization of the underlying architecture. Especially, in graphics processing units (GPUs), mapping kernels that are part of multi-kernel applications has a great impact on overall performance, since kernels may exhibit different characteristics on different CPUs and GPUs. While some kernels run faster on GPUs, others may perform better in CPUs. Thus, heterogeneous execution may yield better performance than executing the application only on a CPU or only on a GPU. In this paper, we investigate on two approaches: a novel profiling-based adaptive kernel mapping algorithm to assign each kernel of an application to the proper device, and a Mixed-Integer Programming (MIP) implementation to determine optimal mapping. We utilize profiling information for kernels on different devices and generate a map that identifies which kernel should run where in order to improve the overall performance of an application. Initial experiments show that our approach can efficiently map kernels on CPUs and GPUs, and outperforms CPU-only and GPU-only approaches. © 2013 Elsevier B.V. All rights reserved.
Open Access
Progressive refinement radiosity on ring-connected multicomputers
(ACM, 1993-10) Çapın, Tolga K.; Aykanat, Cevdet; Özgüç, Bülent
The progressive refinement method is investigated for parallelization on ring-connected multicomputers. A synchronous scheme, based on static task assignment, is proposed, in order to achieve better coherence during the parallel light distribution computations. An efficient global circulation scheme is proposed for the parallel light distribution computations, which reduces the total volume of concurrent communication by an asymptotical factor. The proposed parallel algorithm is implemented on a ring-embedded Intel's iPSC/2 hypercube multicomputer. Load balance quality of the proposed static assignment schemes are evaluated experimentally. The effect of coherence in the parallel light distribution computations on the shooting patch selection sequence is also investigated.
Open Access
Task assignment in heterogeneous computing systems
(Academic Press, 2006-01) Ucar, B.; Aykanat, Cevdet; Kaya, K.; Ikinci, M.
The problem of task assignment in heterogeneous computing systems has been studied for many years with many variations. We consider the version in which communicating tasks are to be assigned to heterogeneous processors with identical communication links to minimize the sum of the total execution and communication costs. Our contributions are three fold: a task clustering method which takes the execution times of the tasks into account; two metrics to determine the order in which tasks are assigned to the processors; a refinement heuristic which improves a given assignment. We use these three methods to obtain a family of task assignment algorithms including multilevel ones that apply clustering and refinement heuristics repeatedly. We have implemented eight existing algorithms to test the proposed methods. Our refinement algorithm improves the solutions of the existing algorithms by up to 15% and the proposed algorithms obtain better solutions than these refined solutions. © 2005 Elsevier Inc. All righs reserved.
Open Access
A template-based design methodology for graph-parallel hardware accelerators
(IEEE, 2017-05) Ayupov, A.; Yeşil, Şerif; Özdal, Muhammet Mustafa; Kim, T.; Burns, S.; Öztürk, Özcan
Graph applications have been gaining importance in the last decade due to emerging big data analytics problems such as Web graphs, social networks, and biological networks. For these applications, traditional CPU and GPU architectures suffer in terms of performance and power consumption due to irregular communications, random memory accesses, and load balancing problems. It has been shown that specialized hardware accelerators can achieve much better power and energy efficiency compared to the general purpose CPUs and GPUs. In this paper, we present a template-based methodology specifically targeted for hardware accelerator design of big-data graph applications. Important architectural features that are key for energy efficient execution are implemented in a common template. The proposed template-based methodology is used to design hardware accelerators for different graph applications with little effort. Compared to an application-specific high-level synthesis methodology, we show that the proposed methodology can generate hardware accelerators with up to 18× better energy efficiency and requires less design effort.