Browsing by Subject "GPU"
Now showing 1 - 7 of 7
Item Open Access: Effective kernel mapping for OpenCL applications in heterogeneous platforms (Institute of Electrical and Electronics Engineers, 2012-09). Albayrak, Ömer Erdil; Aktürk, İsmail; Öztürk, Özcan

Many-core accelerators are being deployed in many systems to improve processing capabilities. In such systems, application mapping needs to be enhanced to maximize the utilization of the underlying architecture. Mapping becomes especially critical on GPUs for multi-kernel applications, as kernels may exhibit different characteristics. While some kernels run faster on the GPU, others may be better kept on the CPU due to high data transfer overhead. Thus, heterogeneous execution may yield better performance than executing the application only on the CPU or only on the GPU. In this paper, we propose a novel profiling-based kernel mapping algorithm that assigns each kernel of an application to the proper device to improve overall performance. We use profiling information of kernels on different devices and generate a map identifying where each kernel should run. Initial experiments show that our approach effectively maps kernels between CPU and GPU and outperforms both CPU-only and GPU-only execution. © 2012 IEEE.

Item Embargo: Hardware acceleration for Swin Transformers at the edge (2024-05). Esergün, Yunus

While deep learning models have greatly enhanced visual processing abilities, deploying them in resource-limited edge environments is challenging due to their high energy consumption and computational requirements. The Swin Transformer is a prominent mechanism in computer vision that differs from traditional convolutional approaches by adopting a hierarchical approach to interpreting images. A common strategy for improving the inference efficiency of deep learning algorithms is clustering.
Locality-Sensitive Hashing (LSH) is a mechanism that implements clustering and leverages the inherent redundancy within Transformers to identify and exploit computational similarities. This thesis introduces a hardware accelerator for a Swin Transformer implementation with LSH in edge computing settings. The main goal is to reduce energy consumption while improving performance through custom hardware components. Specifically, our custom hardware accelerator design utilizes LSH clustering in Swin Transformers to decrease the amount of computation required. We tested our accelerator on two state-of-the-art datasets, ImageNet-1K and CIFAR-100. Our results demonstrate that the hardware accelerator enhances the processing speed of the Swin Transformer compared to GPU-based implementations: performance improves by 1.35x while power consumption drops from 19 W in the baseline GPU setting to 5-6 W. These improvements come with a negligible decrease in model accuracy of less than 1%, confirming the effectiveness of our hardware accelerator design in resource-limited edge computing environments.

Item Open Access: Parallel sparse matrix vector multiplication techniques for shared memory architectures (2014). Başaran, Mehmet

SpMxV (sparse matrix-vector multiplication) is a kernel operation in linear solvers, in which a sparse matrix is multiplied with a dense vector repeatedly. Due to the random memory access patterns exhibited by the SpMxV operation, hardware components such as prefetchers, CPU caches, and built-in SIMD units are under-utilized, limiting parallelization efficiency. In this study we developed:
• adaptive runtime scheduling and load-balancing algorithms for shared memory systems,
• a hybrid storage format to help effectively vectorize sub-matrices,
• an algorithm to extract the proposed hybrid sub-matrix storage format.
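As background for the items above, the core SpMxV operation can be sketched in plain Python over the standard CSR (compressed sparse row) layout. This is a serial reference version with illustrative array names, not the thesis's optimized hybrid-format kernels:

```python
def spmxv_csr(values, col_idx, row_ptr, x):
    """Multiply a CSR-format sparse matrix by a dense vector x.

    values:  nonzero entries, row by row
    col_idx: column index of each nonzero
    row_ptr: row_ptr[i]..row_ptr[i+1]-1 are row i's positions in values
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # Accumulate the nonzeros of row i; x is accessed irregularly
        # via col_idx, which is the source of the random memory accesses
        # the abstract mentions.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 3x3 matrix [[2,0,1],[0,3,0],[4,0,5]] times [1,1,1]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
cols = [0, 2, 1, 0, 2]
rptr = [0, 2, 3, 5]
print(spmxv_csr(vals, cols, rptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```

The indirect access `x[col_idx[k]]` is what defeats prefetchers and caches; the hybrid storage formats above aim to regularize exactly this pattern so sub-matrices can be vectorized.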
The implemented techniques are designed to be used by both hypergraph-partitioning-powered and spontaneous SpMxV operations. Tests were carried out on the Knights Corner (KNC) coprocessor, an x86-based many-core architecture employing a NoC (network-on-chip) communication subsystem. However, the proposed techniques can also be implemented for GPUs (graphics processing units).

Item Open Access: Particle based modeling and simulation of natural phenomena (2010). Bayraktar, Serkan

This thesis is about modeling and simulation of fluids and cloth-like deformable objects under the physically-based simulation paradigm. Simulated objects are modeled with particles, and their interaction with each other and the environment is defined by particle-to-particle forces. We propose several improvements over existing particle simulation techniques. Neighbor search algorithms are crucial for the performance and robustness of a particle system; we present a sorting-based neighbor search method that operates on a uniform grid and is parallelizable. We improve upon existing fluid surface generation methods: our method captures surface details better because it considers the relative position of fluid particles to the fluid surface. We investigate several alternative particle interaction schemes (Smoothed Particle Hydrodynamics, the Discrete Element Method, and the Lennard-Jones potential) for defining fluid-fluid, fluid-cloth, and fluid-boundary interaction forces. We also propose a practical way to simulate knitwear and its interaction with fluids, employing capillary pressure-based forces to simulate the absorption of fluid particles by knitwear, and a method to simulate the flow of miscible fluids. Our particle simulation system is implemented to exploit the parallel computing capabilities of commodity computers; specifically, we implemented the proposed methods on multicore CPUs and programmable graphics boards.
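The uniform-grid neighbor search that recurs in the particle-based work above can be sketched as follows. This is a minimal serial 2D version with illustrative names, assuming the interaction radius equals the cell size; the sorting-based parallel variant described in the thesis is more elaborate:

```python
from collections import defaultdict
from math import floor

def build_grid(positions, cell_size):
    """Hash each particle index into the uniform grid cell containing it."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        grid[(floor(x / cell_size), floor(y / cell_size))].append(idx)
    return grid

def neighbors(i, positions, grid, cell_size):
    """Particles within cell_size of particle i: only the 3x3 block of
    cells around i's cell needs scanning, instead of all particles."""
    x, y = positions[i]
    cx, cy = floor(x / cell_size), floor(y / cell_size)
    result = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for j in grid.get((cx + dx, cy + dy), []):
                dist_sq = (positions[j][0] - x) ** 2 + (positions[j][1] - y) ** 2
                if j != i and dist_sq <= cell_size ** 2:
                    result.append(j)
    return result

pts = [(0.1, 0.1), (0.2, 0.2), (5.0, 5.0)]
g = build_grid(pts, 1.0)
print(neighbors(0, pts, g, 1.0))  # [1]
```

This reduces each query from O(n) to roughly O(1) for uniformly distributed particles, which is why the same structure is used for both fluid neighbor queries and collision detection.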
The experiments show that our method is computationally efficient and produces realistic results.

Item Open Access: Scheduling for heterogeneous systems in accelerator-rich environments (Springer, 2021-05-25). Yesil, S.; Ozturk, Ozcan

The world is creating ever more data, and applications are required to deal with ever-increasing datasets. To process such datasets, heterogeneous and manycore accelerators are being deployed in various computing systems to improve energy efficiency. In this work, we present a runtime management system designed for such heterogeneous systems with manycore accelerators. More specifically, we design a resource-based runtime management system that considers application characteristics and their respective execution properties on the nodes and accelerators. We propose scheduling heuristics and runtime environment solutions to achieve better throughput and reduced energy consumption in computing systems with different accelerators. We give implementation details about our framework, show different scheduling algorithms, and present an experimental evaluation of our system. We also compare our approaches with an optimal scheme in which an integer linear programming approach maps applications onto the heterogeneous system. While the proposed framework can be extended to a wide variety of accelerators, our initial focus is on Graphics Processing Units (GPUs). Our experimental evaluations show that including accelerator support in the management framework improves energy consumption and execution time significantly.
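A device-mapping heuristic in the spirit of the profiling-driven scheduling described above can be sketched as a greedy earliest-finish-time rule. The profile numbers and the rule itself are illustrative assumptions, not the paper's actual algorithm:

```python
def map_kernels(profiles):
    """Greedily assign each kernel to the device that finishes it earliest.

    profiles: one dict per kernel, {device_name: measured_runtime_seconds},
              standing in for per-device profiling data (hypothetical here).
    Returns the per-kernel device assignment and each device's total load.
    """
    load = {}
    assignment = []
    for prof in profiles:
        # Pick the device minimizing (current load + this kernel's runtime),
        # so a kernel that is slower on the GPU can still land on the CPU
        # when the GPU queue is long.
        best = min(prof, key=lambda d: load.get(d, 0.0) + prof[d])
        load[best] = load.get(best, 0.0) + prof[best]
        assignment.append(best)
    return assignment, load

# Hypothetical profiled runtimes (seconds) for three kernels.
profs = [{"cpu": 4.0, "gpu": 1.0},
         {"cpu": 2.0, "gpu": 6.0},
         {"cpu": 5.0, "gpu": 2.0}]
print(map_kernels(profs))  # (['gpu', 'cpu', 'gpu'], {'gpu': 3.0, 'cpu': 2.0})
```

The ILP comparison mentioned in the abstract would solve the same assignment problem exactly; the greedy rule trades optimality for negligible scheduling overhead.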
We believe that this approach has the potential to provide an effective solution for next-generation accelerator-based computing systems.

Item Open Access: Simulation of a flowing snow avalanche using molecular dynamics (2010). Güçer, Denizhan

This thesis presents an approach for modeling and simulating a flowing snow avalanche, formed of dry and liquefied snow sliding down a slope, using molecular dynamics and the discrete element method. A particle system serves as the base method for the simulation, and marching cubes with real-time shaders are employed for rendering. A uniform-grid-based neighbor search algorithm is used for collision detection in inter-particle and particle-terrain interactions. A mass-spring model of collision resolution is employed to mimic the compressibility of snow, and particle attraction forces are applied between particles and the terrain surface. To achieve greater performance, a general-purpose GPU language and multi-threaded programming are utilized for collision detection and resolution. The results are displayed with different combinations of rendering methods for a realistic representation of the flowing avalanche.

Item Open Access: The solution of large-scale electromagnetic problems with MLFMA on single-GPU systems (2022-01). Erkal, Mehmet Fatih

Advancements in computer technology have introduced many new hardware infrastructures with high-performance processing power. In recent years, the graphics processing unit (GPU) has been a popular choice in computational engineering because of its massively parallel processing capacity and its coding structure, which is compatible with modern programming systems. The full-wave solution of large-scale electromagnetic (EM) scattering problems with traditional methods involves very dense computational operations, so additional hardware acceleration becomes indispensable, especially for practical and industrial applications.
In this context, GPU implementations of full-wave electromagnetic solvers such as the multilevel fast multipole algorithm (MLFMA) have been a trend in the literature for the last decade. However, GPUs also impose many restrictions and bottlenecks when large-scale EM scattering problems are implemented with full-wave solvers; limited random-access memory (RAM) capacity and data transfer delays are the major ones. In this study, we propose a matrix partitioning scheme that overcomes the RAM restriction of GPUs, enabling the solution of electrically large problems on single-GPU systems with MLFMA while achieving reasonable accelerations under different implementation approaches. For this purpose, the Single-Instruction-Multiple-Data (SIMD) structure of the GPU is considered at each stage of MLFMA to check its compatibility. In addition, different MLFMA operators are fine-tuned at the GPU scale to minimize the overall effect of data transfer and device latency. Preliminary analyses show that significant time savings can be obtained for the different parts of MLFMA while the RAM restriction is eliminated. The numerical results demonstrate the overall efficiency of our proposed solution to the GPU bottlenecks and validate the expected accelerations for the solution of large-scale EM problems involving electrically large canonical geometries and real-life targets such as an aircraft and a missile geometry.
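The matrix-partitioning idea above, processing a matrix too large for device RAM one slab at a time, can be illustrated with a minimal host-side sketch. Plain Python stands in for the GPU transfers here; the sizes, names, and the dense matrix are illustrative, not the MLFMA data structures:

```python
def partitioned_matvec(matrix, x, max_rows):
    """Multiply matrix*x in row slabs of at most max_rows rows each.

    Emulates staging one slab at a time onto a device whose memory
    cannot hold the full operator: each slab would be copied host->device,
    multiplied there, and its partial result copied back.
    """
    y = []
    for start in range(0, len(matrix), max_rows):
        slab = matrix[start:start + max_rows]  # would be a host->device copy
        # Per-slab dense matvec; on a real GPU this is the kernel launch.
        y.extend(sum(a * b for a, b in zip(row, x)) for row in slab)
    return y

# A 4x2 matrix processed in two slabs of two rows each.
A = [[1, 0], [0, 2], [3, 1], [1, 1]]
print(partitioned_matvec(A, [2, 3], max_rows=2))  # [2, 6, 9, 5]
```

The result is identical for any slab size; only the peak device memory and the number of transfers change, which is the trade-off the proposed scheme tunes against transfer latency.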