Browsing by Subject "Computer architecture"

Open Access
Adaptive compute-phase prediction and thread prioritization to mitigate memory access latency
(ACM, 2014-06) Aktürk, İsmail; Öztürk, Özcan
The full potential of chip multiprocessors remains unexploited due to the thread-oblivious memory access schedulers used in off-chip main memory controllers. This is especially pronounced in embedded systems due to limitations in memory. We propose an adaptive compute-phase prediction and thread prioritization algorithm for memory access scheduling for embedded chip multiprocessors. The proposed algorithm efficiently categorizes threads based on execution characteristics and provides fine-grained prioritization that makes it possible to differentiate threads and prioritize their memory access requests accordingly. The threads in compute phase are prioritized over the threads in memory phase. Furthermore, the threads in compute phase are prioritized among themselves based on the potential of making more progress in their execution. Compared to the prior works First-Ready First-Come First-Serve (FR-FCFS) and Compute-phase Prediction with Writeback-Refresh Overlap (CP-WO), the proposed algorithm reduces the execution time of the generated workloads by up to 23.6% and 12.9%, respectively. Copyright 2014 ACM.
Open Access
Adaptive routing framework for network on chip architectures
(ACM, 2016-01) Mustafa, Naveed Ul; Öztürk, Özcan; Niar, S.
In this paper we suggest and demonstrate the idea of applying multiple routing algorithms during the execution of a real application mapped on a Network-on-Chip (NoC). The traffic pattern of a real application may change during its execution. As the performance of a routing algorithm depends on the traffic pattern, using the same routing algorithm for the entire span of execution may be inefficient. We study the feasibility of this idea for applications such as SPARSE and MPEG-4 decoder by applying different routing algorithms. By applying more than one routing algorithm, throughput improves by up to 17.37% and 6.74% for the SPARSE and MPEG-4 decoder applications, respectively, compared to the application of a single routing algorithm. © 2016 ACM.
Open Access
Application-specific heterogeneous network-on-chip design
(Oxford University Press, 2014) Demirbas, D.; Akturk, I.; Ozturk, O.; Güdükbay, Uğur
As a result of increasing communication demands, application-specific and scalable Network-on-Chips (NoCs) have emerged to connect processing cores and subsystems in Multiprocessor System-on-Chips. A challenge in application-specific NoC design is to find the right balance among different tradeoffs, such as communication latency, power consumption and chip area. We propose a novel approach that generates a latency-aware heterogeneous NoC topology. Experimental results show that our approach improves the total communication latency by up to 27% with modest power consumption. © The Author 2013. Published by Oxford University Press on behalf of The British Computer Society.
Open Access
Auto-tuning similarity search algorithms on multi-core architectures
(2013) Gedik, B.
In recent times, large high-dimensional datasets have become ubiquitous. Video and image repositories, financial, and sensor data are just a few examples of such datasets in practice. Many applications that use such datasets require the retrieval of data items similar to a given query item, or the nearest neighbors (NN or k-NN) of a given item. Another common query is the retrieval of multiple sets of nearest neighbors, i.e., multi k-NN, for different query items on the same data. With commodity multi-core CPUs becoming more and more widespread at lower costs, developing parallel algorithms for these search problems has become increasingly important. While the core nearest neighbor search problem is relatively easy to parallelize, it is challenging to tune it for optimality. This is because the various performance-specific algorithmic parameters, or "tuning knobs", are inter-related and also depend on the data and query workloads. In this paper, we present (1) a detailed study of the various tuning knobs and their contributions to increasing the query throughput for parallelized versions of the two most common classes of high-dimensional multi-NN search algorithms: linear scan and tree traversal, and (2) an offline auto-tuner for setting these knobs by iteratively measuring actual query execution times for a given workload and dataset. We show experimentally that our auto-tuner reaches near-optimal performance and significantly outperforms un-tuned versions of parallel multi-NN algorithms for real video repository data on a variety of multi-core platforms. © 2013 Springer Science+Business Media New York.
Open Access
Big-data streaming applications scheduling based on staged multi-armed bandits
(Institute of Electrical and Electronics Engineers, 2016) Kanoun, K.; Tekin, C.; Atienza, D.; Van Der Schaar, M.
Several techniques have been recently proposed to adapt Big-Data streaming applications to existing many-core platforms. Among these techniques, online reinforcement learning methods have been proposed that learn how to adapt at run-time the throughput and resources allocated to the various streaming tasks depending on dynamically changing data stream characteristics and the desired application performance (e.g., accuracy). However, most state-of-the-art techniques consider only a single input stream in their application model and assume that the system knows the amount of resources to allocate to each task to achieve a desired performance. To address these limitations, in this paper we propose a new systematic and efficient methodology and associated algorithms for online learning and energy-efficient scheduling of Big-Data streaming applications with multiple streams on many-core systems with resource constraints. We formalize the problem of multi-stream scheduling as a staged decision problem in which the performance obtained for various resource allocations is unknown. The proposed scheduling methodology uses a novel class of online adaptive learning techniques which we refer to as staged multi-armed bandits (S-MAB). Our scheduler is able to learn online which processing method to assign to each stream and how to allocate its resources over time in order to maximize the performance on the fly, at run-time, without having access to any offline information. The proposed scheduler, applied to a face detection streaming application and without using any offline information, achieves performance similar to that of an optimal semi-online solution that has full knowledge of the input stream: the differences in throughput, observed quality, resource usage and energy efficiency are less than 1, 0.3, 0.2 and 4 percent, respectively.
Open Access
Code scheduling for optimizing parallelism and data locality
(Springer, 2010-08-09) Yemliha, T.; Kandemir, M.; Öztürk, Özcan; Kultursay, E.; Muralidhara, S. P.
As chip multiprocessors proliferate, programming support for these devices is likely to receive a lot of attention in the near future. Parallelism and data locality are two critical issues in a chip multiprocessor environment. Unfortunately, most of the published work in the literature focuses only on one of these problems, and this can prevent one from achieving the best possible performance. The main goal of this paper is to propose and evaluate a compiler-directed code parallelization scheme, which considers both parallelism and data locality at the same time. Our compiler captures the inherent parallelism and data reuse in the application code being analyzed using a novel representation called the locality-parallelism graph (LPG). Our partitioning/scheduling algorithm assigns the nodes of this graph to the processors in the architecture and schedules them for execution. We implemented this algorithm and evaluated its effectiveness using a set of benchmark codes. The results collected so far indicate that our approach improves overall execution latency significantly. In this paper, we also introduce an ILP (Integer Linear Programming) based formulation of the problem, and implement the schedule obtained by the ILP solver. The results indicate that our approach gets within 4% of the ILP solution. © 2010 Springer-Verlag.
Open Access
A data-level parallel linear-quadratic penalty algorithm for multicommodity network flows
(Association for Computing Machinery, 1994) Pinar, M. C.; Zenios, S. A.
We describe the development of a data-level, massively parallel software system for the solution of multicommodity network flow problems. Using a smooth linear-quadratic penalty (LQP) algorithm we transform the multicommodity network flow problem into a sequence of independent min-cost network flow subproblems. The solution of these problems is coordinated via a simple, dense, nonlinear master program to obtain a solution that is feasible within some user-specified tolerance to the original multicommodity network flow problem. Particular emphasis is placed on the mapping of both the subproblem and master problem data to the processing elements of a massively parallel computer, the Connection Machine CM-2. As a result of this design we can solve large and sparse optimization problems on current SIMD massively parallel architectures. Details of the implementation are reported, together with summary computational results with a set of test problems drawn from a Military Airlift Command application.
Open Access
Deploy-DDS: Tool framework for supporting deployment architecture of data distribution service based systems
(ACM, 2014-08) Çelik, T.; Köksal, O.; Tekinerdoğan, Bedir
Data Distribution Service (DDS) is the Object Management Group's (OMG) new standard middleware after Common Object Request Broker Architecture (CORBA), which is becoming increasingly popular. One of the important problems in DDS Based Software Systems is the deployment configuration of DDS modules to the physical resources. In general, this can be done in many different ways whereby each deployment alternative will perform differently. Currently, the deployment configuration is decided after the coding phase and usually performed manually. For large configurations, finding the feasible deployment might require serious rework with costly and time consuming iterations. In this paper, we present the tool Deploy-DDS to support the selection and generation of deployment architectures of DDS based systems. The tool can be used to perform an evaluation during the design phase and generate the selected feasible configuration. © 2014 Authors.
Open Access
An efficient computation model for coarse grained reconfigurable architectures and its applications to a reconfigurable computer
(IEEE, 2010-07) Atak, Oğuzhan; Atalar, Abdullah
The mapping of high level applications onto coarse grained reconfigurable architectures (CGRA) is usually performed manually by using graphical tools or, when automatic compilation is used, some restrictions are imposed on the high level code. Since high level applications do not contain parallelism explicitly, mapping the application directly to CGRA is very difficult. In this paper, we present a middle level Language for Reconfigurable Computing (LRC). LRC is similar to assembly languages of microprocessors, with the difference that parallelism can be coded in LRC. LRC is an efficient language for describing control data flow graphs. Several applications such as FIR, multirate, multichannel filtering, FFT, 2D-IDCT, Viterbi decoding, UMTS and CCSDC turbo decoding, Wimax LDPC decoding are coded in LRC and mapped to the Bilkent Reconfigurable Computer with a performance (in terms of cycle count) close to that of ASIC implementations. The applicability of the computation model to a CGRA having a low cost interconnection network has been validated by using placement and routing algorithms. © 2010 IEEE.
Open Access
Efficient parallel spatial subdivision algorithm for object-based parallel ray tracing
(Pergamon Press, 1994) Aykanat, Cevdet; İşler, V.; Özgüç, B.
Parallel ray tracing of complex scenes on multicomputers requires the distribution of both computation and scene data to the processors. This is carried out during preprocessing and usually consumes too much time and memory. The paper presents an efficient parallel subdivision algorithm that decomposes a given scene into rectangular regions adaptively and maps the resultant regions to the node processors of a multicomputer. The proposed algorithm uses efficient data structures to identify the splitting planes quickly. Furthermore the mapping of the regions and the objects to the node processors is performed while parallel spatial subdivision proceeds. The proposed algorithm is implemented on an Intel iPSC/2 hypercube multicomputer and promising results have been obtained. © 1994.
Open Access
Efficient vectorization of forward/backward substitutions in solving sparse linear equations
(IEEE, 1994) Aykanat, Cevdet; Özgü, Özlem; Güven, N.
Vector processors have promised an enormous increase in computing speed for computationally intensive and time-critical power system problems which require the repeated solution of sparse linear equations. Due to short vectors processed in these applications, standard sparsity-based algorithms need to be restructured for efficient vectorization. This paper presents a novel data storage scheme and an efficient vectorization algorithm that exploits the intrinsic architectural features of vector computers such as sectioning and chaining. As the benchmark, the solution phase of the Fast Decoupled Load Flow algorithm is used in simulations. The relative performances of the proposed and existing vectorization schemes are evaluated, both theoretically and experimentally, on IBM 3090/VF.
Open Access
Emerging accelerator platforms for data centers
(IEEE, 2017-12-04) Özdal, Muhammet Mustafa
CPU and GPU platforms may not be the best options for many emerging compute patterns, which led to a new breed of emerging accelerator platforms. This article gives a comprehensive overview with a focus on commercial platforms.
Open Access
Energy efficient architecture for graph analytics accelerators
(IEEE, 2016-06) Özdal, Muhammet Mustafa; Yeşil, Şerif; Kim, T.; Ayupov, A.; Greth, J.; Burns, S.; Öztürk, Özcan
Specialized hardware accelerators can significantly improve the performance and power efficiency of compute systems. In this paper, we focus on hardware accelerators for graph analytics applications and propose a configurable architecture template that is specifically optimized for iterative vertex-centric graph applications with irregular access patterns and asymmetric convergence. The proposed architecture addresses the limitations of the existing multi-core CPU and GPU architectures for these types of applications. The SystemC-based template we provide can be customized easily for different vertex-centric applications by inserting application-level data structures and functions. After that, a cycle-accurate simulator and RTL can be generated to model the target hardware accelerators. In our experiments, we study several graph-parallel applications, and show that the hardware accelerators generated by our template can outperform a 24 core high end server CPU system by up to 3x in terms of performance. We also estimate the area requirement and power consumption of these hardware accelerators through physical-aware logic synthesis, and show up to 65x better power consumption with significantly smaller area. © 2016 IEEE.
Open Access
Energy reduction in 3D NoCs through communication optimization
(Springer Wien, 2015) Ozturk, O.; Akturk, I.; Kadayif, I.; Tosun, S.
Network-on-Chip (NoC) architectures and three-dimensional (3D) integrated circuits have been introduced as attractive options for overcoming the barriers in interconnect scaling while increasing the number of cores. Combining these two approaches is expected to yield better performance and higher scalability. This paper explores the possibility of combining these two techniques in a heterogeneity aware fashion. Specifically, on a heterogeneous 3D NoC architecture, we explore how different types of processors can be optimally placed to minimize data access costs. Moreover, we select the optimal set of links with optimal voltage levels. The experimental results indicate significant savings in energy consumption across a wide range of values of our major simulation parameters.
Open Access
Exploiting locality in sparse matrix-matrix multiplication on many-core architectures
(IEEE Computer Society, 2017) Akbudak, K.; Aykanat, Cevdet
Exploiting spatial and temporal localities is investigated for efficient row-by-row parallelization of the general sparse matrix-matrix multiplication (SpGEMM) operation of the form C = A × B on many-core architectures. Hypergraph and bipartite graph models are proposed for 1D rowwise partitioning of matrix A to evenly partition the work across threads with the objective of reducing the number of B-matrix words to be transferred from the memory and between different caches. A hypergraph model is proposed for B-matrix column reordering to exploit spatial locality in accessing entries of thread-private temporary arrays, which are used to accumulate results for C-matrix rows. A similarity graph model is proposed for B-matrix row reordering to increase temporal reuse of these accumulation array entries. The proposed models and methods are tested on a wide range of sparse matrices from real applications, and the experiments were carried out on a 60-core Intel Xeon Phi processor as well as a two-socket Xeon processor. Results show the validity of the models and methods proposed for enhancing the locality in parallel SpGEMM operations. © 1990-2012 IEEE.
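The row-by-row formulation with a temporary accumulation array mentioned in this abstract can be illustrated with a minimal serial sketch. This is not the authors' implementation: the function name `spgemm_rowwise`, the dict-per-row storage, and the `used` marker array are assumptions chosen for brevity, and the partitioning and reordering models are not shown. In a threaded version each thread would hold its own `acc` array.

```python
def spgemm_rowwise(a_rows, b_rows, n_cols):
    """Serial row-by-row C = A * B (hypothetical helper for illustration).

    a_rows, b_rows: one dict {column_index: value} per matrix row.
    n_cols: number of columns of B (and of C).
    """
    acc = [0.0] * n_cols      # temporary accumulation array (thread-private if parallel)
    used = [False] * n_cols   # marks which acc entries belong to the current C row
    c_rows = []
    for a_row in a_rows:
        touched = []
        for k, a_ik in a_row.items():
            # Row i of C is a linear combination of the rows of B selected by row i of A.
            for j, b_kj in b_rows[k].items():
                if not used[j]:
                    used[j] = True
                    touched.append(j)
                acc[j] += a_ik * b_kj
        # Gather the finished row and reset only the entries that were touched.
        c_row = {}
        for j in touched:
            c_row[j] = acc[j]
            acc[j] = 0.0
            used[j] = False
        c_rows.append(c_row)
    return c_rows
```

The order in which columns of B are visited determines how scattered the accesses to `acc` are, which is the kind of locality the abstract's column and row reordering targets.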
Open Access
Fundamentals of optical interconnections-a review
(IEEE, 1997-06) Özaktaş, Haldun M.
We review some of the relatively fundamental work in the area of optically interconnected digital computing systems. We cover comparisons of optical interconnections with other interconnection media in terms of energy and interconnection density, studies determining the optimal combination of optical and electrical interconnections that should be used, work on free-space optical interconnection architectures, complexity studies, and work on physical and logical system architectures and algorithms. We exclude work on devices, components, materials, and manufacturing.
Open Access
Implications of non-volatile memory as primary storage for database management systems
(IEEE, 2017) Mustafa, Naveed Ul; Armejach, A.; Öztürk, Özcan; Cristal, A.; Unsal, O. S.
Traditional Database Management System (DBMS) software relies on hard disks for storing relational data. Hard disks are cheap, persistent, and offer huge storage capacities. However, data retrieval latency for hard disks is extremely high. To hide this latency, DRAM is used as an intermediate storage. DRAM is significantly faster than disk, but deployed in smaller capacities due to cost and power constraints, and without the necessary persistency feature that disks have. Non-Volatile Memory (NVM) is an emerging storage class technology which promises the best of both worlds. It can offer large storage capacities, due to better scaling and cost metrics than DRAM, and is non-volatile (persistent) like hard disks. At the same time, its data retrieval time is much lower than that of hard disks and it is also byte-addressable like DRAM. In this paper, we explore the implications of employing NVM as primary storage for DBMS. In other words, we investigate the modifications necessary to be applied on a traditional relational DBMS to take advantage of NVM features. As a case study, we have modified the storage engine (SE) of PostgreSQL enabling efficient use of NVM hardware. We detail the necessary changes and challenges such modifications entail and evaluate them using a comprehensive emulation platform. Results indicate that our modified SE reduces query execution time by up to 40% and 14.4% when compared to disk and NVM storage, with average reductions of 20.5% and 4.5%, respectively. © 2016 IEEE.
Open Access
Integrating platform selection rules in the model driven architecture approach
(Springer, Berlin, Heidelberg, 2005) Tekinerdoǧan, B.; Bilir, Sevcan; Abatlevi, Cem
A key issue in the MDA approach is the transformation of platform independent models to platform specific models. Before transforming to a platform specific model, however, it is necessary to select the appropriate platform. Various platforms exist with different properties and the selection of the appropriate platform for the given application requirements is not trivial. An inappropriate selection of a platform, though, may easily lead to unnecessary loss of resources and lower the efficiency of the application development. Unfortunately, the selection of platforms in MDA is currently implicit and lacks systematic support. We propose to integrate so-called platform selection rules in the MDA approach for systematic selection of platforms. The platform selection rules are based on platform domain models that are derived through domain analysis techniques. We show that the selection of platforms is important throughout the whole MDA process and discuss the integration of the platform selection rules in the MDA approach. The platform selection rules have been implemented in the prototypical tool MDA Selector that provides automated support for the selection of a platform. The presented ideas are illustrated for a stock trading system. © Springer-Verlag Berlin Heidelberg 2005.
Open Access
Locality-aware parallel sparse matrix-vector and matrix-transpose-vector multiplication on many-core processors
(Institute of Electrical and Electronics Engineers, 2016) Karsavuran, M. O.; Akbudak, K.; Aykanat, Cevdet
Sparse matrix-vector and matrix-transpose-vector multiplication (SpMMTV) repeatedly performed as z ← A^T x and y ← A z (or y ← A w) for the same sparse matrix A is a kernel operation widely used in various iterative solvers. One important optimization for serial SpMMTV is reusing A-matrix nonzeros, which halves the memory bandwidth requirement. However, thread-level parallelization of SpMMTV that reuses A-matrix nonzeros necessitates concurrent writes to the same output-vector entries. These concurrent writes can be handled in two ways: via atomic updates or via thread-local temporary output vectors that undergo a reduction operation, neither of which is efficient or scalable on processors with many cores and complicated cache-coherency protocols. In this work, we identify five quality criteria for efficient and scalable thread-level parallelization of SpMMTV that utilizes one-dimensional (1D) matrix partitioning. We also propose two locality-aware 1D partitioning methods, which achieve reusing A-matrix nonzeros and intermediate z-vector entries; exploiting locality in accessing x-, y-, and z-vector entries; and reducing the number of concurrent writes to the same output-vector entries. These two methods utilize rowwise and columnwise singly bordered block-diagonal (SB) forms of A. We evaluate the validity of our methods on a wide range of sparse matrices. Experiments on the 60-core cache-coherent Intel Xeon Phi processor show the validity of the identified quality criteria and the validity of the proposed methods in practice. The results also show that the performance improvement from reusing A-matrix nonzeros compensates for the overhead of concurrent writes through the proposed SB-based methods.
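The reuse of A-matrix nonzeros described in this abstract can be sketched for the simpler independent variant, z ← A^T x together with y ← A w, where each nonzero is read once and used for both products. This is a serial illustration under assumed names (`fused_spmmtv`, rows stored as lists of `(column, value)` pairs), not the SB-based method proposed in the paper.

```python
def fused_spmmtv(rows, n_cols, x, w):
    """Serial fused z = A^T x and y = A w (hypothetical helper for illustration).

    rows: one list of (column_index, value) pairs per row of A.
    x has one entry per row of A; w has one entry per column of A.
    """
    z = [0.0] * n_cols
    y = [0.0] * len(rows)
    for i, row in enumerate(rows):
        x_i = x[i]
        y_i = 0.0
        for j, a_ij in row:
            y_i += a_ij * w[j]   # contribution to y = A w
            z[j] += a_ij * x_i   # same nonzero reused for z = A^T x
        y[i] = y_i
    return z, y
```

If the outer loop were split across threads, different rows could update the same `z[j]`, which is the concurrent-write problem the abstract refers to.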
Open Access
Making peace with your multimedia
(1998) Adali, S.
[No abstract available]