Browsing by Author "Öztürk, Özcan"

Now showing 1 - 20 of 50

Open Access
Adaptive compute-phase prediction and thread prioritization to mitigate memory access latency
(ACM, 2014-06) Aktürk, İsmail; Öztürk, Özcan
The full potential of chip multiprocessors remains unex- ploited due to the thread oblivious memory access sched- ulers used in off-chip main memory controllers. This is especially pronounced in embedded systems due to limita- Tions in memory. We propose an adaptive compute-phase prediction and thread prioritization algorithm for memory access scheduling for embedded chip multiprocessors. The proposed algorithm eficiently categorize threads based on execution characteristics and provides fine-grained priori- Tization that allows to differentiate threads and prioritize their memory access requests accordingly. The threads in compute phase are prioritized among the threads in mem- ory phase. Furthermore, the threads in compute phase are prioritized among themselves based on the potential of mak- ing more progress in their execution. Compared to the prior works First-Ready First-Come First-Serve (FR-FCFS) and Compute-phase Prediction with Writeback-Refresh Overlap (CP-WO), the proposed algorithm reduces the execution time of the generated workloads up to 23.6% and 12.9%, respectively. Copyright 2014 ACM.
Open Access
Adaptive prefetching for shared cache based chip multiprocessors
(IEEE, 2009-04) Kandemir, M.; Zhang, Y.; Öztürk, Özcan
Chip multiprocessors (CMPs) present a unique scenario for software data prefetching with subtle tradeoffs between memory bandwidth and performance. In a shared L2 based CMP, multiple cores compete for the shared on-chip cache space and limited off-chip pin bandwidth. Purely software based prefetching techniques tend to increase this contention, leading to degradation in performance. In some cases, prefetches can become harmful by kicking out useful data from the shared cache whose next usage is earlier than the prefetched data, and the fraction of such harmful prefetches usually increases when we increase the number of cores used for executing a multi-threaded application code. In this paper, we propose two complementary techniques to address the problem of harmful prefetches in the context of shared L2 based CMPs. These techniques, namely, suppressing select data prefetches (if they are found to be harmful) and pinning select data in the L2 cache (if they are found to be frequent victim of harmful prefetches), are evaluated in this paper using two embedded application codes. Our experiments demonstrate that these two techniques are very effective in mitigating the impact of harmful prefetches, and as a result, we extract significant benefits from software prefetching even with large core counts. © 2009 EDAA.
Open Access
Adaptive routing framework for network on chip architectures
(ACM, 2016-01) Mustafa, Naveed Ul; Öztürk, Özcan; Niar, S.
In this paper we suggest and demonstrate the idea of applying multiple routing algorithms during the execution of a real application mapped on a Network-on-Chip (NoC). Traffic pattern of a real application may change during its execution. As performance of an algorithm depends on the traffic pattern, using the same routing algorithm for the entire span of execution may be inefficient. We study the feasibility of this idea for applications such as SPARSE and MPEG-4 decoder, by applying different routing algorithms. By applying more than one routing algorithms, throughput improves up to 17.37% and 6.74% in the case of SPARSE and MPEG-4 decoder applications, respectively, as compared to the application of single routing algorithm. © 2016 ACM.
Open Access
Adaptive thread scheduling in chip multiprocessors
(Springer, 2019) Aktürk, İ.; Öztürk, Özcan
The full potential of chip multiprocessors remains unexploited due to architecture oblivious thread schedulers employed in operating systems. We introduce an adaptive cache-hierarchy-aware scheduler that tries to schedule threads in a way that inter-thread contention is minimized. A novel multi-metric scoring scheme is used which specifies L1 cache access characteristics of threads. Scheduling decisions are made based on these multi-metric scores of threads.
Open Access
Analysis of design parameters in SIL-4 safety-critical computer
(IEEE, 2017-01) Ahangari, Hamzeh; Özkök, Y. I.; Yıldırım, A.; Say, F.; Atik, Funda; Öztürk, Özcan
Nowadays, Safety-critical computers are extensively used in may civil domains like transportation including railways, avionics and automotive. We noticed that in design of some previous works, some critical safety design parameters like failure diagnostic coverage (DC) or common cause failure (CCF) ratio have not been seriously taken into account. Moreover, in some cases safety has not been compared with standard safety levels (IEC-61508 SIL1-SIL4) or even have not met them. Most often, it is not very clear that which part of the system is the Achilles' heel and how design can be improved to reach standard safety levels. Motivated by such design ambiguities, we aim to study the effect of various design parameters on safety in some prevalent safety configurations: 1oo2 and 2oo3. 1oo1 is also used as a reference. By employing Markov modeling, sensitivity of safety to each of the following critical design parameters is analyzed: failure rate of processing element, failure diagnostics coverage, common cause failures and repair rates. This study gives a deeper sense regarding influence of variation in design parameters over safety. Consequently, to meet appropriate safety integrity level, instead of improving some system parts blindly, it will be possible to make an informed decision on more relevant parameters. © 2017 IEEE.
Open Access
Architectural requirements for energy efficient execution of graph analytics applications
(IEEE, 2015-11) Özdal, Muhammet Mustafa; Yeşil, Şerif; Kim, T.; Ayupov, A.; Burns, S.; Öztürk, Özcan
Intelligent data analysis has become more important in the last decade especially because of the significant increase in the size and availability of data. In this paper, we focus on the common execution models and characteristics of iterative graph analytics applications. We show that the features that improve work efficiency can lead to significant overheads on existing systems. We identify the opportunities for custom hardware implementation, and outline the desired architectural features for energy efficient computation of graph analytics applications. © 2015 IEEE.
Open Access
Automatic selection of compiler optimizations by machine learning
(IEEE - Institute of Electrical and Electronics Engineers, 2023-08-28) Peker, Melih; Öztürk, Özcan; Yıldırım, S.; Uluyağmur Öztürk, M.
Many widely used telecommunications applications have extremely long run times. Therefore, faster and more efficient execution of these codes on the same hardware is important in critical telecommunication applications such as base stations. Compilers greatly affect the properties of the executable program to be created. It is possible to change properties such as compilation speed, execution time, power consumption and code size using compiler flags. This study aims to find the set of flags that will provide the shortest run time among hundreds of compiler flag combinations in GCC using code flow analysis, loop analysis and machine learning methods without running the program.
Open Access
AutopaR: An Automatic Parallelization Tool for Recursive Calls
(IEEE, 2014-09) Kalender, Mert Emin; Mergenci, Cem; Öztürk, Özcan
Manycore systems are becoming more and more powerful with the integration of hundreds of cores on a single chip. However, writing parallel programs on these manycore systems has become a problem since the amount of available parallel tools and applications are limited. Although exploiting parallelism in software is possible, it requires different design decisions, significant programmer effort and is error prone. Different libraries and tools try to make the transition to parallelism easier, however there is no concrete system to make it transparent to software developer. To this end, our proposed tool is a step forward to improve the current state. Our approach, Autopar, specifically aims at achieving automatic parallelization of recursive applications using static program analysis. It first decides on the recursive functions of a given program. Then, it performs analysis and collects information about these recursive functions. Our analysis module automatically collects program information without requiring any modification in the program design or developer involvement. Finally, it achieves automatic parallelization by introducing necessary OpenMP pragmas in appropriate places in the application. © 2014 IEEE.
Open Access
Boosting performance of directory-based cache coherence protocols with coherence bypass at subpage granularity and a novel on-chip page table
(ACM, 2016- 05) Soltaniyeh, M.; Kadayıf, I.; Öztürk, Özcan
Chip multiprocessors (CMPs) require effective cache coher-ence protocols as well as fast virtual-To-physical address trans-lation mechanisms for high performance. Directory-based cache coherence protocols are the state-of-The-Art approaches in many-core CMPs to keep the data blocks coherent at the last level private caches. However, the area overhead and high associativity requirement of the directory structures may not scale well with increasingly higher number of cores. As shown in some prior studies, a significant percentage of data blocks are accessed by only one core, therefore, it is not necessary to keep track of these in the directory struc-ture. In this study, we have two major contributions. First, we show that compared to the classification of cache blocks at page granularity as done in some previous studies, data block classification at subpage level helps to detect consid-erably more private data blocks. Consequently, it reduces the percentage of blocks required to be tracked in the di-rectory significantly compared to similar page level classification approaches. This, in turn, enables smaller directory caches with lower associativity to be used in CMPs without hurting performance, thereby helping the directory struc-ture to scale gracefully with the increasing number of cores. Memory block classification at subpage level, however, may increase the frequency of the Operating System's (OS) in-volvement in updating the maintenance bits belonging to subpages stored in page table entries, nullifying some por-tion of performance benefits of subpage level data classification. To overcome this, we propose a distributed on-chip page table as a our second contribution. © 2016 Copyright held by the owner/author(s).
Open Access
Cache hierarchy-aware query mapping on emerging multicore architectures
(IEEE, 2017) Öztürk, Özcan; Orhan, U.; Ding, W.; Yedlapalli, P.; Kandemir, M. T.
One of the important characteristics of emerging multicores/manycores is the existence of 'shared on-chip caches,' through which different threads/processes can share data (help each other) or displace each other's data (hurt each other). Most of current commercial multicore systems on the market have on-chip cache hierarchies with multiple layers (typically, in the form of L1, L2 and L3, the last two being either fully or partially shared). In the context of database workloads, exploiting full potential of these caches can be critical. Motivated by this observation, our main contribution in this work is to present and experimentally evaluate a cache hierarchy-aware query mapping scheme targeting workloads that consist of batch queries to be executed on emerging multicores. Our proposed scheme distributes a given batch of queries across the cores of a target multicore architecture based on the affinity relations among the queries. The primary goal behind this scheme is to maximize the utilization of the underlying on-chip cache hierarchy while keeping the load nearly balanced across domain affinities. Each domain affinity in this context corresponds to a cache structure bounded by a particular level of the cache hierarchy. A graph partitioning-based method is employed to distribute queries across cores, and an integer linear programming (ILP) formulation is used to address locality and load balancing concerns. We evaluate our scheme using the TPC-H benchmarks on an Intel Xeon based multicore. Our solution achieves up to 25 percent improvement in individual query execution times and 15-19 percent improvement in throughput over the default Linux-based process scheduler. © 1968-2012 IEEE.
Open Access
A cache topology-aware multi-query scheduler for multicore architectures
(IEEE, 2014) Orhan, U.; Ding, W.; Yedlapalli, P.; Kandemir, M.; Öztürk, Özcan
Growing performance gap between processors and main memory has made it worthwhile to consider off-chip data accesses in multi-query processing [2], [1], [3]. Exploiting data-sharing opportunities among concurrent queries can be critical for effective utilization of the underlying shared memory hierarchy. Given a set of queries, there may be a common retrieval operation for several cases to the same data. A query can benefit from the data previously loaded into the shared cache/memory space by another query. However, if these queries are scheduled independently, it is very likely that the same data is brought from off-chip memory to on-chip caches multiple times, thereby consuming off-chip bandwidth and slowing down overall execution.
Open Access
Çizge uygulamalarına özel işlemci tasarımı
(IEEE, 2022-08-29) Pulat, Gülce; Saeed, Aamir; Yenimol, Mehmetali Semi; Gülgeç, Utku; Öztürk, Özcan
Bir çok büyük veri işleme uygulaması “PageRank”, “İşbirliğine Dayalı Filtreleme” ve “Betweenness Centrality” gibi düzensiz ve yinelemeli çizge algoritmalarından yararlanmaktadır. Çizge yapılarında her bir düğüm bir kişiye ya da nesneye karşılık gelirken, her bir ayrıt da bir kişi ya da nesne çifti arasındaki ilişkiye karşılık gelmektedir. Böylesi büyük verileri işleyen algoritmaları yürütecek genel amaçlı işlemciler yetersiz kalabilmektedir. Özellikle, güç kısıtları sebebiyle işlemci alt parçalarından sadece bir kısmı aynı anda aktif olarak kullanılabilmektedir. Bu da işlemcinin etkin olarak kullanılmasına engel olarak aktif olmayan “karanlık silikon”lar ortaya çıkarmaktadır. Bu bildirinin amacı da büyük verili ve düzensiz çizge uygulamalarını hızlı, verimli ve kolay programlanabilir biçimde çalıştıracak bir işlemci mimarisi tasarlamaktır. Literatürde çizge uygulamaları için daha önce sunulmuş çalışmalar çoğunlukla sadece yazılım ya da hızlandırıcı seviyesindedir. İşlemci seviyesi tasarımlar ise donanımsal maliyet, talep ettikleri mimari destek ve komut seti değişiklikleri açılarından bu bildiri kapsamında tanıtılacak işlemciden farklıdır.
Open Access
Classifying data blocks at subpage granularity with an on-chip page table to improve coherence in tiled CMPs
(Institute of Electrical and Electronics Engineers, 2018) Soltaniyeh, M.; Kadayif, I.; Öztürk, Özcan
As shown in some prior studies, a significant percentage of data blocks accessed in parallel codes are private, and not keeping track of those blocks can improve the effectiveness of directory structures in Chip multiprocessors (CMPs). In this paper, we have two major contributions. First, we showed that compared to the classification of cache blocks at page granularity, data block classification (DBC) at subpage level helps to detect considerably more private data blocks. Based on this idea, we propose two different approaches for enhancing the effectiveness of directory caches in tiled CMPs. In the first approach, which is called quasi-dynamic subpage level DBC (QDBC), a data block is assumed to be private from the beginning of the program execution and stays private as long as the corresponding subpage is accessed by only one core. Our second approach, which is called dynamic subpage level DBC, turns a data block into private again after all blocks within the corresponding subpage are evicted from private cache hierarchy. Memory block classification at subpage level, however, may increase the frequency of the operating system involvement in updating the maintenance bits in page table entries. To overcome this, we propose, as a second contribution, a distributed table called as on-chip page table (o-CPT), which stores recently accessed page translations in the system. Our simulation results show that, compared to page level data classification, QDBC and DBC approaches relying on the o-CPT can detect significantly more private data blocks and considerably improve system performance.
Open Access
Code scheduling for optimizing parallelism and data locality
(Springer, 2010-08-09) Yemliha, T.; Kandemir, M.; Öztürk, Özcan; Kultursay, E.; Muralidhara, S. P.
As chip multiprocessors proliferate, programming support for these devices is likely to receive a lot of attention in the near future. Parallelism and data locality are two critical issues in a chip multiprocessor environment. Unfortunately, most of the published work in the literature focuses only on one of these problems, and this can prevent one from achieving the best possible performance. The main goal of this paper is to propose and evaluate a compiler-directed code parallelization scheme, which considers both parallelism and data locality at the same time. Our compiler captures the inherent parallelism and data reuse in the application code being analyzed using a novel representation called the locality-parallelism graph (LPG). Our partitioning/scheduling algorithm assigns the nodes of this graph to the processors in the architecture and schedules them for execution. We implemented this algorithm and evaluated its effectiveness using a set of benchmark codes. The results collected so far indicate that our approach improves overall execution latency significantly. In this paper, we also introduce an ILP (Integer Linear Programming) based formulation of the problem, and implement the schedule obtained by the ILP solver. The results indicate that our approach gets within 4% of the ILP solution. © 2010 Springer-Verlag.
Open Access
Compiler-supported selective software fault tolerance
(IEEE, 2023-11-15) Turhan, Tuncer; Tekgul, H.; Öztürk, Özcan
As technology advances, the processors are shrunk in size and manufactured using higher-density transistors, making them cheaper, more power efficient, and more powerful. While this progress is most beneficial to end-users, these advances make processors more vulnerable to outside radiation, causing soft errors, mostly in single-bit flips on data. In applications where a certain margin of error is acceptable, and availability is important, the existing software fault tolerance techniques may not be applied directly because of the unacceptable performance overheads they introduce to the system. We propose a technique that ranks the instructions in terms of their criticality and generates a more reliable source code. This way, we improve reliability and minimize the performance overheads.
Open Access
Dynamic thread and data mapping for NoC based CMPs
(IEEE, 2009-07) Kandemir, M.; Öztürk, Özcan; Muralidhara, S. P.
Thread mapping and data mapping are two important problems in the context of NoC (network-on-chip) based CMPs (chip multiprocessors). While a compiler can determine suitable mappings for data and threads, such static mappings may not work well for multithreaded applications that go through different execution phases during their execution, each phase with potentially different data access patterns than others. Instead, a dynamic mapping strategy, if its overheads can be kept low, may be a more promising option. In this work, we present dynamic (runtime) thread and data mappings for NoC based CMPs. The goal of these mappings is to reduce the distance between the location of the core that requests data and the core whose local memory contains that requested data. In our experiments, we evaluate our proposed thread mapping and data mapping in isolation as well as in an integrated manner. Copyright 2009 ACM.
Open Access
Effective kernel mapping for OpenCL applications in heterogeneous platforms
(Institute of Electrical and Electronics Engineers, 2012-09) Albayrak, Ömer Erdil; Aktürk, İsmail; Öztürk, Özcan
Many core accelerators are being deployed in many systems to improve the processing capabilities. In such systems, application mapping need to be enhanced to maximize the utilization of the underlying architecture. Especially in GPUs mapping becomes critical for multi-kernel applications as kernels may exhibit different characteristics. While some of the kernels run faster on GPU, others may refer to stay in CPU due to the high data transfer overhead. Thus, heterogeneous execution may yield to improved performance compared to executing the application only on CPU or only on GPU. In this paper, we propose a novel profiling-based kernel mapping algorithm to assign each kernel of an application to the proper device to improve the overall performance of an application. We use profiling information of kernels on different devices and generate a map that identifies which kernel should run on where to improve the overall performance of an application. Initial experiments show that our approach can effectively map kernels on CPU and GPU, and outperforms to a CPU-only and GPU-only approach. © 2012 IEEE.
Open Access
Energy efficient architecture for graph analytics accelerators
(IEEE, 2016-06) Özdal, Muhammet Mustafa; Yeşil, Şerif; Kim, T.; Ayupov, A.; Greth, J.; Burns, S.; Öztürk, Özcan
Specialized hardware accelerators can significantly improve the performance and power efficiency of compute systems. In this paper, we focus on hardware accelerators for graph analytics applications and propose a configurable architecture template that is specifically optimized for iterative vertex-centric graph applications with irregular access patterns and asymmetric convergence. The proposed architecture addresses the limitations of the existing multi-core CPU and GPU architectures for these types of applications. The SystemC-based template we provide can be customized easily for different vertex-centric applications by inserting application-level data structures and functions. After that, a cycle-accurate simulator and RTL can be generated to model the target hardware accelerators. In our experiments, we study several graph-parallel applications, and show that the hardware accelerators generated by our template can outperform a 24 core high end server CPU system by up to 3x in terms of performance. We also estimate the area requirement and power consumption of these hardware accelerators through physical-aware logic synthesis, and show up to 65x better power consumption with significantly smaller area. © 2016 IEEE.
Open Access
Exploiting architectural features of a computer vision platform towards reducing memory stalls
(Springer, 2020) Mustafa, Naveed Ul; O’Riordan, M. J.; Rogers, S.; Öztürk, Özcan
Computer vision applications are becoming more and more popular in embedded systems such as drones, robots, tablets, and mobile devices. These applications are both compute and memory intensive, with memory bound stalls (MBS) making a significant part of their execution time. For maximum reduction in memory stalls, compilers need to consider architectural details of a platform and utilize its hardware components efficiently. In this paper, we propose a compiler optimization for a vision-processing system through classification of memory references to reduce MBS. As the proposed optimization is based on the architectural features of a specific platform, i.e., Myriad 2, it can only be applied to other platforms having similar architectural features. The optimization consists of two steps: affinity analysis and affinity-aware instruction scheduling. We suggest two different approaches for affinity analysis, i.e., source code annotation and automated analysis. We use LLVM compiler infrastructure for implementation of the proposed optimization. Application of annotation-based approach on a memory-intensive program shows a reduction in stall cycles by 67.44%, leading to 25.61% improvement in execution time. We use 11 different image-processing benchmarks for evaluation of automated analysis approach. Experimental results show that classification of memory references reduces stall cycles, on average, by 69.83%. As all benchmarks are both compute and memory intensive, we achieve improvement in execution time by up to 30%, with a modest average of 5.79%.
Open Access
Fault-tolerant irregular topology design method for network-on-chips
(IEEE, 2014) Tosun, S.; Ajabshir V.B.; Mercanoglu O.; Öztürk, Özcan
As the technology sizes of integrated circuits (ICs) scale down rapidly, current transistor densities on chips dramatically increase. While nanometer feature sizes allow denser chip designs in each technology generation, fabricated ICs become more susceptible to wear-outs, causing operation failure. Even a single link failure within an on-chip fabric can halt communication between application blocks, which makes the entire chip useless. In this study, we aim to make faulty chips designed with Network-on-Chip (NoC) communication usable. Specifically, we present a fault-tolerant irregular topology generation method for application specific NoC designs. Designed NoC topology allows a different routing path if there is a link failure on the default routing. We compare fault-tolerant topologies with regular fault-tolerant ring topologies, and non-fault-tolerant application specific irregular topologies on energy consumption, performance, and area using multimedia benchmarks and custom-generated graphs. © 2014 IEEE.