HLS-based high-throughput and work-efficient synthesizable graph processing template pipeline

Ahangari, Hamzeh; Özdal, Muhammet Mustafa; Öztürk, Özcan

buir.contributor.author	Ahangari, Hamzeh
buir.contributor.author	Özdal, Muhammet Mustafa
buir.contributor.author	Öztürk, Özcan
buir.contributor.orcid	Ahangari, Hamzeh|0000-0001-9272-816X
dc.citation.epage	34-24	en_US
dc.citation.issueNumber	2
dc.citation.spage	34-1
dc.citation.volumeNumber	22
dc.contributor.author	Ahangari, Hamzeh
dc.contributor.author	Özdal, Muhammet Mustafa
dc.contributor.author	Öztürk, Özcan
dc.contributor.editor	Mitra, Tulika
dc.date.accessioned	2024-03-06T13:29:07Z
dc.date.available	2024-03-06T13:29:07Z
dc.date.issued	2024-01-24
dc.department	Department of Computer Engineering
dc.description.abstract	Hardware systems composed of diverse execution resources are being deployed to cope with the complexity and performance requirements of Artificial Intelligence (AI) and Machine Learning (ML) applications. With the emergence of new hardware platforms, system-wide programming support has become much more important. While this is true for devices ranging from CPUs to GPUs, it is especially critical for specialized neural network accelerators implemented on FPGAs. For example, Intel’s recent HARP platform combines a Xeon CPU and an FPGA, and requires an extensive software stack to be used effectively. Programming such a hybrid system is a challenge for most non-expert users. High-level language solutions such as Intel OpenCL for FPGA try to address the problem. However, as the abstraction level increases, the efficiency of the implementation decreases, so programmability and efficiency are opposing requirements. In this work, we propose a framework to generate an HLS-based, FPGA-accelerated, high-throughput, work-efficient, synthesizable, and template-based graph-processing pipeline. A fixed deep-pipeline architecture, written in SystemC and designed with clock-cycle precision, is responsible for processing graph vertices, while the user implements the intended iterative graph algorithm by implementing or modifying only a single module in C/C++. This way, efficiency and high performance can be achieved with better programmability and productivity. With similar programming effort, the proposed template is shown to outperform a high-throughput OpenCL baseline by up to 50% in terms of edge throughput. Furthermore, the novel work-efficient design improves execution time and power consumption by up to 100×.
dc.identifier.doi	10.1145/3529256	en_US
dc.identifier.eissn	1558-3465	en_US
dc.identifier.issn	1539-9087	en_US
dc.identifier.uri	https://hdl.handle.net/11693/114363	en_US
dc.language.iso	English	en_US
dc.publisher	Association for Computing Machinery	en_US
dc.relation.isversionof	https://dx.doi.org/10.1145/3529256
dc.source.title	ACM Transactions on Embedded Computing Systems
dc.subject	Computer systems organization
dc.subject	Reconfigurable computing
dc.subject	Computing
dc.subject	Methodologies
dc.subject	Parallel programming languages
dc.title	HLS-based high-throughput and work-efficient synthesizable graph processing template pipeline
dc.type	Article

Files

Original bundle

Name: HLS-based_high-throughput_and_work-efficient_synthesizable_graph_processing_template_pipeline.pdf
Size: 4.55 MB
Format: Adobe Portable Document Format

Collections

Scholarly Publications - Computer Engineering