HLS-based high-throughput and work-efficient synthesizable graph processing template pipeline

buir.contributor.author: Ahangari, Hamzeh
buir.contributor.author: Özdal, Muhammet Mustafa
buir.contributor.author: Öztürk, Özcan
buir.contributor.orcid: Ahangari, Hamzeh|0000-0001-9272-816X
dc.citation.epage: 34-24
dc.citation.issueNumber: 2
dc.citation.spage: 34-1
dc.citation.volumeNumber: 22
dc.contributor.author: Ahangari, Hamzeh
dc.contributor.author: Özdal, Muhammet Mustafa
dc.contributor.author: Öztürk, Özcan
dc.contributor.editor: Mitra, Tulika
dc.date.accessioned: 2024-03-06T13:29:07Z
dc.date.available: 2024-03-06T13:29:07Z
dc.date.issued: 2024-01-24
dc.department: Department of Computer Engineering
dc.description.abstract: Hardware systems composed of diverse execution resources are being deployed to cope with the complexity and performance requirements of Artificial Intelligence (AI) and Machine Learning (ML) applications. With the emergence of new hardware platforms, system-wide programming support has become much more important. While this is true for devices ranging from CPUs to GPUs, it is especially critical for specialized neural network accelerators implemented on FPGAs. For example, Intel's recent HARP platform combines a Xeon CPU and an FPGA, and requires an extensive software stack to be used effectively. Programming such a hybrid system is a challenge for most non-expert users. High-level language solutions such as Intel OpenCL for FPGA try to address the problem. However, as the abstraction level increases, the efficiency of the implementation decreases, so the two requirements pull in opposite directions. In this work, we propose a framework to generate an HLS-based, FPGA-accelerated, high-throughput/work-efficient, synthesizable, and template-based graph-processing pipeline. A fixed deep-pipeline architecture, written in SystemC and designed precisely at the clock-cycle level, is responsible for processing graph vertices, while the user implements the intended iterative graph algorithm by implementing or modifying only a single module in C/C++. This way, efficiency and high performance can be achieved with better programmability and productivity. With similar programming effort, the proposed template is shown to outperform a high-throughput OpenCL baseline by up to 50% in terms of edge throughput. Furthermore, the novel work-efficient design improves execution time and power consumption by up to 100×.
dc.description.provenance: Made available in DSpace on 2024-03-06T13:29:07Z (GMT). No. of bitstreams: 1. HLS-based_high-throughput_and_work-efficient_synthesizable_graph_processing_template_pipeline.pdf: 4766536 bytes, checksum: d1fd994d5e01d76fe7539198ffebf80e (MD5). Previous issue date: 2024-01-24
dc.identifier.doi: 10.1145/3529256
dc.identifier.eissn: 1558-3465
dc.identifier.issn: 1539-9087
dc.identifier.uri: https://hdl.handle.net/11693/114363
dc.language.iso: en
dc.publisher: Association for Computing Machinery
dc.relation.isversionof: https://dx.doi.org/10.1145/3529256
dc.source.title: ACM Transactions on Embedded Computing Systems
dc.subject: Computer systems organization
dc.subject: Reconfigurable computing
dc.subject: Computing
dc.subject: Methodologies
dc.subject: Parallel programming languages
dc.title: HLS-based high-throughput and work-efficient synthesizable graph processing template pipeline
dc.type: Article

Files

Original bundle
Name: HLS-based_high-throughput_and_work-efficient_synthesizable_graph_processing_template_pipeline.pdf
Size: 4.55 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 2.01 KB
Format: Item-specific license agreed upon to submission