Custom hardware optimizations for reliable and high performance computer architectures
In recent years, we have witnessed a huge wave of innovations, such as in Artificial Intelligence (AI) and Internet-of-Things (IoT). In this trend, software tools are constantly and increasingly demanding more processing power, which can no longer be met by processors traditionally. In response to this need, a diverse range of hardware, including GPUs, FPGAs, and AI accelerators, are coming to the market every day. On the other hand, while hardware platforms are becoming more power-hungry due to higher performance demand, concurrent reduction in the size of transistors, and placing high emphasis on reducing the voltage, altogether have always been sources of reliability concerns in circuits. This particularly is applicable to error-sensitive applications, such as transportation and aviation industries where an error can be catastrophic. The reliability issues may have other reasons too, like harsh environmental conditions. These two problems of modern electronic circuits, meaning the need for higher performance and reliability at the same time, require appropriate solutions. In order to satisfy both the performance and the reliability constraints either designs based on reconfigurable circuits, such as FPGAs, or designs based on Commercial-Off-The-Shelf (COTS) components like general-purpose processors, can be an appropriate approach because the platforms can be used in a wide variety of applications. In this regard, three solutions have been proposed in this thesis. These solutions target
- safety and reliability at the system-level using redundant processors, 2) performance at the architecture-level using multiple accelerators, and 3) reliability at the circuit-level through the use of redundant transistors. Specifically, in the first work, the contribution of some prevalent parameters in the design of safetycritical computers, using COTS processors, is discussed. Redundant architectures are modeled by the Markov chains, and sensitivity of system safety to parameters has been analyzed. Most importantly, the significant presence of Common Cause Failures (CCFs) has been investigated. In the second work, the design, and implementation of an HLS-based, FPGA-accelerated, high-throughput/work-efficient, synthesizable template-based graph processing framework has been presented. The template framework is simplified for easy mapping to FPGA, even for software programmers. The framework is particularly experimented on Intel state-ofthe-art Xeon+FPGA platform to implement iterative graph algorithms. Beside high-throughput pipeline, work-efficient mode significantly reduces total graph processing run-time with a novel active-list design. In the third work, Joint SRAM (JSRAM) cell, a novel circuit-level technique to exploit the trade-off between reliability and memory size, is introduced. This idea is applicable to any SRAM structure like cache memory, register file, FPGA block RAM, or FPGA look-up table (LUT), and even latches and Flip-Flops. In fault-prone conditions, the structure can be configured in such a way that four cells are combined together at the circuit level to form one large and robust memory bit. Unlike prevalent hardware redundancy techniques, like Triple Modular Redundancy (TMR), there is no explicit majority voter at the output. The proposed solution mainly focuses on transient faults, where the reliable mode can provide auto-correction and full immunity against single faults.