SSFT: selective software fault tolerance

Date

2014

Advisor

Öztürk, Özcan

Language

English

Type

Thesis

Abstract

As technology advances, processors shrink in size and are manufactured with higher transistor densities, which makes them cheaper, more power-efficient, and more powerful. While this progress benefits end users, it also makes processors more vulnerable to external radiation, which causes soft errors that occur mostly as single-bit flips in data. To protect against soft errors, hardware techniques such as ECC (Error-Correcting Code) and RAM parity memory have been proposed to provide error detection and, in some cases, error correction. While hardware techniques are effective, software-only techniques may offer cheaper and more flexible alternatives where additional hardware is not available or cannot be introduced into existing architectures. Software fault detection techniques, while powerful, rely mostly on redundancy, which causes significant performance overhead and increases the number of bits susceptible to soft errors. In most cases where reliability is a concern, the availability and performance of the system are an even bigger concern, which calls for a multi-objective optimization approach. In applications where a certain margin of error is acceptable and availability is important, existing software fault tolerance techniques may not be directly applicable because of the unacceptable performance overhead they introduce. Our technique, Selective Software Fault Tolerance (SSFT), aims to provide availability and reliability simultaneously by providing only the required amount of protection while preserving the quality of the program output. SSFT uses software profiling information to understand an application's vulnerabilities to transient faults. Transient faults are more likely to occur in instructions that have higher execution counts. Additionally, instructions that cause greater damage to the program output when hit by transient faults should be considered application weaknesses in terms of reliability. SSFT combines this information to exclude from fault tolerance the instructions that are less likely to be hit by transient errors or to cause errors in the program output. This approach reduces power consumption and redundancy (and therefore the number of data bits susceptible to soft errors), while improving performance and providing acceptable reliability. The technique can easily be adapted to existing software fault tolerance techniques to achieve a form of protection that better satisfies the different concerns of the application. Similarly, hybrid and hardware-only approaches may also take advantage of the optimizations provided by our technique.
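
The selection idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how execution counts and output damage are combined, so everything here is an assumption made for illustration: the `InstructionProfile` record, the product `exec_count * damage` as a vulnerability score, and the `coverage` threshold are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionProfile:
    """Hypothetical per-instruction profiling record (not from the thesis)."""
    name: str
    exec_count: int  # how often the instruction executed during profiling
    damage: float    # assumed 0..1 estimate of output damage, e.g. from fault injection


def select_for_protection(profiles, coverage=0.9):
    """Pick the instructions to protect.

    Assumption: vulnerability is scored as exec_count * damage. Instructions
    are taken in descending score order until the selected ones account for
    at least `coverage` of the total score; the rest are left unprotected.
    """
    def score(p):
        return p.exec_count * p.damage

    ranked = sorted(profiles, key=score, reverse=True)
    total = sum(score(p) for p in ranked)
    if total == 0:
        return []  # nothing appears vulnerable under this scoring

    selected, covered = [], 0.0
    for p in ranked:
        if covered / total >= coverage:
            break
        selected.append(p.name)
        covered += score(p)
    return selected


profiles = [
    InstructionProfile("load", 1000, 0.8),
    InstructionProfile("add", 500, 0.1),
    InstructionProfile("store", 10, 0.9),
    InstructionProfile("nop", 2000, 0.0),
]
protected = select_for_protection(profiles, coverage=0.9)
```

In this sketch, raising `coverage` protects more instructions at the cost of more redundancy, which mirrors the reliability versus performance trade-off the abstract describes. Other scoring functions or thresholds would be equally plausible.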

Keywords

Software Fault Tolerance, Multi objective optimization: Reliability and Availability, Reliability, Software Profiling for Reliability, Software Fault Injection

Degree Discipline

Computer Engineering

Degree Level

Master's

Degree Name

MS (Master of Science)
