SSFT: selective software fault tolerance
Author
Turhan, Tuncer
Advisor
Öztürk, Özcan
Date
2014
Publisher
Bilkent University
Language
English
Type
Thesis
Abstract
As technology advances, processors shrink in size and are manufactured
with higher transistor densities, which makes them cheaper, more power efficient,
and more powerful. While this progress benefits end users, these advances also
make processors more vulnerable to external radiation, which causes soft errors
that occur mostly in the form of single bit flips in data. For protection against
soft errors, hardware techniques such as ECC (Error Correcting Code) and RAM
parity memory have been proposed to provide error detection and even error correction
capabilities. While hardware techniques provide effective solutions, software-only
techniques may offer cheaper and more flexible alternatives where additional
hardware is not available or cannot be introduced into existing architectures. Software
fault detection techniques, while powerful, rely mostly on redundancy, which
causes significant performance overhead and increases the number
of bits susceptible to soft errors. In most cases where reliability is a concern,
the availability and performance of the system are an even bigger concern, which
calls for a multi-objective optimization approach. In applications where
a certain margin of error is acceptable and availability is important, the existing
software fault tolerance techniques may not be directly applicable because of the
unacceptable performance overheads they introduce to the system. Our technique,
Selective Software Fault Tolerance (SSFT), aims to provide availability
and reliability simultaneously by providing only the required amount of protection
while preserving the quality of the program output. SSFT uses software profiling
information to understand an application’s vulnerabilities to transient faults.
Transient faults are more likely to occur in instructions that have higher execution
counts. Additionally, instructions that cause greater damage to the program
output when hit by transient faults should be considered application weaknesses
in terms of reliability. SSFT combines these two pieces of information to exclude from
fault tolerance the instructions that are less likely to be hit by transient errors
or to cause errors in the program output. This approach reduces power consumption
and redundancy (and therefore the number of data bits susceptible to soft errors), while improving
performance and providing acceptable reliability. This technique can easily be
adapted to existing software fault tolerance techniques in order to achieve a more
suitable form of protection that will satisfy different concerns of the application.
Similarly, hybrid and hardware-only approaches may also take advantage of the
optimizations provided by our technique.
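
The selection idea described above can be illustrated with a minimal sketch. The abstract does not give a scoring formula, so everything here is an assumption made for illustration: the score (share of executions multiplied by observed output damage), the `coverage` threshold, and the names `InstructionProfile` and `select_protected` are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionProfile:
    """Hypothetical per-instruction profile record (not from the thesis)."""
    name: str
    exec_count: int       # assumed: dynamic execution count from profiling
    output_damage: float  # assumed: mean output error under fault injection, in [0, 1]


def select_protected(profiles, coverage=0.9):
    """Pick the instructions to protect, most vulnerable first.

    Assumed score: (share of all executions) * (output damage).
    Instructions are added until `coverage` of the total score is reached;
    instructions with a zero score are never protected.
    """
    total_exec = sum(p.exec_count for p in profiles)
    if total_exec == 0:
        return []
    scores = {p.name: (p.exec_count / total_exec) * p.output_damage for p in profiles}
    total_score = sum(scores.values())
    ranked = sorted(scores, key=scores.get, reverse=True)

    protected, covered = [], 0.0
    for name in ranked:
        if scores[name] <= 0 or covered >= coverage * total_score:
            break
        protected.append(name)
        covered += scores[name]
    return protected


# Made-up example data, for illustration only.
example = [
    InstructionProfile("load_pixel", exec_count=1000, output_damage=0.8),
    InstructionProfile("init_table", exec_count=10, output_damage=0.9),
    InstructionProfile("log_debug", exec_count=500, output_damage=0.0),
    InstructionProfile("accumulate", exec_count=490, output_damage=0.5),
]
print(select_protected(example, coverage=0.9))
```

In this sketch, a rarely executed instruction such as `init_table` is left unprotected at the 0.9 threshold even though a fault in it would be damaging, and `log_debug` is never protected because faults in it do not affect the output. Raising `coverage` trades performance for reliability, which is the kind of tuning the abstract describes.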
Keywords
Software Fault Tolerance
Multi objective optimization: Reliability and Availability
Reliability
Software Profiling for Reliability
Software Fault Injection