SSFT: selective software fault tolerance
Please cite this item using this persistent URL: http://hdl.handle.net/11693/30002
As technology advances, processors shrink in size and are manufactured with higher transistor densities, which makes them cheaper, more power-efficient, and more powerful. While this progress benefits end users, it also makes processors more vulnerable to external radiation, which causes soft errors that occur mostly as single bit flips in data. For protection against soft errors, hardware techniques such as ECC (error-correcting code) and RAM parity memory have been proposed to provide error detection and even error correction capabilities. While hardware techniques provide effective solutions, software-only techniques may offer cheaper and more flexible alternatives where additional hardware is not available or cannot be introduced into existing architectures. Software fault detection techniques, while powerful, rely mostly on redundancy, which causes significant performance overhead and increases the number of bits susceptible to soft errors. In most cases where reliability is a concern, the availability and performance of the system are an even bigger concern, which calls for a multi-objective optimization approach. In applications where a certain margin of error is acceptable and availability is important, existing software fault tolerance techniques may not be directly applicable because of the unacceptable performance overhead they introduce. Our technique, Selective Software Fault Tolerance (SSFT), aims to provide availability and reliability simultaneously by providing only the required amount of protection while preserving the quality of the program output. SSFT uses software profiling information to understand an application's vulnerabilities to transient faults. Transient faults are more likely to occur in instructions that have higher execution counts.
Additionally, instructions that cause greater damage to the program output when hit by transient faults should be considered application weaknesses in terms of reliability. SSFT combines these two pieces of information to exclude from fault tolerance the instructions that are less likely to be hit by transient errors or to cause errors in the program output. This approach reduces power consumption and redundancy (and therefore the number of data bits susceptible to soft errors), while improving performance and providing acceptable reliability. The technique can easily be adapted to existing software fault tolerance techniques to achieve a form of protection better suited to the different concerns of the application. Similarly, hybrid and hardware-only approaches may also take advantage of the optimizations provided by our technique.
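The selection idea described above can be illustrated with a minimal sketch. This is not the thesis implementation: the scoring formula (share of dynamic executions multiplied by a measured output-damage value), the `coverage` threshold, and all names such as `InstructionProfile` and `select_protected` are assumptions introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionProfile:
    """Profiling data for one static instruction (hypothetical fields)."""
    name: str
    exec_count: int        # executions observed during profiling
    output_damage: float   # assumed output degradation in [0, 1] if a fault hits


def vulnerability(profile, total_execs):
    """Assumed score: likelihood of being hit times damage to the output."""
    if total_execs == 0:
        return 0.0
    return (profile.exec_count / total_execs) * profile.output_damage


def select_protected(profiles, coverage=0.9):
    """Pick instructions to protect, most vulnerable first, until the chosen
    fraction of the total vulnerability score is covered. Everything else is
    left unprotected to save redundancy and run time."""
    total_execs = sum(p.exec_count for p in profiles)
    scored = sorted(
        ((vulnerability(p, total_execs), p.name) for p in profiles),
        reverse=True,
    )
    total_score = sum(score for score, _ in scored)
    protected, covered = [], 0.0
    for score, name in scored:
        if total_score == 0 or covered >= coverage * total_score:
            break
        protected.append(name)
        covered += score
    return protected


# Hypothetical profile: two hot, damaging instructions and two that are
# either rarely executed or nearly harmless to the output.
PROFILES = [
    InstructionProfile("load_pixel", 1000, 0.8),
    InstructionProfile("loop_counter", 1000, 0.9),
    InstructionProfile("debug_log", 10, 0.1),
    InstructionProfile("dither_lsb", 990, 0.01),
]
print(select_protected(PROFILES))
```

With these made-up numbers, only the two hot and damaging instructions are selected. The rarely executed one and the frequently executed but nearly harmless one are left without redundancy, which is the trade-off between overhead and output quality that the abstract describes.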