CAPRI6: An ASIC for fault root-causing

Capri6 Prototype

Root-cause analysis of faults at the level of software is difficult because the many abstraction levels between hardware and software obscure the root cause of the fault effect. In particular, after a fault injection, the hardware behavior transforms from a sane machine into a weird machine, potentially changing the meaning of the software or even of the instructions. The fault effect therefore becomes hard to understand, as the behavior of the hardware is no longer consistent with the original design. We developed an ASIC to help us study this problem and to define a methodology towards better root-causing. The fault root-causing methodology has applications in testing cryptographic protocols, fault countermeasures and potentially the hard problem of silent data corruption.

The problem of tracking the Weird Machine

In the experimental work that supports fault attack assessments, there are two schools of practice commonly found. In the first case, a designer uses simulation to trace the effect of a fault and decide, for example, if the fault results in information leakage or privilege escalation. In the second case, a designer uses physical fault injection on a prototype by means of clock/voltage glitching, EM injection or laser injection. From these experiments the designer then tries to deduce if information leakage can occur. But both these techniques fail to address the problem of fault attack assessment in depth. In the case of simulation, designers are forced to choose a fault model (such as random bit flip) which may or may not occur in practice. Hence, the assessment is done on the simulation under an artificial fault model, rather than on the actual artifact under an actual fault stressor. In the case of experimental fault attack, the observability of the fault response is limited to cases such as no faulty output, faulty output and no response. As a result, the fault assessment is probabilistic and fails to explain the true impact of the fault stressor in the architecture under test.
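The limited observability of a physical fault injection campaign can be illustrated with a minimal Python sketch. The function names, the expected output value and the list of observations are hypothetical and serve only as illustration; they are not data from a CAPRI6 experiment.

```python
from collections import Counter
from typing import Dict, List, Optional

CATEGORIES = ("no faulty output", "faulty output", "no response")

def classify(observed: Optional[bytes], expected: bytes) -> str:
    """Map one injection attempt to one of the three externally visible outcomes."""
    if observed is None:            # the device hung or reset, nothing came back
        return "no response"
    if observed == expected:
        return "no faulty output"
    return "faulty output"

def outcome_rates(observations: List[Optional[bytes]], expected: bytes) -> Dict[str, float]:
    """Fraction of attempts per outcome: all that an external observer can report."""
    counts = Counter(classify(o, expected) for o in observations)
    return {c: counts[c] / len(observations) for c in CATEGORIES}

# Hypothetical campaign of four glitch attempts (illustrative values only).
expected = bytes.fromhex("3ad7")
observations = [expected, expected, bytes.fromhex("3ad5"), None]
rates = outcome_rates(observations, expected)
print(rates)
```

The result is a frequency table over three outcomes. Nothing in it says which hardware state was corrupted or why, which is the gap that the rest of this page addresses.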

Recent research into the deeper nature of the fault injection response of digital hardware has led to the definition of the weird machine (cf. Thomas Dullien, 2020). When software executes on non-faulty hardware, the resulting set of hardware states makes up the sane machine: the software runs under the correct implementation of the instruction-set architecture. After a fault injection, the hardware may reach a state that is unreachable from the sane machine. Such states, and the behavior that follows from them, make up the weird machine. The weird machine is gigantic and unknown, and the fault response cannot be explained from software alone. For example, a bit flip could change an instruction operand, an instruction opcode, or even the meaning of an instruction opcode. Yet, understanding the rules of the weird machine is essential to correctly evaluate the fault response of a design under test. For example, if a software program shows no faulty output after a fault injection, does that really mean that no future execution of the software could suddenly produce a faulty output, or crash the program?
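To make the operand and opcode cases concrete, the following Python sketch flips single bits in one MSP430 instruction word and decodes the result. It assumes the standard MSP430 double-operand (Format I) encoding and handles only word-sized register-to-register instructions. The helper names `decode` and `flip` are illustrative, and the sketch models a fault in the instruction word only, not a fault in the decoder logic that would change the meaning of an opcode.

```python
# MSP430 Format I (double-operand) opcodes, held in bits 15..12 of the word.
OPCODES = {
    0x4: "MOV", 0x5: "ADD", 0x6: "ADDC", 0x7: "SUBC",
    0x8: "SUB", 0x9: "CMP", 0xA: "DADD", 0xB: "BIT",
    0xC: "BIC", 0xD: "BIS", 0xE: "XOR", 0xF: "AND",
}

def decode(word: int) -> str:
    """Decode a word-sized, register-to-register Format I instruction."""
    opcode = (word >> 12) & 0xF
    src = (word >> 8) & 0xF      # source register, bits 11..8
    dst = word & 0xF             # destination register, bits 3..0
    # Bits 7..4 (Ad, B/W, As) must be zero for register mode, word size.
    if opcode not in OPCODES or (word & 0x00F0):
        return f"<not handled by this sketch: 0x{word:04x}>"
    return f"{OPCODES[opcode]} R{src}, R{dst}"

def flip(word: int, bit: int) -> int:
    """Return the word with a single bit inverted."""
    return word ^ (1 << bit)

original = 0x4405  # MOV R4, R5
for bit in (0, 8, 12):
    print(f"bit {bit:2d}: {decode(original)} -> {decode(flip(original, bit))}")
```

Flipping bit 0 or bit 8 changes an operand register, while flipping bit 12 turns the `MOV` into an `ADD`. In each case the faulty word is still a valid instruction, so the program keeps running with a different meaning.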

CAPRI6: A chip that can track the Weird Machine

Capri6 Floorplan

CAPRI6 is a chip that aims to resolve the difficulty of tracking the weird machine. The chip contains a network of six identically configured MSP-430 microcontrollers that execute a software program in lockstep. Through an on-chip communication network, they verify in a distributed manner that all copies of a test program are correctly executing. Each core can raise a redundantly implemented exception, which will halt execution of all cores. We then scan out the full ASIC state through a scan chain. The key insight of CAPRI6 is to explain the root cause of a low-level hardware fault by redundantly detecting the fault in software and analyzing the post-fault hardware state at regular intervals between fault injection and fault detection. CAPRI6 is realized in a TSMC 180nm standard cell technology with on-chip memory, full-scan and an I2C programming interface. CAPRI6 also contains on-chip sensors to monitor the electrical integrity of the chip.
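The idea of analyzing the scanned-out state at regular intervals can be sketched in Python as follows. The sketch assumes that each scan dump is available as a string of '0' and '1' characters, one per flip-flop, and that a fault-free reference trace of the same program exists. The dump format, the function names and the tiny example traces are hypothetical and do not reflect the actual CAPRI6 tooling.

```python
from typing import Dict, List, Optional

def diverging_bits(reference: str, dump: str) -> List[int]:
    """Indices of flip-flops whose scanned-out value differs from the reference."""
    if len(reference) != len(dump):
        raise ValueError("scan dumps must have the same length")
    return [i for i, (r, d) in enumerate(zip(reference, dump)) if r != d]

def first_divergence(reference_trace: List[str], core_trace: List[str]) -> Optional[int]:
    """First snapshot index at which a core leaves the fault-free state, or None."""
    for step, (ref, dump) in enumerate(zip(reference_trace, core_trace)):
        if diverging_bits(ref, dump):
            return step
    return None

# Hypothetical 4-bit scan dumps taken at three intervals (illustrative values only).
reference_trace = ["0000", "0101", "0110"]
core_traces: Dict[str, List[str]] = {
    "core0": ["0000", "0101", "0110"],   # never diverges
    "core1": ["0000", "0111", "1110"],   # diverges at snapshot 1
    "core2": ["0000", "0101", "0100"],   # diverges at snapshot 2
}
report = {name: first_divergence(reference_trace, trace)
          for name, trace in core_traces.items()}
print(report)
```

Comparing against a fault-free reference, rather than taking a majority vote over the six cores, avoids ties when three cores disagree with the other three.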

The following figure is an example visualization of fault propagation into the flip-flops of the CAPRI6 cores. Clearly, not every core is affected in the same manner by a clock glitch. Some cores, such as the upper right core, fail early and disastrously, while others redundantly detect the fault and halt the system's execution.

Capri6 Fault Analysis

Further Reading

Z. Liu, D. Shanmugam and P. Schaumont, "FaultDetective: Explainable to a Fault, from the Design Layout to the Software," IACR Transactions on Cryptographic Hardware and Embedded Systems, 2024(4).