FPGA development is not just about writing HDL code; it also requires understanding the particular FPGA technology, how the synthesis and placement tools interact with the HDL, and the hardware itself.
In this article, we will talk about the use of FPGAs in mission- and safety-critical designs, the issues involved, and mitigation techniques.
Traditional FPGA Design Flow
Behavioural Simulation
In the traditional FPGA design flow, the first step after writing the HDL code is to simulate the design and correct any errors in its behaviour. This step is of great importance, as the logical functionality of the algorithm can be tested and corrected at this point.
A test bench can not only test the regular functionality but also help in identifying and resolving corner cases that cannot be easily exercised in the formal testing procedure.
Hence, it is worth spending a decent amount of time on writing test benches and verifying the behaviour of the design.
Design verification and test benching are now even possible with high-level languages like Python or C. I will cover this in one of the upcoming articles.
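As a rough illustration of corner-case testing, here is a minimal Python sketch of a test bench for a behavioural model. The 4-bit saturating adder, the names sat_add and run_testbench, and the chosen corner cases are all assumptions made up for this example; they do not come from any particular design or verification framework.

```python
# Hypothetical behavioural model of a 4-bit saturating adder (illustration only).
WIDTH = 4
MAX_VAL = (1 << WIDTH) - 1  # largest value the 4-bit output can hold


def sat_add(a: int, b: int) -> int:
    """Model of the design under test: add two values and clamp at MAX_VAL."""
    total = a + b
    return MAX_VAL if total > MAX_VAL else total


def run_testbench() -> int:
    """Compare the model with an independent reference for every input pair.

    Returns the number of vectors checked; raises AssertionError on a mismatch.
    """
    checked = 0
    for a in range(MAX_VAL + 1):
        for b in range(MAX_VAL + 1):
            expected = min(a + b, MAX_VAL)  # independent reference computation
            assert sat_add(a, b) == expected, (a, b)
            checked += 1
    return checked


# Corner cases worth naming explicitly: zero, just below overflow, and overflow.
CORNER_CASES = [(0, 0), (MAX_VAL, 0), (MAX_VAL, 1), (MAX_VAL, MAX_VAL)]

if __name__ == "__main__":
    print("vectors checked:", run_testbench())
    for a, b in CORNER_CASES:
        print(a, "+", b, "->", sat_add(a, b))
```

Exhaustive checking is only feasible here because the input space is tiny; for wider data paths the named corner cases plus randomized vectors would be the more realistic choice.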
Synthesis
The next step is to convert the HDL code into a gate-level representation; the synthesizer performs this job. This step is not as simple as it looks. As part of the process, and depending on its configuration, the synthesizer optimizes the design for timing and area, extracts Finite State Machines (FSMs) and re-implements them with a different encoding scheme (e.g. one-hot), removes duplicate registers and instances with unused outputs, and removes non-synthesizable constructs. During this process, nets might be collapsed and hierarchies dissolved or reconstructed.
This whole process might change the functionality of the design (the chances increase when the code is not hardware-aware).
Functional Verification
In short, in this step the synthesized netlist is simulated to check whether the design behaves as it should. If there are failures, it is time to think, analyze, and go back to change the synthesis switches or the HDL logic.
Design Implementation
In this stage, the synthesized netlist, guided by the constraint files, goes through several processes including logic optimization, power optimization, mapping the netlist to technology-specific elements, and finally routing the implemented design. This is an iterative process and can take time depending on the logic and the constraints.
Post Implementation Timing & Design Simulation
Design Simulation
After the design is implemented, the final design is simulated once again. One of the benefits of debugging the post-implementation design is that the tool has access to a timing-accurate model of the design.
Post-implementation design simulation will generally have longer run times and the results may vary depending on the system model accuracy.
Timing Simulation
Post-implementation timing simulation ensures that the implemented design meets its functional and timing requirements and has the expected behavior in the device. Timing simulation is the closest emulation to actually downloading a design to a device.
Post-implementation simulations confirm that timing has been closed and that the functional behaviour has not been altered by the constraints or synthesis properties. They can also help identify issues in the timing constraints or switches.
The next steps include bitstream generation for the particular FPGA and, finally, verification of the design on the chip (although not all cases can be covered there).
Safety Critical FPGA Design Flow
Safety-critical design involves detecting faults and either correcting them or taking appropriate measures such as a safe system shutdown or reset. Some faults may break connections, physically damage the chip, leave a signal stuck at zero or one, or change component values. These are classified as permanent faults, and if recovery is not possible, the best way to ensure safety is to shut the system down safely. Faults such as bit flips, glitches, or transients, on the other hand, are classified as temporary faults and are generally easier to correct.
In a mission/safety-critical FPGA design, the steps are broadly similar to the traditional flow, but some special procedures are followed to meet the safety criteria and to fix introduced errors.
Below, we discuss the steps we can take for safety-critical designs.
Maintain Critical Logic
During the synthesis process, the tool by default optimizes the design for area and power. This process might remove duplicate registers and redundant logic that were deliberately created for reliability purposes. We can use synthesizer switches to control the level of optimization, and can even turn off optimization for certain code blocks, e.g. error-correction circuits and redundant registers.
There are tools available that take the Golden and Revised designs as input and check whether the two produce the same outputs. These tools can be used effectively to check the effect of synthesis attributes on the algorithm.
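The idea behind comparing a Golden and a Revised design can be sketched in a few lines of Python. This is only a brute-force illustration under stated assumptions: the functions golden, revised, and broken are hypothetical 3-input combinational blocks invented for this example, and real equivalence checkers rely on formal methods rather than enumerating every input.

```python
from itertools import product


def golden(a: int, b: int, c: int) -> int:
    """Reference logic (hypothetical): majority of three bits."""
    return (a & b) | (b & c) | (a & c)


def revised(a: int, b: int, c: int) -> int:
    """Restructured logic that is expected to behave the same as golden."""
    return (a & (b | c)) | (b & c)


def broken(a: int, b: int, c: int) -> int:
    """A 'revision' in which an optimization dropped two product terms."""
    return a & b


def equivalent(f, g, n_inputs: int) -> bool:
    """Return True if f and g agree on every combination of n_inputs bits."""
    return all(
        f(*bits) == g(*bits) for bits in product((0, 1), repeat=n_inputs)
    )


if __name__ == "__main__":
    print("golden vs revised:", equivalent(golden, revised, 3))
    print("golden vs broken: ", equivalent(golden, broken, 3))
```

Enumeration grows exponentially with the number of inputs, so this only works for very small blocks; it is meant to show what "same output for the same input" means, not how a production tool does it.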
Safe FSMs
State machine compilers are included in most modern synthesizers. They extract the FSMs from the logic and optimize them, which includes changing the encoding scheme and removing unreachable states; this can be a threat to safety. The developer should use attributes to turn off FSM optimization, preserve the tested logic, and choose the encoding scheme manually.
Triple Modular Redundancy (TMR)
In TMR, the tool automatically creates three instances of the state machine, places them at different locations on the chip, and adds voting logic; the states are compared and voted on in each cycle. This is a very important concept: the system remains operational if any one of the copies gets corrupted.
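The voting step can be modelled with a bitwise majority function. The following Python sketch is a software model only, not HDL and not what any particular tool generates; the names majority_vote and copies_disagree and the example state values are assumptions for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three redundant state registers."""
    return (a & b) | (b & c) | (a & c)


def copies_disagree(a: int, b: int, c: int) -> bool:
    """True if at least one copy differs, i.e. a fault should be flagged."""
    return not (a == b == c)


if __name__ == "__main__":
    good = 0b0101       # value held by the two healthy copies
    corrupted = 0b0111  # one copy with a single bit flipped
    voted = majority_vote(good, corrupted, good)
    print(format(voted, "04b"), copies_disagree(good, corrupted, good))
```

Because the vote is taken per bit, the voted value stays correct as long as no bit position is wrong in more than one copy at the same time.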
Hamming Three Encoding
This approach is suitable for handling failures caused by single-bit flips. The states are encoded with a Hamming distance of three between them, and any code that is one bit away from a valid state is treated as that state. Hence, the system keeps operating normally if a single-bit flip is introduced.
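A minimal Python model of this behaviour is shown below. The four state names and their 5-bit codes are hypothetical; they were chosen only so that every pair of codes differs in at least three bits. The function correct is likewise an illustrative name, not part of any tool.

```python
# Hypothetical 5-bit state codes with a minimum Hamming distance of three.
STATE_CODES = {
    "IDLE": 0b00000,
    "LOAD": 0b00111,
    "RUN": 0b11001,
    "DONE": 0b11110,
}


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def correct(register_value: int):
    """Map a possibly corrupted register value back to a state name.

    A value at distance 0 or 1 from a valid code is treated as that state.
    Returns None when no valid code is that close (more than one bit flipped).
    """
    for name, code in STATE_CODES.items():
        if hamming(register_value, code) <= 1:
            return name
    return None


if __name__ == "__main__":
    flipped = STATE_CODES["RUN"] ^ 0b00100  # flip one bit of the RUN code
    print(correct(flipped))
```

With a minimum distance of three, a value can be within one bit of at most one valid code, so the lookup is unambiguous.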
Hamming Two Encoding
In this approach, the states of the FSM are encoded with a Hamming distance of two between them. A single-bit flip then always produces an invalid code, which can be detected, for example with a logical XOR-based check on the state bits, and appropriate action can be taken to keep the system safe.
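One way to picture the XOR check is with an even-parity encoding, sketched here in Python. The assumption is that every valid state code contains an even number of ones, which gives a minimum distance of two; the state names, the 3-bit codes, and the function names are hypothetical.

```python
from functools import reduce
from operator import xor

WIDTH = 3  # number of state bits in this hypothetical encoding

# Hypothetical state codes, each with an even number of ones (distance >= 2).
STATE_CODES = {
    "IDLE": 0b000,
    "LOAD": 0b011,
    "RUN": 0b101,
    "DONE": 0b110,
}


def parity(value: int) -> int:
    """XOR of all state bits: 0 for a valid code, 1 after a single-bit flip."""
    return reduce(xor, ((value >> i) & 1 for i in range(WIDTH)))


def single_bit_error(value: int) -> bool:
    """True when the register holds a code with odd parity (an illegal value)."""
    return parity(value) == 1


if __name__ == "__main__":
    flipped = STATE_CODES["LOAD"] ^ 0b100  # flip one bit of the LOAD code
    print(single_bit_error(STATE_CODES["LOAD"]), single_bit_error(flipped))
```

Unlike the distance-three scheme, this detects the flip but cannot tell which state was intended, so the reaction is typically a move to a safe state or a reset.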
Default Case
FSMs should contain a default case that handles the situation where the state register gets corrupted to an illegal value. It should also be ensured that the tool does not remove this logic.
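The role of the default branch can be modelled in Python as follows. The three-state machine, its 2-bit codes, the inputs start and finished, and the choice of IDLE as the safe state are all assumptions for this sketch; in HDL the same role is played by the default (or "when others") branch of the case statement.

```python
# Hypothetical 2-bit state encoding; the code 0b11 is unused and therefore illegal.
IDLE, RUN, DONE = 0b00, 0b01, 0b10
SAFE_STATE = IDLE  # assumed recovery state for this example


def next_state(state: int, start: bool, finished: bool) -> int:
    """Next-state function with an explicit default branch."""
    if state == IDLE:
        return RUN if start else IDLE
    elif state == RUN:
        return DONE if finished else RUN
    elif state == DONE:
        return IDLE
    else:
        # Default case: the register holds an illegal value (e.g. after a
        # bit flip), so force the machine back to a known safe state.
        return SAFE_STATE


if __name__ == "__main__":
    corrupted = 0b11
    print(next_state(corrupted, start=True, finished=False) == SAFE_STATE)
```

Without the final branch, the behaviour for the illegal code would be left to whatever logic the tool happens to generate.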
Isolating Algorithms
The algorithm can be isolated from any part of the design with which it does not communicate, i.e. give the algorithm its own environment and define that environment together with the algorithm. In this way, a change in any other block will not affect your design.
In-Circuit Verification
Special HDL logic (tested and vetted) can be built into the FPGA for testing the critical circuits. This logic can be a simple pre-defined test at boot-up, or a serial interface that takes input from a serial terminal to execute user-defined tests.
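The pre-defined boot-up test can be thought of as a fixed table of input vectors and expected outputs that is replayed against the critical circuit. The Python sketch below models that idea in software; the threshold comparator, the stuck-at-one fault model, and the names self_test and BOOT_VECTORS are hypothetical.

```python
from typing import Callable, Sequence, Tuple

# One test vector: (inputs applied to the circuit, expected output).
Vector = Tuple[Tuple[int, ...], int]


def self_test(circuit: Callable[..., int], vectors: Sequence[Vector]) -> bool:
    """Replay every pre-defined vector; True only if all outputs match."""
    return all(circuit(*inputs) == expected for inputs, expected in vectors)


def threshold_check(value: int, limit: int) -> int:
    """Hypothetical critical circuit: outputs 1 when value reaches the limit."""
    return int(value >= limit)


def stuck_at_one(value: int, limit: int) -> int:
    """Fault model: the circuit output is permanently stuck at 1."""
    return 1


# Hypothetical boot-time vectors covering below, at, and above the limit.
BOOT_VECTORS = [((3, 5), 0), ((5, 5), 1), ((7, 5), 1)]

if __name__ == "__main__":
    print("healthy:", self_test(threshold_check, BOOT_VECTORS))
    print("faulty: ", self_test(stuck_at_one, BOOT_VECTORS))
```

The vectors need to exercise both output values; a table that only expects 1 would not catch the stuck-at-one fault shown here.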
Clock Domain Synchronization
After converting the requirements to RTL, the developer should make sure that clock domains are properly handled. Ignoring clock domain crossings can be catastrophic; more details can be found in the previous article about Metastability & Clock Domains in FPGA.
The techniques discussed above for safe state machines increase the QoR but require more effort and/or area than the traditional approach. Nothing is free in this world, and safety in particular is costly.
Feel free to reach out:
LinkedIn: linkedin.com/in/muhammad-hamza-muneer-39663..
e-Mail: hamzamuneer95@yahoo.com