Error recovery in embedded systems
Tue Jan 22, 2019

Consider a typical single-purpose embedded system, likely running on a microcontroller.

What happens when we detect a fault condition?

Live recovery

Live recovery attempts to handle the fault using normal program control flow (i.e. without resets). This typically leads to deep, highly nested code, as each check needs its own recovery branch.

Reset recovery

Reset recovery uses the reset mechanism to clear bad state. This is performed by the PANIC call. Even though we are resetting, we must not lose track of the fact that we had an unhandled condition. This could be logged relatively easily by copying a message into memory that survives the reset and outputting it at startup.

Why is this a better option?

What does our PANIC handler look like?

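#include <stdbool.h>
#include <stdint.h>
#include <string.h>

void PANIC( const char* message,
            uint32_t messageLength)
{
    // Don't let anything else run.
    disableInterrupts();

    // Log message to be output at startup. Clamp the
    // length so an oversized message cannot overrun the
    // buffer (assumes noInitMsg is a char array that
    // survives reset).
    if( messageLength >= sizeof(noInitMsg) )
    {
        messageLength = sizeof(noInitMsg) - 1;
    }
    memcpy(noInitMsg, message, messageLength);
    noInitMsg[messageLength] = '\0';

    // volatile stops the compiler from optimising out
    // the return path; a debugger can set the flag to
    // leave the loop.
    volatile bool   resumeFlag  = false;

    // Sit here waiting for user attention.
    // - In release mode, the watchdog will kick
    //   in here.
    // - In debug mode, we can attach with the
    //   debugger.
    while( resumeFlag == false )
    {
        flashLED();
    }
}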

The intention is that the above code will be called from any assert failure or exception handler where live recovery is not an option.

How do we persist data across a reset?
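
One common approach, sketched here as an assumption rather than as the method this article uses, is to keep a small record in RAM that the startup code does not clear, guarded by a magic number and a checksum so that random power-on contents are not mistaken for a message. The names PanicRecord, panicRecordStore and panicRecordLoad are hypothetical, as is the ".noinit" section mentioned in the comment, which would need support from your toolchain and linker script.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PANIC_MAGIC   0x50414E43u   /* arbitrary marker value ("PANC") */
#define PANIC_MSG_MAX 64u           /* hypothetical buffer size */

typedef struct {
    uint32_t magic;                  /* set only when a message is stored */
    uint32_t checksum;               /* guards against random RAM contents */
    char     message[PANIC_MSG_MAX];
} PanicRecord;

/* On target this would live in a section the startup code does not zero,
 * e.g. with GCC: __attribute__((section(".noinit"))) (assumption).
 * On a host build it is an ordinary static so the sketch can run. */
static PanicRecord panicRecord;

/* Simple multiplicative checksum over the whole message buffer. */
static uint32_t panicChecksum(const char *msg)
{
    uint32_t sum = 0u;
    for (size_t i = 0u; i < PANIC_MSG_MAX; i++) {
        sum = (sum * 31u) + (uint8_t)msg[i];
    }
    return sum;
}

/* Called just before the reset: save a truncated, terminated copy. */
void panicRecordStore(const char *message)
{
    memset(panicRecord.message, 0, PANIC_MSG_MAX);
    strncpy(panicRecord.message, message, PANIC_MSG_MAX - 1u);
    panicRecord.checksum = panicChecksum(panicRecord.message);
    panicRecord.magic    = PANIC_MAGIC;
}

/* Called at startup: returns true and copies the message out if a valid
 * record is present. The record is consumed so it is reported only once. */
bool panicRecordLoad(char *out, size_t outSize)
{
    bool valid = (panicRecord.magic == PANIC_MAGIC) &&
                 (panicRecord.checksum == panicChecksum(panicRecord.message));
    if (valid && outSize > 0u) {
        strncpy(out, panicRecord.message, outSize - 1u);
        out[outSize - 1u] = '\0';
    }
    panicRecord.magic = 0u;
    return valid;
}
```

At startup, the application would call panicRecordLoad() before anything else and, if it returns true, output the message over whatever log channel it has. Clearing the magic value after reading means a later, unrelated reset does not report the same fault again.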

Benefits

Normalisation of the reset path seems antithetical to good software development practice, but in fact it is not. There are numerous benefits:

What we’re saying is that we treat reset as a normal part of program control flow. We can perform resets as a result of:

How do we normalise the reset-path?

Treating reset as a normal code-path means making everything robust. This particularly means:

This is not extra work. It was always required, but it was ignored because the reset path was unused (most of the time).

Debugging a fault

Effect on the codebase

Live recovery

bool myFoo(int state)
{
    if(state > 0)
    {
        if(state < 100)
        {
            if(theData != NULL)
            {
                if(theCallback != NULL)
                {
                    theCallback( theData );
                    return true;
                }
                else
                {
                    maybeCallADefault();
                    return false;
                }
            }
            else
            {
                tryAndRecover();
                return false;
            }
        }
        else
        {
            someRecoveryAction();
            return false;
        }
    }
    else
    {
        anotherRecoveryAction();
        return false;
    }
}

Reset recovery

void myFoo(int state)
{
    // Same valid range as the live recovery version: 0 < state < 100.
    PANIC_IF( state <= 0, "state out-of-range!" );
    PANIC_IF( state >= 100, "input out-of-range" );
    PANIC_IF( theData == NULL, "data not set!" );
    PANIC_IF( theCallback == NULL, "fn not set" );

    theCallback( theData );
}
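
PANIC_IF is used above without a definition. A minimal sketch of how it might wrap the PANIC handler is shown here; the macro body is an assumption, and the PANIC function in this block is a host-only stand-in that just records the call so the sketch can be compiled and run off-target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Host-only stand-in for the real handler (assumption): it records the
 * call instead of disabling interrupts and waiting for the watchdog. */
static const char *lastPanicMessage = NULL;
static unsigned    panicCount       = 0u;

void PANIC(const char *message, uint32_t messageLength)
{
    (void)messageLength;
    lastPanicMessage = message;
    panicCount++;
}

/* Hypothetical definition: panic with the given message when the
 * condition is true. The do/while(0) wrapper makes the macro behave
 * like a single statement, so it is safe inside if/else without braces.
 * The length includes the terminating NUL. */
#define PANIC_IF(condition, message)                                   \
    do {                                                               \
        if (condition) {                                               \
            PANIC((message), (uint32_t)(strlen(message) + 1u));        \
        }                                                              \
    } while (0)
```

On the target, PANIC never returns, so each PANIC_IF line acts as a guard: code after it can rely on the condition being false.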

How do we make this testable?

We need to be able to test these paths, so how can we do this?
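
One possible approach for host-side unit tests, offered as an assumption rather than as the technique this article goes on to describe, is to replace PANIC with a test double that records the message and uses longjmp to return to the test. The code under test then never runs past the panic, just as on the target. The helper myFooPanics and the cut-down myFoo below are hypothetical.

```c
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static jmp_buf     panicJump;      /* where the test double jumps back to */
static const char *panicMessage;   /* message from the most recent panic  */

/* Test double: record the message, then unwind to the waiting test. */
void PANIC(const char *message, uint32_t messageLength)
{
    (void)messageLength;
    panicMessage = message;
    longjmp(panicJump, 1);
}

/* Assumed shape of PANIC_IF for this sketch. */
#define PANIC_IF(cond, msg) \
    do { if (cond) { PANIC((msg), (uint32_t)strlen(msg)); } } while (0)

/* Cut-down version of the reset recovery myFoo, used as code under test. */
static int   callCount = 0;
static void *theData   = NULL;
static void (*theCallback)(void *) = NULL;

static void countCall(void *data) { (void)data; callCount++; }

void myFoo(int state)
{
    PANIC_IF(state >= 100, "input out-of-range");
    PANIC_IF(theCallback == NULL, "fn not set");

    theCallback(theData);
}

/* Returns true if myFoo(state) panicked, false if it ran to completion. */
bool myFooPanics(int state)
{
    panicMessage = NULL;
    if (setjmp(panicJump) == 0) {
        myFoo(state);
        return false;
    }
    return true;   /* reached via longjmp from the PANIC test double */
}
```

A test can then assert both that a bad input panics and which message was recorded, for example that myFooPanics(100) returns true with panicMessage set to "input out-of-range". The reset itself still needs testing on real hardware; this only covers the decision to panic.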

Reset is your friend!

