When using a vendor SDK or HAL, always check how callbacks are being run

At my job, I often work with embedded devices. Once, we had a particularly nasty problem: sensor measurements sometimes came back with unexpected values, debug logs got garbled, and there was the occasional crash, too.

Most of our code was unit tested on a desktop. The tests were run with sanitizers, and they reported no issues.

Still, the symptoms pointed to some kind of memory corruption. A threading issue, perhaps?

But the code was being run on a bare metal ARM Cortex-M4, which was single-threaded. No RTOS. How could there be a threading issue?

Well. Even if it’s single-threaded, it still has interrupts. And interrupts are essentially threads that have higher priority than your application code, so whenever they happen, the application code is paused and the interrupt code is run instead.
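To make that concrete, here is a small desktop simulation of what preemption can do to shared data. Everything in it is made up for illustration (the `sample_t` struct, `fake_sensor_isr`, the checksum scheme), and the "interrupt" is just a function call placed between two reads, which is where a real interrupt could fire on the target.

```c
#include <stdint.h>

/* Hypothetical sensor sample shared between an interrupt and the main loop. */
typedef struct {
    uint16_t raw;      /* raw reading */
    uint16_t checksum; /* expected to equal (uint16_t)~raw */
} sample_t;

static sample_t shared_sample = { 0x0000u, 0xFFFFu };

/* Stand-in for an interrupt handler: overwrites both fields. */
static void fake_sensor_isr(uint16_t new_raw)
{
    shared_sample.raw = new_raw;
    shared_sample.checksum = (uint16_t)~new_raw;
}

/* Application code copying the sample field by field. The explicit call in
 * the middle simulates an interrupt firing between the two reads. */
static sample_t read_sample_with_preemption(uint16_t raw_written_by_isr)
{
    sample_t copy;
    copy.raw = shared_sample.raw;
    fake_sensor_isr(raw_written_by_isr); /* "interrupt" happens here */
    copy.checksum = shared_sample.checksum;
    return copy;
}

/* Returns nonzero when raw and checksum belong to the same sample. */
static int sample_is_consistent(sample_t s)
{
    return s.checksum == (uint16_t)~s.raw;
}
```

The copy returned by `read_sample_with_preemption` mixes the old `raw` with the new `checksum`, so `sample_is_consistent` rejects it even though the shared struct itself is fine. Desktop unit tests never hit this, because nothing preempts the code there.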

It turned out that we were passing callbacks to the hardware vendor’s SDK that were being run directly inside an interrupt handler. That was not documented anywhere; I just stumbled upon the fact by reading the SDK’s source code.

To fix this, I just moved any problematic code out of the interrupt context, and that was it.
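One way such a fix can look, as a rough sketch rather than the code we actually shipped: the callback only drops the value into a small single-producer, single-consumer queue, and the main loop does the real work later. The callback name `on_sample_ready`, its signature, and the queue size are assumptions of mine, not anything from a real SDK.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_CAPACITY 8u /* power of two, so indices can wrap with a mask */

static uint16_t queue_slots[QUEUE_CAPACITY];
static atomic_uint queue_head; /* written only by the callback (producer) */
static atomic_uint queue_tail; /* written only by the main loop (consumer) */

/* Hypothetical callback handed to the vendor SDK; assumed to run inside an
 * interrupt handler. It only stores the value and returns. */
static void on_sample_ready(uint16_t value)
{
    unsigned head = atomic_load_explicit(&queue_head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&queue_tail, memory_order_acquire);
    if (head - tail == QUEUE_CAPACITY) {
        return; /* queue full: drop the sample */
    }
    queue_slots[head & (QUEUE_CAPACITY - 1u)] = value;
    /* Release: the slot write becomes visible before the new head does. */
    atomic_store_explicit(&queue_head, head + 1u, memory_order_release);
}

/* Called from the main loop; returns false when nothing is queued. */
static bool take_sample(uint16_t *out)
{
    unsigned tail = atomic_load_explicit(&queue_tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&queue_head, memory_order_acquire);
    if (head == tail) {
        return false;
    }
    *out = queue_slots[tail & (QUEUE_CAPACITY - 1u)];
    /* Release: the slot read is finished before the slot is handed back. */
    atomic_store_explicit(&queue_tail, tail + 1u, memory_order_release);
    return true;
}
```

The main loop then calls `take_sample` in a loop and does the logging, parsing, or whatever else used to run inside the callback. Dropping samples when the queue is full is just one possible policy; overwriting the oldest entry or counting overflows would work as well.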

From now on, I’m going to be paying a lot more attention when passing callbacks to a vendor’s code.

PS. The usual fix for something like this is to store any relevant data from the interrupt and set a flag, then check that flag and read the data from the application code later. Note that it’s not enough to slap volatile on the flag and call it a day: volatile does not prevent the compiler from reordering the flag access relative to other non-volatile accesses, and it does nothing about reordering or caching done by the CPU. Use atomics or a memory barrier.
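A minimal sketch of that flag pattern with C11 atomics, assuming a toolchain that provides `<stdatomic.h>`. The names `measurement_isr` and `poll_measurement` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static uint32_t latest_measurement;   /* plain data written by the interrupt */
static atomic_bool measurement_ready; /* flag that publishes the data */

/* Interrupt side (hypothetical handler name). */
static void measurement_isr(uint32_t value)
{
    latest_measurement = value;
    /* Release: the data write above cannot be reordered after this store. */
    atomic_store_explicit(&measurement_ready, true, memory_order_release);
}

/* Application side; returns false when no new measurement is available. */
static bool poll_measurement(uint32_t *out)
{
    /* Acquire exchange: tests and clears the flag in one atomic step, and
     * the data read below cannot be moved before it. */
    if (!atomic_exchange_explicit(&measurement_ready, false,
                                  memory_order_acquire)) {
        return false;
    }
    *out = latest_measurement;
    return true;
}
```

This only covers the ordering between the flag and the data. If a second interrupt can arrive while the application is still reading, the data can still be overwritten mid-read, so anything larger than a single word probably wants a queue or a short critical section instead.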
