Did cosmic rays break my Linux build?
01 Sep 2019

I think I experienced a random bit flip while updating Linux on one of my machines today. My laptop was humming along happily during compilation until GCC suddenly aborted with an error: `invalid preprocessing directive #lefine; did you mean #define?`

Huh, `#lefine` instead of `#define`?
Now, GCC is known for its cryptic error messages, but this one left me baffled. Things I tried to figure out what the issue could be:
- I first grepped the offending file, `drivers/gpu/drm/i915/i915_reg.h`, for `#lefine` to see if a typo had somehow slipped through the Linux 4.19.1 release, but got 0 matches.
- When I resumed the build, it completed without errors. The kernel also appeared to run fine upon reboot. I ended up recompiling it entirely multiple times to see if I could reproduce the error, but I couldn’t.
- I ran MemTest86 for 24 hours, but it didn’t find any errors, so faulty RAM is unlikely to be the culprit either.
I thought I was losing my mind until it dawned on me that this could have been caused by cosmic rays. Looking at an ASCII table, we see that ‘d’ and ‘l’ are indeed one bit flip away from each other:
char | decimal | binary |
---|---|---|
d | 100 | 1100100 |
l | 108 | 1101100 |
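
The table can also be checked in a few lines of Python. This is only a sketch to double-check the arithmetic, and the helper names `bit_difference` and `single_bit_neighbours` are made up for illustration. XOR-ing the two character codes leaves a 1 in every position where they differ, so a single set bit in the result means a single flipped bit.

```python
def bit_difference(a, b):
    """Return the bit positions (0 = least significant) where two characters differ."""
    diff = ord(a) ^ ord(b)  # XOR keeps a 1 wherever the bits disagree
    return [i for i in range(8) if (diff >> i) & 1]


def single_bit_neighbours(ch):
    """Return the printable characters reachable by flipping one of the 7 ASCII bits."""
    neighbours = []
    for i in range(7):
        candidate = chr(ord(ch) ^ (1 << i))  # flip bit i only
        if candidate.isprintable():
            neighbours.append(candidate)
    return neighbours


print(format(ord("d"), "07b"))   # 1100100
print(format(ord("l"), "07b"))   # 1101100
print(bit_difference("d", "l"))  # [3]
print("l" in single_bit_neighbours("d"))  # True
```

The two characters differ only in bit 3 (the bit worth 8), which matches 108 - 100 = 8 in the table.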
Could that be it? Isn’t that incredibly unlikely? There doesn’t appear to be a consensus in existing literature 1 2 3 on how often these errors actually occur, but I did learn that this phenomenon is actually more common than I thought. The chances of flipping a bit that would cause software to break in this fashion, on the other hand, must be incredibly small.
In the end, there’s just no way to prove that this was actually caused by cosmic rays. It could also have been some other extremely rare hardware fluke. Either way, this took me by surprise, to say the least.
Perhaps this is the cosmos’ way of telling me to invest in ECC RAM.