Get your app to Mars!
This is a blog post about firmware updates, and I was inspired to write it by the news that NASA’s Curiosity rover on Mars has received an OTA update. The firmware image was about 21MB and took 11 days to send over the air (or in this case, over the vacuum: Mars is currently 242 million kilometres from Earth).
NASA has long included OTA update capabilities in space missions: the Voyager probes were upgraded with new algorithms long after they launched, including newly-invented bit encoders to increase bandwidth.
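As a rough back-of-envelope check on those figures, here is the arithmetic. It assumes that 21MB means 21 million bytes and that the 11 days are treated as one continuous transfer, neither of which the news reports spell out.

```python
# Back-of-envelope figures only; the inputs are the numbers quoted above.
IMAGE_BYTES = 21_000_000            # assumption: 21MB = 21 million bytes
TRANSFER_DAYS = 11
DISTANCE_KM = 242_000_000
SPEED_OF_LIGHT_KM_S = 299_792.458

transfer_seconds = TRANSFER_DAYS * 24 * 60 * 60
effective_bits_per_second = IMAGE_BYTES * 8 / transfer_seconds
one_way_delay_minutes = DISTANCE_KM / SPEED_OF_LIGHT_KM_S / 60

print(f"Average rate over the period: {effective_bits_per_second:.0f} bit/s")
print(f"One-way signal delay: {one_way_delay_minutes:.1f} minutes")
```

This is an average over the whole period. The link was presumably not in continuous use, so the rate during actual transmission would have been higher.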
You’d think that a spaceship millions of miles from Earth would be the most challenging vehicle for firmware updates - if anything goes wrong you can’t send anyone to fix it. But it turns out that there are even more challenging vehicles to update with new firmware: cars.
Cars are particularly challenging for a simple reason: there are lots of them.
Automotive software is very different from desktop PC software or apps in phones: it’s embedded distributed hard real-time control software. What this means is that multiple electronic control units (ECUs) run software that executes control algorithms hundreds of times per second, and that software has to run on-time, every time. That’s very different from mainstream computing, where the user interface juddering a bit, or a video playback stuttering sometimes, is no big deal. If ECU software fails to run on-time then there are real world consequences: the software is controlling two tons of speeding metal in an environment shared with people.
The reason that having lots of cars is a particular challenge is that the ‘flying hours’ racked up by software in car platforms are immense. For example, Toyota’s TNGA-K platform alone has software that runs for literally hundreds of billions of hours a year (12 models and millions of cars each containing hundreds of CPUs). For comparison, the Boeing 737 has racked up a few hundred million flying hours over its entire existence since 1967.
This huge software run-time means that Murphy’s Law applies: “If it can go wrong, then within billions of hours of run-time, it will go wrong”. Testing can never come close to those billions of hours, which is why design-time analysis has to be used to be sure ECU software is correct (for example, techniques like WCET Analysis).
Changes to ECU firmware have to go through extensive validation before being released, and this means that it will always take time to ready a firmware update for a vehicle, particularly if that firmware update affects multiple ECUs: cars are not controlled by a single computer but by multiple ECUs connected together via in-vehicle networks (typically CAN bus). These are not normal computer networks: they carry control messages between ECUs and have to deliver them within tight deadlines, and these have to be guaranteed before deployment (as for WCET analysis, there are scheduling techniques that allow these guarantees to be made).
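To give a flavour of what such a guarantee looks like, here is a minimal sketch of a fixed-priority response-time calculation for frames on a CAN bus. It is deliberately simplified: it ignores queuing jitter and error frames, and it checks only the first instance of each frame rather than every instance in a busy period, which a full analysis has to do. The frame names, periods, and transmission times are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    name: str
    priority: int     # lower number wins arbitration, as with CAN identifiers
    period_us: int    # assumed minimum interval between queuings
    tx_time_us: int   # assumed worst-case transmission time on the bus
    deadline_us: int


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def worst_case_response_us(frame, frames, bit_time_us):
    """Worst-case time from queuing to delivery, or None if the deadline is missed."""
    higher = [f for f in frames if f.priority < frame.priority]
    lower = [f for f in frames if f.priority > frame.priority]
    # A lower-priority frame that has just started cannot be pre-empted.
    blocking = max((f.tx_time_us for f in lower), default=0)
    wait = blocking
    while True:
        # Higher-priority frames queued while we wait all go first.
        interference = sum(
            ceil_div(wait + bit_time_us, f.period_us) * f.tx_time_us
            for f in higher
        )
        new_wait = blocking + interference
        if new_wait + frame.tx_time_us > frame.deadline_us:
            return None
        if new_wait == wait:  # fixed point reached
            return wait + frame.tx_time_us
        wait = new_wait


# Hypothetical message set on a 500 kbit/s bus (2 microseconds per bit).
FRAMES = [
    Frame("brake_status", 1, 5_000, 270, 5_000),
    Frame("wheel_speed", 2, 10_000, 270, 10_000),
    Frame("door_status", 3, 20_000, 270, 20_000),
]
BIT_TIME_US = 2

for f in FRAMES:
    print(f.name, worst_case_response_us(f, FRAMES, BIT_TIME_US))
```

The point is that the worst case is computed from the configuration rather than observed in testing. Change one frame’s period or priority and the numbers for every other frame may change too, which is why a network change ripples through validation.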
All this means that it takes time to validate a new software release for a car, which might change a network configuration and the firmware on dozens of ECUs. It’s just not feasible to take a couple of days to change firmware in an ECU and push it out to all the cars.
Programming flash memory
ECU firmware is stored in flash memory (typically in the same chip as the CPU). The process of programming firmware into the flash memory of an ECU is not trivial. Flash memory works by storing electrical charge in tiny cells in silicon, and programming requires filling cells with charge. It can take quite a long time to do this: erasing a block of flash memory can take a tenth of a second. That doesn’t sound like a lot, but there can be thousands of blocks to erase to write new firmware. Then once erased, the firmware has to be written into the memory. This can take minutes to do, and if it’s programmed at end of line (i.e. in the factory after the car has been assembled) then it’s quite possible for the flash programming of a whole car to take longer than the interval at which cars are coming off the line. Managing this is quite a logistical challenge.
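To make the logistics concrete, here is a small worked example. Only the tenth of a second per block comes from the text above; the block count, write time, and line interval are assumptions picked for illustration.

```python
ERASE_MS_PER_BLOCK = 100     # from the text: about a tenth of a second per block
BLOCKS_TO_ERASE = 2_000      # assumption: "thousands of blocks"
WRITE_MS = 120_000           # assumption: two minutes to write the image
LINE_INTERVAL_MS = 60_000    # assumption: one car leaves the line every minute

erase_ms = ERASE_MS_PER_BLOCK * BLOCKS_TO_ERASE
total_ms = erase_ms + WRITE_MS

# Number of cars that must be in programming at once to keep pace with the line
# (ceiling division on integers).
cars_in_programming = -(-total_ms // LINE_INTERVAL_MS)

print(f"Erase: {erase_ms / 1000:.0f} s, total: {total_ms / 1000:.0f} s")
print(f"Cars being programmed concurrently: {cars_in_programming}")
```

With these made-up numbers, programming takes over five minutes per car, so several cars have to be in programming at the same time just to keep up with the line.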
More important than programming time is how the programming can fail. For example, if the algorithm for programming the flash doesn’t match the physical characteristics of the flash memory properly (e.g. not leaving long enough for the flash cells to be fully charged) then the flash memory can fail to hold its contents properly. When it fails quickly, then this can be picked up in verification during programming (this is still catastrophic: if it takes just a day to fix the flash programming there can be thousands of assembled cars stacked up waiting to be fixed, and that might take weeks). But it might fail slowly enough to pass verification at end-of-line and then fail later: cars being driven by customers would just stop working as their ECUs failed.
The flash bootloader
Flash programming takes place via a flash bootloader: a small piece of software that starts up and sees if there is new firmware to be programmed. It receives new firmware over a communications link and writes that firmware into the flash memory. At the end it computes checksums so that the programming tool on the other end of the communications link can verify that the firmware has been written to flash correctly. This is how it works in pretty much any device: robot vacuum cleaners, smart lightbulbs, phones, and cars. There are lots of ways this can go wrong. And when it does, it’s often unrecoverable: the device is as useful as a brick (hence the term bricked). Implementing a flash bootloader properly is very difficult, and so shortcuts are taken. The first shortcut is the flash bootloader being part of the firmware itself.
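The receive, write, and verify loop can be sketched in a few lines. This is a simulation rather than real bootloader code: the flash is a byte array, the block size is hypothetical, and CRC-32 stands in for whatever checksum a particular bootloader actually uses.

```python
import zlib

BLOCK_SIZE = 256  # hypothetical programming block size in bytes


def program_image(flash: bytearray, image: bytes) -> None:
    """Write the received image into the simulated flash, one block at a time."""
    for offset in range(0, len(image), BLOCK_SIZE):
        chunk = image[offset:offset + BLOCK_SIZE]
        flash[offset:offset + len(chunk)] = chunk


def flash_checksum(flash: bytearray, length: int) -> int:
    """Checksum what is actually stored in flash, not the received buffer."""
    return zlib.crc32(bytes(flash[:length]))


def tool_accepts(image: bytes, reported_checksum: int) -> bool:
    """The programming tool compares against its own copy of the image."""
    return zlib.crc32(image) == reported_checksum


image = bytes(range(256)) * 4
flash = bytearray(b"\xff" * 2048)      # erased flash reads as all ones

program_image(flash, image)
good = tool_accepts(image, flash_checksum(flash, len(image)))

flash[10] ^= 0x01                      # simulate a cell that did not hold its value
bad = tool_accepts(image, flash_checksum(flash, len(image)))

print(good, bad)
```

The checksum is computed by reading back from flash, because the aim is to catch cells that did not take the value written to them, not just errors on the communications link.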
If the firmware includes its own bootloader, then what happens if the flash programming fails part way through? There won’t be all the firmware in place, and that means the bootloader in that firmware might not start. The programming can’t be repeated and the device is now bricked. This is why so many consumer products have messages like “Do not unplug this device!” because if the power fails or the link to the programming tool is lost then the device is bricked. There are ways to address this, but the mainstream tech industry is dominated by the minimum viable product (MVP) philosophy: don’t spend time on issues if it delays shipping a viable product.
A way to address the bricking problem during flash programming is for the bootloader to be put into its own special area of memory that can’t be erased. Then if there’s a problem, the flash programming can be re-started. But this raises another issue: what if the bootloader needs to be changed? If the bootloader is sufficiently complex then it becomes likely that there will be bugs or it will need to be changed to adapt to something else. For example, if the flash bootloader needs to communicate with a server on the internet to pick up the firmware image, then it needs a TCP/IP and SSL software stack. And those can contain security vulnerabilities.
I was in the team that designed the bootloader for Volvo’s first car platform to use flash memory everywhere (the P2X platform), and we spent a lot of time worrying about the ways in which the system could fail catastrophically. One of the things we worried about was a bug in the firmware that could cause the firmware to jump into a flash erase function and so erase itself. If that was tripped by a distance-driven or date bug then you could have seen Volvo cars all over the world slowly come to a halt as they bricked themselves. Truly catastrophic. And this isn’t something we were alone in worrying about: chips now implement flash programming hardware that has an ‘unlock’ code: flash erase cannot be triggered until the unlock code is written into the memory, and a bug would be unlikely to write that code by accident. Unfortunately, this just moves the problem to the flash programming software that writes the unlock code into the hardware: a software crash that accidentally calls this function will still erase the flash.
What we designed in the end was a two-stage bootloader: the primary bootloader in protected memory was minimal code to communicate over CAN bus to a programming tool and write data to RAM. The programming tool then wrote a secondary bootloader, containing flash programming code, into RAM and then this was used to program the firmware. At the end of the process, the ECU restarted and the flash programming code was wiped from RAM, so it was never there during operation of the car. If the programming failed (perhaps due to a power failure, or the programming tool crashing) then it was always possible to re-start the flash programming: in the very worst case, the car battery had to be disconnected to completely reset the ECU into a clean state where the bootloader would always run.
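Here is a toy model of those two ideas together: erase hardware that refuses to act without an unlock code, and flash programming code that exists only in RAM during an update. It is an illustration of the principle and not the actual Volvo design. Every name and the unlock value are hypothetical, and for brevity only erase is gated by the unlock code.

```python
class FlashLocked(Exception):
    """Raised when an erase is requested without the unlock code."""


class FlashController:
    UNLOCK_CODE = 0x5AA5C33C  # hypothetical value, not from any real chip

    def __init__(self, firmware: bytes):
        self.cells = bytearray(firmware)
        self._unlocked = False

    def write_unlock_register(self, value: int) -> None:
        self._unlocked = value == self.UNLOCK_CODE

    def erase(self) -> None:
        if not self._unlocked:
            raise FlashLocked("erase requested without unlock code")
        self.cells[:] = b"\xff" * len(self.cells)
        self._unlocked = False  # lock again after every erase

    def program(self, image: bytes) -> None:
        self.cells[:len(image)] = image


def secondary_bootloader(flash: FlashController, image: bytes) -> None:
    """The only code that knows the unlock code; downloaded into RAM by the tool."""
    flash.write_unlock_register(FlashController.UNLOCK_CODE)
    flash.erase()
    flash.program(image)


class Ecu:
    def __init__(self, firmware: bytes):
        self.flash = FlashController(firmware)
        self.ram_routine = None  # no flash programming code in normal operation

    def download_to_ram(self, routine) -> None:
        """What the minimal primary bootloader does: receive code, put it in RAM."""
        self.ram_routine = routine

    def run_ram_routine(self, image: bytes) -> None:
        if self.ram_routine is None:
            raise RuntimeError("no flash programming code in RAM")
        self.ram_routine(self.flash, image)

    def reset(self) -> None:
        self.ram_routine = None  # RAM contents are gone after a restart


ecu = Ecu(b"old firmware")
ecu.download_to_ram(secondary_bootloader)
ecu.run_ram_routine(b"new firmware")
ecu.reset()
print(bytes(ecu.flash.cells), ecu.ram_routine)
```

After the reset there is nothing left on the ECU that knows how to unlock and erase the flash, so a stray jump in the application firmware has nowhere dangerous to land.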
OTA firmware updating
Firmware updates to ECUs in cars have traditionally been done in garage workshops by mechanics using programming tools that connect to the car’s CAN buses via the OBD-II diagnostic connector. For years now, cars have had wireless connectivity to the wider world for telematics functions like emergency notifications and remote diagnostics, typically implemented with a 3G cellular modem (although not for much longer). All the hardware is there for over-the-air updates of ECU firmware, and car makers could save huge amounts of money on a recall if they could address a problem with an over-the-air update rather than bringing millions of cars into garage workshops where mechanics apply the update. So why did car makers not support OTA updates years ago? There’s a simple answer: because of all the things that could go catastrophically wrong. Let’s look at a few ways.
The first is that a bug undiscovered during testing could cause a failure in the field for many vehicles. This happened on Mars with the Pathfinder priority inversion bug that caused a reset loop undiscovered in extensive testing on Earth. Fortunately that one was actually fixed with an OTA firmware update! But one robot on Mars and tens of millions of vehicles on the roads of Earth have different failure modes. This risk can be mitigated by rolling out updates slowly so that there is time to spot catastrophic failures (from time to time this happens with phone updates).
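A slow rollout of this kind can be sketched as a loop over growing batches that halts as soon as a batch shows too many failures. The batch sizes, the threshold, and the `update_batch` callback are all hypothetical. A real system would also need time between stages for slow-burning failures to show up.

```python
def staged_rollout(stage_sizes, update_batch, max_failure_fraction):
    """Update the fleet in growing batches, halting if a batch fails too often.

    update_batch(size) is a hypothetical callback that updates `size` vehicles
    and returns how many of them reported a failure.
    Returns (vehicles_updated, halted).
    """
    updated = 0
    for size in stage_sizes:
        failures = update_batch(size)
        updated += size
        if failures / size > max_failure_fraction:
            return updated, True   # stop before the problem reaches the whole fleet
    return updated, False


STAGES = [100, 1_000, 10_000, 100_000]

# A faulty release that breaks 2% of the vehicles it reaches.
faulty = staged_rollout(STAGES, lambda n: n * 2 // 100, max_failure_fraction=0.001)

# A healthy release with no failures.
healthy = staged_rollout(STAGES, lambda n: 0, max_failure_fraction=0.001)

print(faulty, healthy)
```

With the faulty release the rollout stops after the first hundred vehicles instead of reaching the whole fleet.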
The second major issue is that if the update process fails (perhaps a bug is tripped in the internet communications stack) it would leave cars unusable until updated by a physical tool that can connect directly to ECUs over the CAN bus at the lowest level of bootloader. This would be a catastrophe if it happened in any significant number of vehicles: there simply are not enough tow trucks to recover thousands of vehicles to workshops. Even then, if the update process is slow then the vehicle is out of action for some time.
A third major issue is security. The update process must be robust against accidental failures like power cuts during operation, but also must be robust against deliberate attacks by hackers. If the infrastructure for OTA updates is compromised, then it would be possible to override mitigations and deliberately brick vehicles.
A fourth major issue is safety. An ECU being updated is executing programming software that pushes data into regions of memory. That memory is also how I/O ports on electronics are accessed, and output ports directly control physical world machinery using actuators. For example, there are motors that steer the headlights to light corners, motors that pump washer fluid, and motors that drive the wipers. In some cases, the parking brake of the car is released by writing to an output port. If the flash programming process went wrong and wrote to the wrong parts of memory then the car being updated could have its brakes released, and if parked on a hill then this could be dangerous. For safety critical systems there are processes for writing safe software (in the car industry this is embodied in the ISO 26262 standard). Safety critical software development is rigorous and time-consuming and so expensive. Having to engineer the entire OTA update system (including the tools, the communications systems, and the bootloader) to safety critical standards is very costly.
Because of all this, car makers have been cautious and only recently started to provide OTA updates. BMW made the news recently after a customer’s car refused to update itself while parked on a hill. While this was treated as a joke, it’s actually a sign that BMW is taking the safety issue very seriously and that it has avoided the need to engineer the OTA update system to ISO 26262 by applying other mitigations. In this case, the mitigation is to refuse the update rather than risk the car rolling away if stray writes to memory released the parking brake. Rather than laugh, a better response would be to ask other car makers to describe their safety case for the firmware update process.
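A mitigation like this amounts to a precondition check before the update is allowed to start. The sketch below is a guess at the general shape and is not BMW’s actual logic. The signals, thresholds, and every condition other than the slope check are assumptions.

```python
from dataclasses import dataclass


@dataclass
class VehicleState:
    road_gradient_percent: float
    is_stationary: bool
    battery_volts: float


MAX_GRADIENT_PERCENT = 3.0   # hypothetical threshold
MIN_BATTERY_VOLTS = 12.2     # hypothetical threshold


def update_blockers(state: VehicleState) -> list:
    """Return the reasons an update must not start; an empty list means go ahead."""
    reasons = []
    if not state.is_stationary:
        reasons.append("vehicle is moving")
    if abs(state.road_gradient_percent) > MAX_GRADIENT_PERCENT:
        reasons.append("parked on a slope")
    if state.battery_volts < MIN_BATTERY_VOLTS:
        reasons.append("battery too low")
    return reasons


on_hill = update_blockers(VehicleState(8.0, True, 12.6))
on_flat = update_blockers(VehicleState(0.5, True, 12.6))
print(on_hill, on_flat)
```

The appeal of this approach is that the check is small and easy to reason about, and it removes the hazard instead of trying to prove that the whole update path can never trigger it.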
Security is being addressed in part by the automotive industry adopting a hardware security module (HSM). The industry has defined the Secure Hardware Extensions (SHE) standard and chips designed for automotive applications include these modules. They provide not just cryptographic functions (encryption, authentication, random numbers) but also secure key distribution and storage. One of the features of the SHE HSM is that firmware can be run through the module and it will check against a securely stored value that it is authentic. This allows what’s known as a chain of trust from the first bootloader, to a secondary bootloader, to the application firmware all to be validated using this hardware. This can be used to ensure that only firmware approved by the car maker can be programmed into an ECU.
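The chain of trust can be illustrated in software, with the caveat that the whole point of an HSM is that the key and the comparison live in hardware. In this sketch HMAC-SHA-256 from the standard library is a stand-in for whatever MAC the real module computes, and the key, stage names, and images are invented.

```python
import hashlib
import hmac


def mac(key: bytes, image: bytes) -> bytes:
    # Stand-in for the authentication code a hardware module would compute.
    return hmac.new(key, image, hashlib.sha256).digest()


def first_untrusted_stage(key: bytes, stages):
    """Check each boot stage in order against its stored value.

    stages is a list of (name, image, stored_mac) tuples in boot order.
    Returns the name of the first stage that fails, or None if all pass.
    """
    for name, image, stored_mac in stages:
        if not hmac.compare_digest(mac(key, image), stored_mac):
            return name   # boot must stop here: nothing after this is trusted
    return None


KEY = b"\x01" * 16   # hypothetical; a real key never leaves the hardware module

images = {
    "primary bootloader": b"stage one code",
    "secondary bootloader": b"stage two code",
    "application": b"application firmware",
}
stages = [(name, image, mac(KEY, image)) for name, image in images.items()]
clean = first_untrusted_stage(KEY, stages)

# Tamper with the application image but leave the stored value alone.
name, _, stored = stages[2]
tampered_stages = stages[:2] + [(name, b"malicious firmware", stored)]
tampered = first_untrusted_stage(KEY, tampered_stages)

print(clean, tampered)
```

Each stage is only run after it has been checked, so the trust placed in the first bootloader extends link by link to the application firmware.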
Caution and innovation
The story of OTA updates in cars is one example of a wider story of caution and innovation. The car industry is cautious because it has to be: a car is not a phone on wheels, it’s a safety critical mechatronic system on wheels. Two tons of speeding metal is not a consumer electronics device. And while from a driver’s perspective it might look as if the ‘car computer’ is like a phone, the critical parts of the car go unseen. Car platforms take several years and billions of dollars to develop, and cars based on the platform roll off the production lines for a decade. The engineering culture needed to do that reliably and safely without risking the entire company is so very different to the engineering culture of an MVP tech startup, and anyone wishing that car companies were like tech companies would cause a finger to curl on the monkey’s paw.