Reeves (1997): What Really Happened on Mars
On July 4, 1997, Mars Pathfinder landed on Mars and began sending back data. Then it started resetting. Repeatedly. The spacecraft’s watchdog timer — a hardware safety mechanism that reboots the computer if the software stops responding — was firing because a high-priority task couldn’t complete its work on time. The cause was priority inversion: a textbook real-time systems failure that had been known to theorists since the 1980s, occurring for the first time on another planet.
The author, Glenn Reeves, led the Pathfinder flight software team at JPL. His account is not a formal report — it is an email, sent to correct the record after a secondhand summary of a conference talk circulated widely. It is specific, first-person, and precise about what went wrong, why testing missed it, and how the team fixed it from 190 million kilometers away.
The Architecture
Mars Pathfinder ran on a single CPU on a VME bus, controlling the radio, camera, and a MIL-STD-1553 data bus connecting to cruise-stage and lander hardware. The operating system was VxWorks, a commercial real-time OS from Wind River Systems. The 1553 bus interface hardware was inherited from Cassini, and it came with a constraint: software must schedule bus activity at an 8 Hz rate. This dictated the fundamental software architecture.
Two tasks managed the 1553 bus on a repeating 0.125-second cycle:
| Task | Priority | Role |
|---|---|---|
| `bc_sched` | Highest (except VxWorks tExec) | Sets up 1553 bus transactions for the next cycle |
| `bc_dist` | Third highest | Collects and distributes data from completed transactions |
Between these two sat a task controlling entry and landing (second highest priority). Below bc_dist were various spacecraft housekeeping tasks, and below those were science tasks: imaging, image compression, and the ASI/MET meteorological instrument.
Most tasks accessed 1553 bus data through a double-buffered shared memory mechanism. The exception was ASI/MET, which received its data through VxWorks’ inter-process communication pipes, using the select() mechanism to wait for messages. This difference mattered.
The Inversion
Each 8 Hz cycle, bc_sched and bc_dist checked that the other had completed its work. If bc_dist had not finished before bc_sched activated, a hard deadline had been missed, and the system declared an error and reset. The failure sequence:
- The ASI/MET task (low priority) calls `select()` on a pipe. Internally, `select()` calls `pipeIoctl()`, which calls `selNodeAdd()`, which acquires a mutex semaphore to protect the list of file descriptors.
- ASI/MET is preempted before it can release the mutex. Several medium-priority tasks are ready to run, and they do.
- `bc_dist` (high priority) activates and tries to send new ASI/MET data via `pipeWrite()`. The write operation needs the same mutex. `bc_dist` blocks.
- The medium-priority tasks continue running. They don’t need the mutex, so they aren’t blocked. But their priority is higher than ASI/MET’s, so ASI/MET can’t run either. The mutex stays held.
- The next 8 Hz boundary arrives. `bc_sched` activates and finds that `bc_dist` has not completed its cycle. Hard deadline missed. Watchdog fires. System resets.
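
The failure sequence can be reproduced in a toy scheduler. The sketch below is not flight code. It assumes a tick-based model in which every step costs one tick and larger numbers mean higher priority (VxWorks itself numbers priorities the other way around). The names `Task`, `simulate`, `scenario`, and `finish_tick`, the single `medium` task, and the deadline of 8 ticks are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int   # larger = more urgent (assumption of this sketch)
    release: int    # tick at which the task becomes ready
    steps: list     # each entry is "lock", "work", or "unlock"; one tick each
    pc: int = 0     # index of the next step to execute

    def done(self):
        return self.pc >= len(self.steps)


def simulate(tasks, ticks):
    """Fixed-priority preemptive scheduling, one mutex, no priority inheritance."""
    owner = None        # task currently holding the mutex
    timeline = []       # name of the task that ran in each tick
    for now in range(ticks):
        ready = []
        for task in tasks:
            if task.release > now or task.done():
                continue
            # A task about to lock is blocked while another task owns the mutex.
            blocked = (task.steps[task.pc] == "lock"
                       and owner is not None and owner is not task)
            if not blocked:
                ready.append(task)
        if not ready:
            timeline.append("idle")
            continue
        running = max(ready, key=lambda task: task.priority)
        step = running.steps[running.pc]
        if step == "lock":
            owner = running
        elif step == "unlock":
            owner = None
        running.pc += 1
        timeline.append(running.name)
    return timeline


def scenario(medium_ticks, ticks=12):
    """Low-priority asi_met takes the mutex first; bc_dist needs it two ticks later."""
    tasks = [
        Task("asi_met", 1, 0, ["lock", "work", "unlock"]),
        Task("bc_dist", 3, 2, ["lock", "work", "unlock"]),
    ]
    if medium_ticks:
        tasks.append(Task("medium", 2, 1, ["work"] * medium_ticks))
    return simulate(tasks, ticks)


def finish_tick(timeline, name):
    """Tick count at which `name` has executed its last step."""
    return len(timeline) - timeline[::-1].index(name)


DEADLINE = 8  # hypothetical cycle boundary, in ticks

for load in (0, 6):
    done = finish_tick(scenario(load), "bc_dist")
    print(f"medium load {load}: bc_dist done at tick {done}, "
          f"deadline {'met' if done <= DEADLINE else 'missed'}")
```

In this model `bc_dist` finishes at tick 6 when no medium-priority work is pending and at tick 12 when six ticks of it arrive at the wrong moment. The medium task never touches the mutex, yet it is what pushes the high-priority task past its deadline.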
The result: the spacecraft rebooted. It reinitialized all hardware and software, terminated the current day’s activities, and resumed the next day. No collected science or engineering data was lost — data in RAM was preserved across resets. But each reset cost a day of planned operations.
Why Testing Missed It
Pre-launch testing focused on peak data rates and maximum science activity, the “best case” for throughput. The failure required a specific combination: the ASI/MET task collecting data at moderate rates while medium-priority tasks were heavily loaded. This was a realistic operational scenario, but it was not in the test matrix.
Reeves is direct about this: “We did not expect nor test the ‘better than we could have ever imagined’ case.” Surface data rates turned out to be higher than anticipated, and science activities were proportionally greater. The conditions that triggered the inversion were a consequence of the mission succeeding beyond expectations.
The team had seen the symptom before landing but could not reproduce it. It was not forgotten. It was deprioritized relative to entry and landing software — a defensible decision, given that the system was designed to survive resets. Reeves: “We had our priorities right.”
Diagnosis and Repair
The flight software contained diagnostic instrumentation that was left in by design. A trace/log facility, originally built by David Cummings to debug an early VxWorks port issue, had been extended by Wind River’s Lisa Stanley to instrument pipe services, message queues, interrupt handling, select() services, and the tExec task. The facility ran continuously in ring buffers, collecting data that could be dumped on command.
The bc_sched task was already coded to stop the trace collection and trigger a dump when it detected the missed-deadline error. When the team set up the same activities in the lab, they reproduced the failure in under 18 hours. Once the trace data showed the task scheduling sequence, the priority inversion was, in Reeves’ word, “obvious.”
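
A trace facility of this kind can be sketched as a fixed-size ring buffer that records continuously and is frozen when an error is detected, so the events leading up to the failure are not overwritten. The class below is an illustration only: `TraceRing`, its methods, and the event tuples are hypothetical and say nothing about how the Pathfinder or VxWorks instrumentation was actually structured.

```python
from collections import deque


class TraceRing:
    """Keeps the most recent `capacity` events; older ones are overwritten."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)  # deque drops the oldest entry when full
        self._frozen = False

    def record(self, tick, task, event):
        # Once frozen, new events are ignored so the history before the error survives.
        if not self._frozen:
            self._events.append((tick, task, event))

    def freeze(self):
        """Stop collection, e.g. when a missed deadline is detected."""
        self._frozen = True

    def dump(self):
        """Return the retained events, oldest first."""
        return list(self._events)


ring = TraceRing(capacity=3)
ring.record(0, "asi_met", "mutex take")
ring.record(1, "medium", "run")
ring.record(2, "bc_dist", "mutex block")
ring.record(3, "medium", "run")
ring.freeze()                                # error detected: stop tracing
ring.record(4, "bc_sched", "reset")          # ignored after the freeze
for entry in ring.dump():
    print(entry)
```

The design choice worth noting is the freeze: a ring that keeps recording through the reset would overwrite exactly the scheduling history needed for diagnosis.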
The fix: change a global configuration variable in VxWorks to enable priority inheritance on the semaphores created by selectLib(). When a high-priority task blocks on a mutex held by a low-priority task, the RTOS temporarily raises the low-priority task’s scheduling priority to match, allowing it to finish and release the mutex without being preempted by medium-priority tasks.
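
The inheritance rule itself is small enough to state as code. The following is a minimal sketch of the rule, not of the VxWorks implementation: the function name, the dictionaries, and the mutex label `select_mutex` are assumptions made for illustration, and larger numbers are taken to mean higher priority.

```python
def effective_priority(task, base, holds, waiting_on):
    """Priority a task runs at under priority inheritance (larger = more urgent).

    base:       task name -> assigned priority
    holds:      task name -> list of mutexes the task currently owns
    waiting_on: task name -> the mutex that task is blocked on
    Assumes there is no deadlock cycle among the waiting tasks.
    """
    priority = base[task]
    for mutex in holds.get(task, ()):
        for waiter, wanted in waiting_on.items():
            if wanted == mutex:
                # Inherit from every waiter, including what that waiter has inherited.
                priority = max(priority,
                               effective_priority(waiter, base, holds, waiting_on))
    return priority


base = {"asi_met": 1, "medium": 2, "bc_dist": 3}

# bc_dist is blocked on the mutex that asi_met holds.
holds = {"asi_met": ["select_mutex"]}
waiting_on = {"bc_dist": "select_mutex"}
print(effective_priority("asi_met", base, holds, waiting_on))  # 3: now outranks "medium"

# After asi_met releases the mutex, it drops back to its assigned priority.
print(effective_priority("asi_met", base, {}, {}))             # 1
```

While the boost is in effect the medium-priority task can no longer preempt the mutex holder, which is what bounds the time `bc_dist` spends blocked.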
The change was not trivial to validate. The configuration variable was global — it affected all select() semaphores in the system, not just the one used by the ASI/MET pipe. Wind River analyzed the potential impacts and concluded that performance impact was minimal and that select() behavior would not change as long as only one task waited on any given file descriptor (true in Pathfinder’s case). The team tested extensively before uploading.
The corrected software was sent to Mars via a delta-patching mechanism — sending only the differences between the onboard and ground copies, with extensive validation on both ends. It was the first software patch transmitted to another planet.
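
The idea of sending only the differences, with validation on both ends, can be illustrated with a byte-level delta guarded by checksums. This is a sketch under stated assumptions: Reeves' account does not describe the patch format, so the run-based encoding, the use of CRC-32, and the function names `make_delta` and `apply_delta` are all hypothetical.

```python
import zlib


def make_delta(old: bytes, new: bytes):
    """Ground side: return (crc_of_old, crc_of_new, new_length, runs).

    runs is a list of (offset, replacement_bytes) covering every byte that differs.
    """
    runs = []
    common = min(len(old), len(new))
    i = 0
    while i < common:
        if old[i] != new[i]:
            start = i
            while i < common and old[i] != new[i]:
                i += 1
            runs.append((start, new[start:i]))
        else:
            i += 1
    if len(new) > common:
        runs.append((common, new[common:]))   # bytes appended past the old image
    return zlib.crc32(old), zlib.crc32(new), len(new), runs


def apply_delta(old: bytes, delta):
    """Onboard side: validate the starting image, patch it, validate the result."""
    crc_old, crc_new, new_len, runs = delta
    if zlib.crc32(old) != crc_old:
        raise ValueError("onboard image does not match the copy the delta was built from")
    image = bytearray(old[:new_len].ljust(new_len, b"\0"))
    for offset, data in runs:
        image[offset:offset + len(data)] = data
    patched = bytes(image)
    if zlib.crc32(patched) != crc_new:
        raise ValueError("patched image failed validation")
    return patched


old_image = b"priority_inherit=0;mode=A"
new_image = b"priority_inherit=1;mode=A"
delta = make_delta(old_image, new_image)
print(len(delta[3]), "changed run(s)")
assert apply_delta(old_image, delta) == new_image
```

Checking the checksum of the starting image before patching matters as much as checking the result: a delta applied to the wrong base produces a plausible-looking but corrupt image.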
Design for Recovery
The watchdog/reset mechanism was not a defect. It was a deliberate safety feature. The system was designed to detect when the software stopped meeting its deadlines and to recover automatically. Scientific data already collected was preserved in RAM across resets. The spacecraft was built to handle multiple resets during atmospheric entry, the most violent and time-critical phase of the mission.
This design philosophy made the priority inversion a nuisance rather than a catastrophe. The spacecraft kept operating. Science data kept being collected. The ground team had time to diagnose the problem, develop a fix, test it, and upload it — all while the mission continued.
Contrast this with Ariane 5, where the SRI’s exception handler had one response to any error: shut down. In that architecture, the first failure was the last. Pathfinder’s architecture assumed failures would happen and built recovery into the nominal operating mode.
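
The detect-and-recover pattern described in this section can be sketched in a few lines. The model below is purely illustrative and assumes a tick-based loop: `Watchdog`, `run`, the `hang_at` parameter, and the `preserved` list (standing in for data that survives a reset) are invented names, not Pathfinder's design.

```python
class Watchdog:
    """Expires unless kick() is called at least once every `timeout` ticks."""

    def __init__(self, timeout, now=0):
        self.timeout = timeout
        self.last_kick = now

    def kick(self, now):
        self.last_kick = now

    def expired(self, now):
        return now - self.last_kick > self.timeout


def run(ticks, timeout, hang_at):
    """Simulate software that stops responding at tick `hang_at`."""
    preserved = []          # stands in for data kept across resets
    resets = 0
    hung = False            # volatile state, cleared by a reset
    dog = Watchdog(timeout)
    for now in range(ticks):
        if dog.expired(now):
            resets += 1
            hung = False                    # reinitialize volatile state
            dog = Watchdog(timeout, now)    # restart the watchdog
        if now == hang_at:
            hung = True
        if not hung:
            preserved.append(now)           # useful work continues
            dog.kick(now)
    return resets, preserved


resets, data = run(ticks=10, timeout=2, hang_at=4)
print(resets, data)
```

In this run the software hangs at tick 4, the watchdog expires at tick 6, and work resumes with everything recorded before the hang still in place. An architecture whose only response to an error is to shut down has no equivalent of the second half of that loop.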
COTS Is Not Free
The Pathfinder team did not make a conscious decision to omit priority inheritance on the select() semaphore. They didn’t know the semaphore existed. The interface between application code and COTS operating system was an abstraction boundary that hid a scheduling hazard.
This is not an argument against COTS. VxWorks worked. Wind River’s engineers were, by Reeves’ account, excellent partners who delivered an RS6000 port in three months and provided critical support during the anomaly investigation. The argument is that using COTS software does not eliminate the need to understand the platform’s behavior at the level where your application’s correctness depends on it.
The Four Failure Modes
This collection now contains four spacecraft software failure analyses. Each documents a different failure mechanism, but they share a structural pattern:
| Document | Failure Mode | Root Cause Category |
|---|---|---|
| Lions/Ariane 5 (1996) | Type overflow in reused code | Software reuse without revalidation |
| Reeves/Pathfinder (1997) | Priority inversion | RTOS scheduling / COTS defaults |
| MCO (1999) | Units mismatch | Interface specification |
| Garman/Shuttle (1981) | Timing synchronization | Async/sync interaction |
All four failures occurred at an interface boundary — between components, between teams, between specifications and implementations. In each case, the individual components worked correctly. The Ariane 5 SRI detected its own overflow. The MCO navigation software computed trajectories accurately from the data it received. The Shuttle PASS operated flawlessly; only the BFS synchronization failed. Pathfinder’s bc_dist and ASI/MET tasks each did their job. The failure lived in the gap between them.
Pathfinder stands apart in one respect: it was the only one of the four to hit its software failure in flight and survive it. (The Shuttle failure scrubbed a launch attempt on the pad; Ariane 5 and MCO were lost.) Pathfinder survived not because the bug was less severe, but because the system was designed to recover from exactly this kind of problem. The watchdog timer, the RAM-preserving reset, the diagnostic instrumentation left in the flight code, and the delta-patching capability were not afterthoughts. They were engineering decisions made before launch, by a team that assumed things would go wrong and planned for what to do when they did.