Reeves (1997): What Really Happened on Mars

On July 4, 1997, Mars Pathfinder landed on Mars and began sending back data. Then it started resetting. Repeatedly. The spacecraft’s watchdog timer — a hardware safety mechanism that reboots the computer if the software stops responding — was firing because a high-priority task couldn’t complete its work on time. The cause was priority inversion: a textbook real-time systems failure that had been known to theorists since the 1980s, occurring for the first time on another planet.

The author, Glenn Reeves, led the Pathfinder flight software team at JPL. His account is not a formal report — it is an email, sent to correct the record after a secondhand summary of a conference talk circulated widely. It is specific, first-person, and precise about what went wrong, why testing missed it, and how the team fixed it from 190 million kilometers away.

The Architecture

Mars Pathfinder ran on a single CPU on a VME bus, controlling the radio, camera, and a MIL-STD-1553 data bus connecting to cruise-stage and lander hardware. The operating system was VxWorks, a commercial real-time OS from Wind River Systems. The 1553 bus interface hardware was inherited from Cassini, and it came with a constraint: software must schedule bus activity at an 8 Hz rate. This dictated the fundamental software architecture.

Two tasks managed the 1553 bus on a repeating 0.125-second cycle:

Task	Priority	Role
`bc_sched`	Highest (except VxWorks `tExec`)	Sets up 1553 bus transactions for the next cycle
`bc_dist`	Third highest	Collects and distributes data from completed transactions

Between these two sat a task controlling entry and landing (second highest priority). Below bc_dist were various spacecraft housekeeping tasks, and below those were science tasks: imaging, image compression, and the ASI/MET meteorological instrument.

Most tasks accessed 1553 bus data through a double-buffered shared memory mechanism. The exception was ASI/MET, which received its data through VxWorks’ inter-process communication pipes, using the select() mechanism to wait for messages. This difference mattered.
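Reeves does not describe how the double-buffered mechanism was implemented, so the following is only a minimal sketch of the general technique. The class name, method names, and the `battery_v` key are all hypothetical. The point it illustrates is that readers of a double buffer never wait on a lock, whereas the pipe path used by ASI/MET goes through a shared mutex.

```python
class DoubleBuffer:
    """Two slots: the writer fills the back slot, readers only see the front slot."""

    def __init__(self):
        self._slots = [{}, {}]
        self._front = 0  # index of the slot readers may use

    def write(self, key, value):
        # Producer side (a stand-in for bc_dist) touches the back slot only.
        self._slots[1 - self._front][key] = value

    def publish(self):
        # Flip once per cycle; the old front slot is cleared and becomes the back slot.
        self._front = 1 - self._front
        self._slots[1 - self._front] = {}

    def read(self, key):
        # Consumer side never blocks and never sees a half-written cycle.
        return self._slots[self._front].get(key)


buf = DoubleBuffer()
buf.write("battery_v", 31.2)              # hypothetical telemetry value
before_publish = buf.read("battery_v")    # not visible until the flip
buf.publish()
after_publish = buf.read("battery_v")     # visible after the flip
```

In this model a slow or preempted reader cannot hold up the writer, which is the property the pipe-based path lacked.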

The Inversion

Each 8 Hz cycle, bc_sched and bc_dist checked that the other had completed its work. If bc_dist had not finished before bc_sched activated — a hard deadline — the system declared an error and reset. The failure sequence:

1. The ASI/MET task (low priority) calls select() on a pipe. Internally, select() calls pipeIoctl(), which calls selNodeAdd(), which acquires a mutex semaphore to protect the list of file descriptors.
2. ASI/MET is preempted before it can release the mutex. Several medium-priority tasks are ready to run, and they do.
3. bc_dist (high priority) activates and tries to send new ASI/MET data via pipeWrite(). The write operation needs the same mutex. bc_dist blocks.
4. The medium-priority tasks continue running. They don’t need the mutex, so they aren’t blocked. But their priority is higher than ASI/MET’s, so ASI/MET can’t run either. The mutex stays held.
5. The next 8 Hz boundary arrives. bc_sched activates and finds that bc_dist has not completed its cycle. Hard deadline missed. Watchdog fires. System resets.
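The sequence above can be replayed with a toy scheduler. This is a sketch under stated assumptions, not a model of VxWorks. The tick counts, the numeric priorities, the single `medium` task standing in for several, and the eight-tick cycle are all invented for illustration. It uses fixed-priority preemptive scheduling with one mutex and no priority inheritance.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int      # larger number = more urgent
    release: int       # tick at which the task becomes ready
    work: int          # ticks of CPU time still needed
    needs_mutex: bool  # True if all of its work is done while holding the mutex
    done_at: int = -1  # tick at which it finished; -1 means unfinished


def run_cycle(tasks, deadline):
    """Fixed-priority preemptive scheduling, one tick at a time, no inheritance."""
    holder = None  # task currently holding the mutex, if any
    timeline = []
    for tick in range(deadline):
        ready = [t for t in tasks if t.release <= tick and t.work > 0]
        # A task that needs the mutex cannot run while another task holds it.
        runnable = [t for t in ready
                    if not t.needs_mutex or holder is None or holder is t]
        if not runnable:
            timeline.append("idle")
            continue
        current = max(runnable, key=lambda t: t.priority)
        if current.needs_mutex:
            holder = current
        current.work -= 1
        if current.work == 0:
            current.done_at = tick + 1
            if holder is current:
                holder = None
        timeline.append(current.name)
    return timeline


def scenario(medium_load):
    asi_met = Task("asi_met", priority=1, release=0, work=2, needs_mutex=True)
    bc_dist = Task("bc_dist", priority=9, release=2, work=1, needs_mutex=True)
    tasks = [asi_met, bc_dist]
    if medium_load:
        tasks.append(Task("medium", priority=5, release=1, work=10, needs_mutex=False))
    timeline = run_cycle(tasks, deadline=8)
    missed = bc_dist.done_at == -1  # the check bc_sched makes at the cycle boundary
    return timeline, missed


quiet_timeline, quiet_missed = scenario(medium_load=False)
busy_timeline, busy_missed = scenario(medium_load=True)
print(busy_timeline, "deadline missed:", busy_missed)
```

With no medium-priority load, ASI/MET finishes its two ticks and bc_dist runs at once. With the load present, the medium task occupies every tick after the first, ASI/MET never gets back on the CPU to release the mutex, and bc_dist is still blocked when the cycle ends.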

The result: the spacecraft rebooted. It reinitialized all hardware and software, terminated the current day’s activities, and resumed the next day. No collected science or engineering data was lost — data in RAM was preserved across resets. But each reset cost a day of planned operations.

Why Testing Missed It

Pre-launch testing focused on peak data rates and maximum science activity — the “best case” for throughput. The failure required a specific combination: the ASI/MET task collecting data at moderate rates while medium-priority tasks were heavily loaded. This was a realistic operational scenario, but it was not in the test matrix.

Reeves is direct about this: “We did not expect nor test the ‘better than we could have ever imagined’ case.” Surface data rates turned out to be higher than anticipated, and science activities were proportionally greater. The conditions that triggered the inversion were a consequence of the mission succeeding beyond expectations.

The team had seen the symptom before landing but could not reproduce it. It was not forgotten. It was deprioritized relative to entry and landing software — a defensible decision, given that the system was designed to survive resets. Reeves: “We had our priorities right.”

Diagnosis and Repair

The flight software contained diagnostic instrumentation that was left in by design. A trace/log facility, originally built by David Cummings to debug an early VxWorks port issue, had been extended by Wind River’s Lisa Stanley to instrument pipe services, message queues, interrupt handling, select() services, and the tExec task. The facility ran continuously in ring buffers, collecting data that could be dumped on command.

The bc_sched task was already coded to stop the trace collection and trigger a dump when it detected the missed-deadline error. When the team set up the same activities in the lab, they reproduced the failure in under 18 hours. Once the trace data showed the task scheduling sequence, the priority inversion was, in Reeves’ word, “obvious.”
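The idea of a continuously running trace that is frozen at the moment of failure can be sketched in a few lines. This is an illustration of the technique only. The flight implementation is not described in the source, and the class name, capacity, and event strings here are hypothetical.

```python
from collections import deque


class TraceLog:
    """Fixed-size ring buffer of (tick, event) records that can be frozen."""

    def __init__(self, capacity):
        # deque with maxlen silently discards the oldest entry when full.
        self._events = deque(maxlen=capacity)
        self._frozen = False

    def record(self, tick, event):
        if not self._frozen:
            self._events.append((tick, event))

    def freeze(self):
        # Called by the error detector so later activity cannot overwrite the evidence.
        self._frozen = True

    def dump(self):
        return list(self._events)


trace = TraceLog(capacity=4)
events = ["asi_met takes mutex", "medium runs", "bc_dist blocks",
          "medium runs", "medium runs"]
for tick, event in enumerate(events):
    trace.record(tick, event)

trace.freeze()              # stand-in for bc_sched detecting the missed deadline
trace.record(5, "reboot")   # ignored: the log is frozen
snapshot = trace.dump()
```

Sizing matters in a design like this: the buffer holds only the most recent entries, so it has to be large enough that the start of the failure sequence is still present when the freeze happens.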

The fix: change a global configuration variable in VxWorks to enable priority inheritance on the semaphores created by selectLib. When a high-priority task blocks on a mutex held by a low-priority task, the RTOS temporarily raises the low-priority task’s scheduling priority to match, allowing it to finish and release the mutex without being preempted by medium-priority tasks. Once the mutex is released, the task drops back to its original priority.
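The bookkeeping behind priority inheritance is small. The sketch below models it for a single mutex. The class names and numeric priorities are hypothetical, and a real RTOS must also handle tasks that hold several mutexes at once, which this version does not. In VxWorks the corresponding behavior is requested with the SEM_INVERSION_SAFE option when a mutex semaphore is created with semMCreate(), combined with SEM_Q_PRIORITY.

```python
class Task:
    def __init__(self, name, base_priority):
        self.name = name
        self.base_priority = base_priority  # larger number = more urgent
        self.priority = base_priority       # effective priority seen by the scheduler


class InheritanceMutex:
    def __init__(self):
        self.holder = None
        self.waiters = []

    def acquire(self, task):
        """Return True if the task got the mutex, False if it must block."""
        if self.holder is None:
            self.holder = task
            return True
        self.waiters.append(task)
        # Inheritance: the holder runs at the priority of its most urgent waiter.
        self.holder.priority = max(self.holder.priority, task.priority)
        return False

    def release(self):
        """Drop the boost and hand the mutex to the most urgent waiter, if any."""
        self.holder.priority = self.holder.base_priority
        if not self.waiters:
            self.holder = None
            return None
        nxt = max(self.waiters, key=lambda t: t.priority)
        self.waiters.remove(nxt)
        self.holder = nxt
        return nxt


asi_met = Task("asi_met", 1)
medium = Task("medium", 5)
bc_dist = Task("bc_dist", 9)
mutex = InheritanceMutex()

mutex.acquire(asi_met)                 # low-priority task takes the mutex
blocked = not mutex.acquire(bc_dist)   # high-priority task has to wait
boosted = asi_met.priority             # ASI/MET now runs at bc_dist's priority
# bc_dist is blocked, so the scheduler chooses between the other two tasks.
runs_next = max([asi_met, medium], key=lambda t: t.priority).name
new_holder = mutex.release()           # boost removed, bc_dist takes the mutex
```

While bc_dist waits, ASI/MET outranks the medium-priority task, so it gets the CPU, finishes its critical section, and releases the mutex within the cycle.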

The change was not trivial to validate. The configuration variable was global — it affected all select() semaphores in the system, not just the one used by the ASI/MET pipe. Wind River analyzed the potential impacts and concluded that performance impact was minimal and that select() behavior would not change as long as only one task waited on any given file descriptor (true in Pathfinder’s case). The team tested extensively before uploading.

The corrected software was sent to Mars via a delta-patching mechanism — sending only the differences between the onboard and ground copies, with extensive validation on both ends. It was the first software patch transmitted to another planet.
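The general shape of delta patching with validation can be sketched as follows. The source does not describe the Pathfinder patch format, so everything here is an assumption: the fixed four-byte blocks, the CRC-32 check, the function names, and the toy `selopt` image contents.

```python
import zlib


def make_delta(old: bytes, new: bytes, block: int = 4):
    """List (offset, bytes) pairs for each fixed-size block that differs."""
    if len(old) != len(new):
        raise ValueError("this sketch only handles images of equal length")
    delta = []
    for offset in range(0, len(new), block):
        chunk = new[offset:offset + block]
        if old[offset:offset + block] != chunk:
            delta.append((offset, chunk))
    return delta


def apply_delta(image: bytes, delta, expected_crc: int) -> bytes:
    """Apply the changed blocks, then refuse the result if the checksum is wrong."""
    patched = bytearray(image)
    for offset, chunk in delta:
        patched[offset:offset + len(chunk)] = chunk
    result = bytes(patched)
    if zlib.crc32(result) != expected_crc:
        raise ValueError("patched image failed checksum; keep the old image")
    return result


ground_old = b"selopt=0;other=7"   # toy stand-in for the onboard image
ground_new = b"selopt=1;other=7"   # toy stand-in for the corrected image

delta = make_delta(ground_old, ground_new)   # only the changed block is sent
onboard = apply_delta(ground_old, delta, zlib.crc32(ground_new))
```

Only one four-byte block crosses the link in this example, and the receiver accepts the patched image only if it matches the checksum computed on the ground.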

Design for Recovery

The watchdog/reset mechanism was not a defect. It was a deliberate safety feature. The system was designed to detect when the software stopped meeting its deadlines and to recover automatically. Scientific data already collected was preserved in RAM across resets. The spacecraft was built to handle multiple resets during atmospheric entry — the most violent and time-critical phase of the mission.

This design philosophy made the priority inversion a nuisance rather than a catastrophe. The spacecraft kept operating. Science data kept being collected. The ground team had time to diagnose the problem, develop a fix, test it, and upload it — all while the mission continued.

Contrast this with Ariane 5, where the SRI’s exception handler had one response to any error: shut down. In that architecture, the first failure was the last. Pathfinder’s architecture assumed failures would happen and built recovery into the nominal operating mode.

COTS Is Not Free

The Pathfinder team did not make a conscious decision to omit priority inheritance on the select() semaphore. They didn’t know the semaphore existed. The interface between application code and COTS operating system was an abstraction boundary that hid a scheduling hazard.

This is not an argument against COTS. VxWorks worked. Wind River’s engineers were, by Reeves’ account, excellent partners who delivered an RS6000 port in three months and provided critical support during the anomaly investigation. The argument is that using COTS software does not eliminate the need to understand the platform’s behavior at the level where your application’s correctness depends on it.

The Four Failure Modes

This collection now contains four spacecraft software failure analyses. Each documents a different failure mechanism, but they share a structural pattern:

Document	Failure Mode	Root Cause Category
Lions/Ariane 5 (1996)	Type overflow in reused code	Software reuse without revalidation
Reeves/Pathfinder (1997)	Priority inversion	RTOS scheduling / COTS defaults
MCO (1999)	Units mismatch	Interface specification
Garman/Shuttle (1981)	Timing synchronization	Async/sync interaction

All four failures occurred at an interface boundary — between components, between teams, between specifications and implementations. In each case, the individual components worked correctly. The Ariane 5 SRI detected its own overflow. The MCO navigation software computed trajectories accurately from the data it received. The Shuttle PASS operated flawlessly; only the BFS synchronization failed. Pathfinder’s bc_dist and ASI/MET tasks each did their job. The failure lived in the gap between them.

Pathfinder stands apart in one respect: it was the only one of the four to hit its software failure in flight and carry on with the mission. Not because the bug was less severe, but because the system was designed to recover from exactly this kind of problem. The watchdog timer, the RAM-preserving reset, the diagnostic instrumentation left in the flight code, and the delta-patching capability were not afterthoughts. They were engineering decisions made before launch, by a team that assumed things would go wrong and planned for what to do when they did.

Sha, Rajkumar & Lehoczky (1990): Priority Inheritance. The theoretical solution to the exact failure Reeves diagnosed, and the algorithm behind the one-flag fix.

Garman (1981): The BUG Heard 'Round the World. Another scheduling failure in flight software: async/sync timing interaction.

Lions (1996): Ariane 5 Flight 501. Software reuse failure: the mission that did not survive its bug.

MCO (1999): Mars Climate Orbiter. Interface specification failure: units mismatch over 9 months.