Garman (1981): The BUG Heard 'Round the World

On April 10, 1981, twenty minutes before the first planned launch of the Space Shuttle, astronauts and technicians attempted to initialize the Backup Flight System in the fifth onboard computer. It refused to synchronize. The launch was scrubbed. Two days later, after a power cycle of the computers cleared the problem, STS-1 launched successfully — but it took eight more hours of post-launch analysis before anyone fully understood what had gone wrong.

The author, John “Jack” Garman, was Deputy Chief of NASA’s Spacecraft Software Division. He had been the engineer in Mission Control who identified and cleared the 1202/1201 program alarms during Apollo 11’s lunar descent twelve years earlier. His account bridges two eras of spaceflight software and remains one of the clearest practitioner-written failure analyses in the literature.

The Space Shuttle Orbiter carried five IBM AP-101 General Purpose Computers (GPCs). During critical flight phases, four ran identical copies of the Primary Avionics Software System (PASS), developed by IBM Federal Systems Division in Houston. The fifth ran the Backup Flight System (BFS), independently developed by Rockwell International in Downey, California.

  System   GPCs                 OS Design                      Developer
  PASS     4 (identical loads)  Asynchronous, priority-driven  IBM Federal Systems
  BFS      1                    Synchronous, time-slotted      Rockwell International

Both systems used the same HAL/S programming language (by Intermetrics, Cambridge, MA), the same requirements specifications, and the same target hardware. The independent development of the BFS was a deliberate defense against common-mode software failures — a catastrophic bug in the PASS could bring down all four primary computers simultaneously, turning the Orbiter into what Garman calls “an inert mass of tiles, wires, and airframe.”

The BFS, when not in active control, maintained readiness by listening to data bus traffic from the PASS-controlled GPCs. By eavesdropping on sensor data fetches, the BFS could independently track vehicle state and be ready for an instant takeover if the crew switched control.

This created a fundamental interface problem. The BFS (synchronous, time-slotted) needed to predict exactly when the PASS (asynchronous, priority-driven) would perform its data fetches. If the BFS heard unexpected traffic on a bus, it would stop listening to that entire bus to prevent “pollution” — a PASS failure contaminating the backup’s state.
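The pollution defense can be sketched as a simple rule over observed bus transactions. This is a minimal Python illustration; the `(bus, time_slot)` representation and all names are assumptions for the sketch, not details from the paper:

```python
# Sketch of the BFS "stop listening on unexpected traffic" rule.
# The (bus, time_slot) transaction model is an illustrative assumption.

def bfs_listen(expected_schedule, observed_traffic):
    """Return the set of buses the BFS keeps listening to."""
    listening = {bus for bus, _ in expected_schedule}
    for bus, time_slot in observed_traffic:
        if (bus, time_slot) not in expected_schedule:
            # Unexpected traffic: abandon the whole bus, so that a
            # PASS failure cannot "pollute" the backup's state.
            listening.discard(bus)
    return listening

expected = {(1, 0), (1, 1), (3, 0), (3, 1)}
observed = {(1, 0), (3, 2)}   # traffic on bus 3 in an unexpected slot
assert bfs_listen(expected, observed) == {1}   # bus 3 abandoned
```

The key design choice this models: the BFS does not try to reinterpret unexpected traffic; it simply abandons the bus, trading tracking coverage for isolation.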

To accommodate the BFS, the PASS was modified to schedule most cyclic processing relative to a cycle counter in a high-priority system process, creating what Garman describes as an “apparent synchronous operation” layered on top of the asynchronous executive. The four PASS computers maintained bit-for-bit identical data using eight sync state codes and roughly 6% processing overhead.

But asynchronous and synchronous systems, Garman observes, “don’t mix well.”

When the first GPC is powered on and the PASS is loaded, it synchronizes its cyclic processing with the vehicle’s telemetry system:

  1. Read time from the telemetry system (which samples the same central clock as the GPCs)
  2. Calculate the telemetry phase relative to the central clock
  3. Compute a future start time that will synchronize PASS processing with telemetry output
  4. Initiate a high-priority system process at that start time
  5. All other cyclic processing starts relative to that process and its cycle counter

The BFS then uses the cycle counter (passed on the data bus) to know when PASS data fetches will occur.
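The five steps above can be sketched as a start-time calculation. This is a hypothetical Python illustration: `CYCLE_PERIOD`, the 40 ms value, and the function names are assumptions, not details from the paper (the real PASS was written in HAL/S):

```python
# Simplified sketch of PASS startup phasing (illustrative only).
# All names and constants are assumptions about the mechanism.

CYCLE_PERIOD = 0.040  # assumed 40 ms minor cycle, for illustration

def compute_start_time(telemetry_time: float, current_time: float) -> float:
    """Pick a future start time phase-aligned with telemetry output."""
    # Step 2: telemetry phase relative to the central clock
    phase = telemetry_time % CYCLE_PERIOD
    # Step 3: the next cycle boundary after "now" with that phase
    cycles_elapsed = (current_time - phase) // CYCLE_PERIOD
    return phase + (cycles_elapsed + 1) * CYCLE_PERIOD

# Step 4/5: the high-priority master process begins at this time, and
# every other cyclic process is scheduled off its cycle counter.
start = compute_start_time(telemetry_time=100.013, current_time=100.020)
assert start > 100.020  # the computed start time must lie in the future
```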

A critical detail: the four PASS GPCs cannot read their hardware clocks directly for time-critical decisions. If they did, each would get a slightly different value, and operations like “fire pyrotechnics on the first cycle after 3:00” could “sliver” — some GPCs acting on one cycle, some on the next.

Instead, the GPCs use the top entry of the operating system’s timer queue as a deterministic clock proxy. With hundreds of cyclic processes scheduled every second, the top entry is always a close approximation of current time, always slightly in the future, and always bit-for-bit identical across all redundant GPCs.

When the very first GPC powers on, the timer queue should be empty — no processes are running. The queue is initialized with a known fill pattern, and a test for that pattern serves as the “first GPC on” detection. In that case, the GPC is allowed to use its hardware clock directly.
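The clock-proxy rule, including the fill-pattern test for “first GPC on,” can be sketched as follows. The sentinel value and queue layout are illustrative assumptions, not the actual PASS data structures:

```python
# Sketch of the timer-queue "clock proxy" idea (all structural details
# here are assumptions; only the rule itself is from Garman's account).

FILL_PATTERN = 0xDEADBEEF  # assumed sentinel marking an untouched queue

def proxy_time(timer_queue, hardware_clock):
    """Return a time value that is bit-for-bit identical across GPCs."""
    if timer_queue and timer_queue[0] != FILL_PATTERN:
        # Normal case: the earliest scheduled process is a slightly-future
        # approximation of "now", deterministic and identical on every GPC.
        return timer_queue[0]
    # "First GPC on": the queue still holds the fill pattern, so no
    # processes exist -- only then may the hardware clock be read directly.
    return hardware_clock

assert proxy_time([FILL_PATTERN], 12.5) == 12.5        # first GPC on
assert proxy_time([12.507, 12.511], 12.5) == 12.507    # normal operation
```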

The timer queue was not empty.

About two years before the flight, a common subroutine was adopted for data bus initialization. It was called prior to the start-time calculation. The subroutine happened to contain a delay that placed a time entry in the timer queue. Nobody noticed — what does a bus initialization routine have to do with a telemetry phasing calculation?

At that point, the delay was small enough that the queue entry was close to current time. The start-time calculation, reading the queue as if it were the clock, got a value only slightly in the future. No harm done.

About a year later — roughly a year before the flight — the delay constant was increased. The change was made to prevent a CPU overload during flight control processing. “Just a constant in the code,” Garman notes. But the increase pushed the timer queue entry further into the future.

Now the start-time calculation, reading that future queue entry as if it were current time, could compute a start time that it believed was in the past. “Past” is not a physical observation for a computer — it is the result of a subtraction: start_time - current_time < 0. When the “current time” is actually a future time from the queue, a perfectly valid start time can appear to have already passed.

When the operating system receives a start time in the past, it does the sensible thing: it slips the process forward by the number of cycles needed to put it in the future — like an alarm clock that wraps around. The master system process started one cycle late. All processing tied to the cycle counter was one cycle late.

Uplink polling — one of the few processes not started relative to the cycle counter — remained on time. To the BFS, it appeared one cycle early: unexpected bus traffic on strings 1 and 3. The BFS stopped listening to those strings and never achieved synchronization.

Garman catalogs the reasons with clear-eyed precision:

  • Low probability — 1 in 67 per cold initialization, never observed on the vehicle or in labs before flight day
  • Test shortcuts — most simulations used reset or restart points rather than full IPL (Initial Program Load), bypassing the vulnerable startup code entirely
  • Cross-domain invisibility — the connection between a delay constant in a bus initialization routine and the start-time calculation ran through the timer queue, an implicit coupling that no static analyzer could trace. “No ‘mapping’ analyzer built today could have found that linkage.”
  • Late window opening — the probability window was created by the delay constant change ~1 year before flight, after most initialization testing was complete
  • Dismissed lab incident — a similar failure apparently occurred in a lab ~4 months before flight, but was attributed to a lab setup artifact or masked by another software change

The bug was latching: once present, the one-cycle offset was self-maintaining through all GPC reconfigurations. If absent (correct initialization), the correct phasing was equally stable. This meant the fix was straightforward — power-cycle the GPCs and check the phasing — but it also meant the problem was invisible during normal operation.

Garman does not dwell on his Apollo experience in this paper, but the connection illuminates both programs. During Apollo 11’s lunar descent, the AGC’s priority-driven executive shed lower-priority tasks when a hardware rendezvous radar interface flooded it with interrupts — the 1202 and 1201 alarms that Garman, then a 24-year-old engineer in Mission Control, recognized as non-fatal.

The Apollo AGC was a single computer per spacecraft. Its reliability came from software design: a priority executive that degraded gracefully under overload, restart protection that preserved critical state, and an astronaut interface (the DSKY) that kept the crew informed.

The Shuttle took the opposite approach: reliability through hardware replication. Four identical computers voting, a fifth with independent software, 25 data buses, four strings of sensors and effectors. The software — the PASS — was more complex by orders of magnitude, and the interaction between the synchronous BFS and the asynchronous PASS introduced failure modes that did not exist in either system alone.

Garman’s implicit observation: the Shuttle traded one kind of complexity (a single computer that must never fail) for another (multiple computers that must agree perfectly). The bug that scrubbed STS-1 could not have existed in the Apollo architecture. It was a product of the very redundancy designed to prevent catastrophic failure.

Garman’s concluding sections are a plea to the software engineering community, written with the authority of someone who has shipped flight software for two decades:

Software development is iterative, not sequential. The waterfall model — requirements complete before design, design complete before code — assumes “the project knows what it wants out of the software in the beginning.” Garman’s experience: “Software development from my perspective is almost never that way.”

The real problem is reliable modification. Building correct software from a clean sheet is not the hard part. “Maintaining software systems in the field, absorbing large changes or additions in the middle of development cycles, and reconfiguring software systems to ‘fit’ never-quite-identical vehicles or missions are our real problems today.”

Software is the last subsystem to stabilize — and the first place where problems in other subsystems get fixed. “Software by its nature is the easiest place to correct problems — but by that very nature, it becomes a tyrant to its users and a tenuous and murky unknown to the analysts.”

The bug was not flight-critical. The PASS operated correctly in every respect. Only the BFS synchronization was affected. But the inability to initialize the backup system 20 minutes before a crewed launch is exactly the kind of problem that erodes confidence in ways no reliability analysis can quantify.