
Leveson & Turner (1993): The Therac-25 Accidents

Between June 1985 and January 1987, the Therac-25 — a computer-controlled medical linear accelerator — massively overdosed at least six patients with ionizing radiation. Three died. The machine had replaced the hardware safety interlocks of its predecessor, the Therac-20, with software checks. The software had race conditions that the Therac-20’s hardware had masked. When the hardware was removed, the software bugs became the only thing between a 25 MeV electron beam and the patient.

Nancy Leveson and Clark Turner published this investigation in 1993, drawing on FDA records, manufacturer correspondence, operator testimony, and source code analysis. The paper is among the most cited works in the safety-critical software literature. It established the principles that now underpin software safety standards across medical, aviation, and nuclear industries.

The Therac-25 was a medical linear accelerator that produced therapeutic radiation in two modes:

| Mode | Energy | Beam Path | Safety Requirement |
| --- | --- | --- | --- |
| Electron | 5-25 MeV | Direct electron beam through scanning magnets | Magnets must spread beam across treatment area |
| X-ray | 25 MeV | Electrons hit tungsten target; X-rays pass through flattening filter | Target and filter must be in beam path |

A stainless steel turntable rotated to place the correct hardware in the beam path for each mode. In X-ray mode, the tungsten target absorbed the 25 MeV electrons and produced X-ray photons; the flattening filter shaped the resulting beam. In electron mode, scanning magnets spread the lower-energy beam across the treatment field.

The critical danger: if the machine fired a 25 MeV beam with the turntable in the electron position, the patient would receive the full concentrated electron beam with no target to convert it to X-rays and no flattening filter to spread it. The resulting dose could be a hundred times the intended therapeutic level.

From Therac-20 to Therac-25: Removing the Safety Net


The Therac-20 had the same dual-mode design and much of the same control software. But the Therac-20 also had independent hardware interlock circuits — electromechanical devices that physically verified turntable position, beam energy, and mode consistency before allowing beam activation. These interlocks operated independently of the computer. If the software commanded the wrong configuration, the hardware would refuse to fire.

The Therac-25 removed the hardware interlocks. The PDP-11/23 computer that controlled the beam was now solely responsible for verifying that the machine configuration matched the selected treatment mode. A single software fault could simultaneously command the wrong configuration and fail to detect it.

This is the same structural failure as Ariane 5. The Ariane 4’s SRI software had an unprotected 64-to-16-bit conversion that was safe because Ariane 4’s trajectory kept the value within range. When the software was reused in Ariane 5 — where the trajectory was different — the protection was absent and the conversion overflowed. In both cases, hardware properties of the predecessor system masked software defects, and the software was reused in a successor system where those properties no longer held.

The Therac-25 ran on a 32K-word PDP-11/23 under a custom real-time executive written in PDP-11 assembly language. The executive scheduled concurrent tasks by priority and handled interrupt-driven I/O. The key tasks:

| Task | Function |
| --- | --- |
| Keyboard handler | Reads operator input from VT-100 terminal, updates parameter table |
| Treatment monitor (Treat) | Sequences beam setup: commands turntable, sets energy, verifies consistency, fires beam |
| Housekeeper | Background display updates and status monitoring |
| I/O handlers | Beam energy control, dose monitoring, turntable position sensing |

The keyboard handler and the treatment monitor operated concurrently on shared data structures — the parameter table and various setup flags. There was no mutual exclusion protecting these shared variables. Both tasks could read and write the same memory locations, with scheduling determined by the real-time executive’s interrupt-driven priority scheme.
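The hazard described above — two tasks reading and writing shared fields with no mutual exclusion — can be sketched in a few lines. This is an illustrative model, not the Therac-25's PDP-11 code; all names are invented, and the lock shows the protection the Therac-25 lacked.

```python
import threading

# Toy model (invented names, not AECL's code): a shared parameter
# table written by a "keyboard" task while a "treatment" task reads it.
# The lock makes the two-field update atomic. Without it, a reader
# could observe mode == "electron" while energy_mev still holds the
# X-ray value -- exactly the kind of inconsistent snapshot
# unprotected shared state permits.

params = {"mode": "xray", "energy_mev": 25}
lock = threading.Lock()

def keyboard_edit_to_electron():
    with lock:                      # both fields change together
        params["mode"] = "electron"
        params["energy_mev"] = 10

def treatment_snapshot():
    with lock:                      # copy taken under the lock
        return dict(params)

t = threading.Thread(target=keyboard_edit_to_electron)
t.start()
t.join()
snap = treatment_snapshot()
# With the lock, only the two consistent states are observable.
assert snap in ({"mode": "xray", "energy_mev": 25},
                {"mode": "electron", "energy_mev": 10})
```

The Therac-25 had neither a lock nor any equivalent discipline, so a snapshot halfway through an edit was a legal program state.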

The overdose mechanism behind the Tyler incidents was a race between operator input and beam setup.

The operator entered treatment parameters on a VT-100 terminal: mode (X-ray or electron), energy level, dose, and field geometry. When all fields were complete, the system displayed them for verification. The operator could use cursor keys to edit any field, then press a key to proceed to treatment.

The treatment monitor task (Treat) ran a multi-step setup sequence: read the parameter table, command the turntable to the correct position, set the beam energy, verify consistency across all parameters. A set of shared flags tracked the progress of this sequence.

The race:

  1. Operator selects X-ray mode, enters all parameters, presses proceed.
  2. Treat begins X-ray setup: commands the turntable to the X-ray position (tungsten target in beam path), sets 25 MeV beam energy.
  3. Operator notices an error in one of the fields. Uses cursor keys to navigate to the field. Types a correction. Changes the mode from X-ray to electron. Presses proceed again. Total elapsed time: under 8 seconds.
  4. The keyboard handler writes the new mode (electron) to the parameter table and resets some setup flags — but not all of them.
  5. Treat detects the parameter change and starts a new setup pass. But because certain flags were not fully reset, Treat does not re-command the turntable. It proceeds with the turntable still in (or moving toward) the X-ray position — while the parameter table now says “electron mode.”
  6. The beam fires at 25 MeV with no target and no scanning magnets in the path. The patient receives the full concentrated beam.

The timing dependency was critical. If the operator took more than about 8 seconds to make the correction, Treat completed its first setup pass, detected the parameter change, and performed a full reset — the system worked correctly. Only when the operator edited and re-confirmed within the setup cycle’s timing window did the inconsistent flag state produce the wrong turntable command.
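The six-step race above reduces to a deterministic toy model. Nothing here is the actual assembly code — the class, the flag name, and the mode strings are invented — but the structure mirrors step 5: the edit handler resets the parameter but not the `turntable_done` flag, so the second setup pass skips the turntable command.

```python
# Simplified model (not the actual PDP-11 code) of the flag bug in
# steps 1-6 above. The keyboard handler clears some setup state but
# not turntable_done, so a second setup pass skips the turntable
# command it believes is already complete.

class Machine:
    def __init__(self):
        self.turntable = None          # physical position
        self.mode = None               # parameter table entry
        self.turntable_done = False    # shared setup flag

    def keyboard_edit(self, new_mode):
        self.mode = new_mode
        # BUG: the handler does not clear turntable_done.

    def setup_pass(self):
        if not self.turntable_done:    # skipped if flag was left set
            self.turntable = self.mode
            self.turntable_done = True

m = Machine()
m.keyboard_edit("xray")
m.setup_pass()                 # turntable moves to the X-ray position
m.keyboard_edit("electron")    # fast edit inside the timing window
m.setup_pass()                 # flag still set: turntable NOT moved
assert m.mode == "electron" and m.turntable == "xray"  # mismatch
```

A slow edit corresponds to Treat finishing its pass and performing a full reset (clearing `turntable_done`), after which `setup_pass` would move the turntable correctly.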

Fast, experienced operators were the most likely to trigger the failure. The Tyler operator had developed efficient editing habits from months of routine use; that speed fell consistently within the critical window.

A separate software defect caused the second Yakima incident. A subroutine in the beam-setup verification routine used a one-byte counter that incremented on each pass through the test cycle. Every 256th pass, the counter overflowed from 255 to 0, and the test effectively evaluated to “safe” regardless of actual machine state. If the turntable happened to be in the wrong position on that specific pass, the software would not detect the mismatch.
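The overflow is easy to reproduce in miniature. The function below is an invented stand-in for the verification loop: a counter wraps modulo 256, and a value of zero is wrongly indistinguishable from “consistent,” so the check is silently skipped on the wrap pass.

```python
# Sketch of the one-byte counter defect (invented stand-in, not the
# real routine): the flag is incremented each pass instead of being
# set to a constant, and wraps modulo 256. On the wrap-to-zero pass,
# a zero flag reads as "no check needed", so the position test is
# silently bypassed.

def passes_with_check_skipped(n_passes):
    counter = 0
    skipped = []
    for i in range(n_passes):
        counter = (counter + 1) & 0xFF   # one-byte wraparound
        if counter == 0:                 # 0 means "consistent" here
            skipped.append(i)            # check never runs this pass
    return skipped

# Every 256th pass the test is bypassed regardless of machine state.
print(passes_with_check_skipped(600))    # -> [255, 511]
```

The correct fix is to set the flag to a fixed nonzero value rather than increment it — then no pass can ever alias to “safe.”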

The probability on any single treatment was approximately 1/256, multiplied by the probability of a concurrent turntable positioning error. But the Therac-25 administered thousands of treatments. Over its operational lifetime, the compound probability made the event inevitable.
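The lifetime argument can be made concrete with the standard at-least-once formula, 1 − (1 − p)^n. The numbers below ignore the turntable-error factor, so they bound only the wrap-pass coincidence itself.

```python
# Back-of-envelope for "rare per treatment, near-certain over a
# lifetime": if the hazardous pass occurs with probability p per
# treatment, the chance of at least one occurrence over n treatments
# is 1 - (1 - p)**n.

def prob_at_least_once(p, n):
    return 1 - (1 - p) ** n

p = 1 / 256            # wrap pass alone, ignoring the turntable factor
print(prob_at_least_once(p, 1))      # ~0.004 for a single treatment
print(prob_at_least_once(p, 1000))   # ~0.98 over a thousand treatments
```

A per-treatment risk that looks negligible in a safety case becomes a near-certainty across a machine's service life.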

Six incidents occurred across four sites over 20 months. The first two (Kennestone, Georgia and Hamilton, Ontario in 1985) were never definitively attributed to a specific software fault — the machines were returned to service before thorough investigation. The next four (Yakima, Washington and Tyler, Texas in 1985-1987) involved confirmed overdoses, and three of those patients died from radiation injuries.

A consistent pattern ran through the incidents: the machine displayed a “Malfunction 54” error code when it detected an inconsistency, but operators had learned to treat these messages as routine nuisances. The error messages appeared frequently during normal operation, and the standard practice was to press “P” (proceed) to clear them. The error display for a lethal overdose was visually identical to the display for a routine transient. In some cases, the operator pressed proceed and the machine fired again, delivering a second overdose.

After each incident, AECL (Atomic Energy of Canada Limited) responded with targeted patches. After Hamilton, they added a hardware microswitch on the turntable. After Yakima, they limited cursor editing speed to widen the timing window. After the first Tyler incident, they added a dose monitor and disabled cursor editing during setup. Each fix addressed the specific trigger of the most recent incident without analyzing the software architecture for the class of defect that produced it.

AECL repeatedly assured operators and regulators that the Therac-25 “could not” overdose patients. After the Hamilton incident, AECL’s response letter stated that the machine’s safety systems made an overdose “not possible.” This was false — the hardware interlocks that would have made it true had been removed.

It took six incidents, three deaths, and direct FDA intervention before AECL performed a comprehensive safety analysis of the software. The final corrective action — retrofitting independent hardware interlocks on all Therac-25 units — restored the safety layer that had been designed out of the machine.

The FDA’s oversight of medical device software in the 1980s was, as Leveson describes it, in its infancy. The regulatory framework focused on hardware: materials, mechanical safety, electrical safety, radiation output calibration. Software was treated as a component of the device, not as an independent safety-critical system requiring its own review.

The FDA had no established process for reviewing source code, requiring formal software specifications, or mandating independent software safety analysis for medical devices. AECL’s premarket safety analysis assigned a software failure probability of 10⁻⁴ per treatment — a number derived by applying hardware reliability analysis techniques to software. This is a category error: hardware fails stochastically (a component degrades), while software fails deterministically (the same inputs always produce the same outputs). A software bug does not have a “failure rate” — it has triggering conditions that either occur or do not.
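The category error can be demonstrated with a deliberately buggy function (hypothetical, not the Therac-25 logic): its measured “failure rate” is entirely a property of the inputs it is fed, not of the code.

```python
# Why a per-treatment "failure probability" is a category error for
# software: a bug fails deterministically on its trigger inputs.
# Sampling inputs yields an apparent "rate", but that number measures
# the input distribution, not the software -- change the operators'
# habits and the "rate" changes while the code stays identical.

def buggy_check(edit_seconds):
    # Hypothetical defect: the check is wrong for any edit completed
    # inside an 8-second window. True means a "safe" verdict.
    return edit_seconds >= 8.0

slow_ops = [10.0, 12.0, 9.5]   # never hit the trigger condition
fast_ops = [3.0, 4.5, 2.0]     # always hit it

assert all(buggy_check(t) for t in slow_ops)      # looks 100% reliable
assert not any(buggy_check(t) for t in fast_ops)  # same code, 100% failure
```

The same binary yields a 0% or 100% failure rate depending on who operates it — which is why a single probability number cannot characterize a software defect.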

The Therac-25 incidents were a direct catalyst for the FDA’s subsequent development of software review guidelines for medical devices, culminating in standards that now require hazard analysis, formal specification, and independent verification for safety-critical medical software.

The paper’s final section identifies systemic causes that extend well beyond any single manufacturer or device:

Overdependence on software. Replacing hardware interlocks with software checks eliminates the independence between the control system and the safety system. When the same processor that commands the beam also verifies the beam configuration, a single fault can defeat both functions. Independent safety systems — hardware interlocks, hardwired shutoffs, independent monitoring — exist precisely because software cannot be trusted to check itself.

Reuse without revalidation. The Therac-20 software was reused in the Therac-25 without re-examining the assumptions that made it safe. The critical assumption — that hardware interlocks would catch any software-induced misconfiguration — was no longer true. This is not an argument against reuse. It is an argument that reuse is a design activity, not a logistics activity, and that inherited software must be revalidated against the actual safety architecture of the new system.

Inadequate testing of concurrent behavior. The race condition required specific timing between two concurrent tasks. Single-threaded functional testing — enter parameters, verify output, check beam configuration — could not exercise the failure mode. Testing concurrent systems requires deliberately probing timing windows, task preemption sequences, and shared-state interactions.
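One way to probe such behavior, sketched below on a toy two-task model (invented, and far simpler than the real executive): enumerate every interleaving of the tasks' steps over shared state and check the safety invariant in each, rather than relying on a single sequential run.

```python
# Interleaving-aware test sketch (toy model, not the Therac-25 code):
# enumerate every ordering of two tasks' steps over shared state and
# assert the safety invariant in each. A single sequential run would
# never exercise the ordering where the edit lands mid-setup.

def interleavings(a, b):
    """Yield all merges of sequences a and b preserving each task's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def make_run():
    state = {"mode": "xray", "turntable": None}
    steps = [
        lambda: state.__setitem__("turntable", state["mode"]),  # setup: command turntable
        lambda: None,                                           # setup: fire (no-op here)
        lambda: state.__setitem__("mode", "electron"),          # keyboard: edit mode
    ]
    return state, steps

violations = 0
for order in interleavings([0, 1], [2]):   # setup steps vs. edit step
    state, steps = make_run()
    for i in order:
        steps[i]()
    if state["turntable"] != state["mode"]:  # safety invariant
        violations += 1

print(violations)   # -> 2 of the 3 orderings violate the invariant
```

Even this tiny model shows the point: the defect lives in specific orderings, so the test harness must generate orderings, not just inputs.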

Confusing reliability with safety. AECL’s safety case treated software reliability (how often does it fail?) as equivalent to software safety (when it fails, what happens?). A system can be highly reliable — failing rarely — and still be unsafe, if the consequences of the rare failures are catastrophic. Safety requires analyzing what happens when things go wrong, not just how often they go wrong.

Uninformative error handling. The Therac-25’s error messages were cryptic codes that operators could not distinguish from routine transients. When every error looks the same, operators learn to ignore all of them. Error reporting in safety-critical systems must distinguish between conditions that are safe to override and conditions that require stopping.

Five Failure Modes at Interface Boundaries


This collection now documents five software failures. Each involves a different mechanism, but all five occur at an interface boundary — between components, between teams, between systems, between the software and the physical world.

| Document | Failure Mode | Interface Boundary |
| --- | --- | --- |
| Lions/Ariane 5 (1996) | Type overflow in reused code | Between Ariane 4 assumptions and Ariane 5 trajectory |
| MCO (1999) | Units mismatch | Between LMA (imperial) and JPL (metric) |
| Garman/Shuttle (1981) | Timing synchronization | Between synchronous BFS and asynchronous PASS |
| Reeves/Pathfinder (1997) | Priority inversion | Between application code and COTS RTOS defaults |
| Leveson/Therac-25 (1993) | Race condition without interlocks | Between operator input task and beam setup task |

The Therac-25 shares the deepest structural parallel with Ariane 5: both are software reuse failures where hardware in the predecessor system masked software defects. In both cases, the successor system removed the hardware protection — by design — and the unmasked software bugs produced catastrophic results.

The Therac-25 extends this collection beyond spacecraft into safety-critical software broadly. The engineering lessons are identical: components that work correctly in isolation can fail at their boundaries; hardware interlocks and software checks serve different safety functions and are not interchangeable; reuse requires revalidation; and testing must exercise the interfaces, not just the components.