    Software Reliability in Interplanetary Probes

Bruce Neufeld
July 14, 1999

GLY4045, Introductory Planetary Science
Professor Jeff Ryan

Introduction

Software is a critical component in the exploration of the solar system. From their earliest missions, interplanetary probes have depended on the proper functioning of the programs in their command, control and communications computers. Many of the most spectacular failures in space missions have been the direct result of errors in computer software. One of the continuing concerns of the computer science community is the improvement of software reliability in the space program. Efforts have focused on the process known as "software engineering", which seeks to develop standards and methodologies that maximize software reliability (Lyu). Many theorists contend that highly complex software, such as that needed to control interplanetary probes, can never be entirely "bug-free" (Radin). Indeed, it is the "bugs" that surface during a mission that grab the attention of the world, when a probe is lost to an unforeseen malfunction in a computer program.

History

The earliest known example of a spacecraft failure directly attributable to software was the loss of Mariner 1. On July 22, 1962, the rocket carrying America's first probe to Venus was destroyed shortly after liftoff because of a misplaced comma in the control program of the Atlas-Agena booster (Gilmore, 1987). Due to poor antenna design, the radio signal from ground controllers was lost and an untested computer program took over control of the rocket. The misplaced comma caused the program to misinterpret minor fluctuations in the vehicle's direction as major ones. A "negative feedback loop" then caused wild fluctuations in rocket direction and necessitated the destruction of the booster and probe (Forester, 1990).

In 1988, the Soviet Union's Mars probe Phobos 1 was lost because its control program was unable to correct erroneous data from ground control (Veldhuizen, 1997). Russian controllers sent a long sequence of commands containing a single error, a plus symbol where a minus symbol belonged. That error was not caught, causing the probe to shut down its pneumatic systems, lose orientation and begin to spin uncontrollably (Failure).

A more recent launch failure attributable to a software glitch was Ariane 5 flight 501. The rocket had to self-destruct 40 seconds into the launch. The problem was traced to a software exception generated by the control program. The cause of the exception was the improper conversion of a 64-bit floating point value to a signed 16-bit integer, which caused an overflow condition (Lions, 1996).

Perhaps the best known interplanetary mission of recent years, the Pathfinder probe, was plagued by software problems. The most serious was that the lander's computer reset itself while performing multiple tasks, and several days' worth of data was lost to these unexpected resets. The resets occurred when a lower priority task blocked a high priority task from running, a condition known as priority inversion. The software team was able to patch the lander's control software in time to carry out the remainder of the mission. The modification of a single command solved the problem (Mars, 1997).

Another planetary survey mission, the Mars Global Surveyor, was nearly scuttled by a software error. An unintended command from a mission control program placed one of the solar arrays into an improper position just before Mars orbital insertion (CNN, 1998). Later the array was discovered to be buckling during an aerobraking maneuver. As a result, the mission had to be modified, which delayed the start of the primary mission of mapping Mars by a full year.

The Galileo mission, now orbiting Jupiter, was nearly lost to a combination of an electrical glitch and a software error. The error placed the spacecraft into "safe mode" and halted observation of Jupiter's moon Europa until further instructions were sent from Earth. The operating system of the main control computer was later modified to recognize the error condition and continue observations without shutting down the spacecraft (Upgraded, 1999).

These are but a small cross section of the problems caused by software error in the short history of interplanetary exploration. Similar stories pervade the rest of the space program, the space shuttle in particular. What these stories point to is the need for greater efforts to improve the reliability of software in the space program.
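
The Ariane 5 conversion failure described above is an instance of a general hazard: a wide floating point value is forced into a narrow integer that cannot hold it. The following Python sketch illustrates that hazard only; it is not a reconstruction of the flight software. The function names and the sample value 40000.0 are assumptions made for the illustration.

```python
import struct

# Range of a signed 16-bit integer.
INT16_MIN, INT16_MAX = -32768, 32767


def unchecked_to_int16(value: float) -> int:
    """Keep only the low 16 bits, as a silent wrap-around conversion would."""
    low_bits = int(value) & 0xFFFF
    # Reinterpret the 16 bits as a signed quantity.
    return struct.unpack("<h", struct.pack("<H", low_bits))[0]


def checked_to_int16(value: float) -> int:
    """Convert only when the value fits; otherwise report the problem."""
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return int(value)


def saturating_to_int16(value: float) -> int:
    """Clamp out-of-range values to the nearest representable limit."""
    return int(max(INT16_MIN, min(INT16_MAX, value)))


if __name__ == "__main__":
    reading = 40000.0  # hypothetical sensor-derived value, larger than 32767
    print(unchecked_to_int16(reading))   # wraps to a negative number
    print(saturating_to_int16(reading))  # clamps to 32767
```

Which of the three behaviors is appropriate depends on the system. The point of the sketch is that the out-of-range case has to be decided deliberately at design time, rather than left to whatever the conversion happens to do.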

Discussion

In general, it has always been the task of computer scientists and engineers in the space program to attempt to eliminate the sources of error from the software that controls spacecraft. While the probe missions to the planets are not directly life threatening, their successors, manned missions to Mars and beyond, will certainly have the added priority of protecting human life. The environment of interplanetary space poses a special challenge to computer hardware and software engineers, because of the high levels of radiation carried by the solar wind and found around some of the planets, such as Jupiter.

An experiment placed on the space shuttle attempts to address computer software reliability in space. Using a Z-80 processor, the experiment, designed by Martin Pechanec, tests the theory that errors generated by high-energy particles in space can be resolved by sophisticated programming techniques as well as hardware error-correction circuitry (Pechanec). Pechanec used a device called a "fault-tolerant controller" to perform on-the-fly correction of space-induced faults in computer memory. Preliminary results indicated that such techniques can solve at least some of the reliability problems in space-borne computers. Just such a problem surfaced in the Galileo probe. A portion of the computer memory used to store and process images in one of Galileo's 18 operational computers was found to have developed a fault (Galileo, 1998). In order to maintain the functional integrity of the computer and to ensure data accuracy, a patch program was written that would avoid using the damaged memory (Up, 1994). Such "on-the-fly" software repair is a common element of programming for interplanetary probe missions.
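
One way to picture on-the-fly correction of memory faults in software is triple redundancy with majority voting. The Python sketch below shows that general technique; it is not a description of Pechanec's controller or of the Galileo patch, whose internals the sources cited here do not detail. The class `TripleStore` and its method names are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each result bit is the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


class TripleStore:
    """Keeps three copies of every byte and repairs them on each read."""

    def __init__(self, size: int) -> None:
        self.copies = [bytearray(size) for _ in range(3)]

    def write(self, address: int, value: int) -> None:
        for copy in self.copies:
            copy[address] = value

    def read(self, address: int) -> int:
        a, b, c = (copy[address] for copy in self.copies)
        value = majority_vote(a, b, c)
        # Write the voted value back so a single upset does not linger.
        self.write(address, value)
        return value

    def flip_bit(self, copy_index: int, address: int, bit: int) -> None:
        """Simulate a particle strike that inverts one bit in one copy."""
        self.copies[copy_index][address] ^= 1 << bit


if __name__ == "__main__":
    store = TripleStore(16)
    store.write(3, 0b10110010)
    store.flip_bit(copy_index=1, address=3, bit=6)  # corrupt one copy
    print(bin(store.read(3)))                       # the vote masks the upset
```

The scheme tolerates a flipped bit in any one copy of a byte. Two copies damaged in the same bit position would outvote the good one, which is one reason such software techniques are paired with hardware error correction.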

From the perspective of the software engineer, the problem of avoiding errors is multifaceted. Every phase of software design and implementation is an opportunity to introduce an error. The software glitch in the Mariner 1 incident was introduced during the specification stage. The formula used to determine velocity variations was supposed to be averaged over a period of seconds. Instead, the specification given to the programming team called for instantaneous velocity measurements (i.e., differences were calculated from sample to sample rather than averaged). The result was the comma in the computer code instead of a period. A simple re-check of the specification would have brought the problem to light and saved the mission (Gilmore).
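
The difference between sample-to-sample and averaged velocity changes can be seen in a small numerical sketch. The Python code below is not the Mariner 1 guidance computation; the sample values, the window length of four and the function names are all assumptions chosen for illustration.

```python
from collections import deque


def instantaneous_changes(samples):
    """Differences between consecutive samples, with no smoothing."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def smoothed_changes(samples, window=4):
    """Differences between consecutive running averages of the samples."""
    recent = deque(maxlen=window)  # holds at most the last `window` samples
    averages = []
    for sample in samples:
        recent.append(sample)
        averages.append(sum(recent) / len(recent))
    return instantaneous_changes(averages)


if __name__ == "__main__":
    # Hypothetical velocity readings: a steady value near 100 with measurement noise.
    readings = [100, 104, 96, 103, 97, 101, 99, 102]
    raw = instantaneous_changes(readings)
    smooth = smoothed_changes(readings)
    print(max(abs(change) for change in raw))     # largest raw jump
    print(max(abs(change) for change in smooth))  # largest smoothed jump
```

With these readings the largest sample-to-sample change is 8, while the largest change between running averages is 2. A controller fed the unsmoothed differences would react to noise as if the vehicle had actually changed speed, which is the kind of overreaction the account above describes.
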

In his exploration of the causes of software defects, Todd Veldhuizen suggests that the best approach to software error detection is to apply multiple methods to the same piece of software, such as prototyping, formal code inspections and field testing (Veldhuizen, 1997). He argues that efforts should be concentrated on the early stages of the software life cycle, especially the specification phase. The approach advocated by Fred Brooks, a noted computer scientist, in his book "The Mythical Man-Month" is radical compared to Veldhuizen's. Brooks suggests that managers of large programming projects should expect to implement a given program twice before arriving at a reasonably bug-free program (Brooks, 1995). Under the heading "plan to throw one away", he argues that it is very rare for the first implementation of a program to be reasonably bug-free, and that a second implementation is almost always closer to optimal. Brooks also states that the design and debugging phases of a complex program are almost universally too short. It is clear from the history of space flight that the software written for interplanetary probes can benefit from the programming practices advocated by Brooks and Veldhuizen, especially in the planning and verification stages.

Conclusions

Throughout the sixty-year history of digital computers, finding and eliminating errors has been the holy grail of programmers. In the space program, glitches in software can have drastic, even catastrophic results. It is unclear whether software errors have been to blame for any fatalities in the space program, but the possibility certainly exists for the future. Tom Forester suggests that we must come to accept that computers will never be completely reliable, even in known life-critical scenarios (Forester, 1990). In spite of this grim prognosis, it is clear that improvements, even radical ones, can and must be brought to bear on the software creation process. The main space shuttle computer has over 500,000 lines of program code, any one of which could cause or contribute to a major malfunction. If the space program is to continue its progress, new and better methods of sifting through the haystack of computer code must be found.

All code, editorial content and images copyright © 1994-2008, Bruce Neufeld, unless otherwise stated.
