Designing Robust Software

by Jessica Brown

Rhymes With Orange Strip (1-2-2001) Y2K in Review... We should know better than to think we can predict WHEN a computer will crash or for WHAT reason.

What does it mean for software to be robust?


  • Capable of performing without failure under a wide range of conditions (Merriam-Webster)
  • Performs well not only under ordinary conditions but under unusual conditions that stress its designer’s assumptions (The Linux Information Project)
  • Accounts for and protects against certain classes of software (and/or hardware) errors or failures

Example: Unix is often said to be robust because it can operate for prolonged periods without crashing or requiring rebooting, and if individual programs crash, they usually do so without affecting other programs or the operating system.
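
As a rough illustration of the isolation described above, the following Python sketch runs a deliberately failing snippet in a separate operating-system process. The helper name run_isolated and the snippets passed to it are hypothetical, chosen only for this example. The point is that the parent sees an exit code instead of crashing along with the child.

```python
import subprocess
import sys


def run_isolated(code: str) -> int:
    """Run a Python snippet in its own OS process and return its exit code.

    Hypothetical helper for illustration only.
    """
    # capture_output=True keeps the child's traceback out of our own output.
    result = subprocess.run([sys.executable, "-c", code], capture_output=True)
    return result.returncode


# The first child dies with an unhandled exception; the parent keeps running
# and simply observes a non-zero exit code.
crashed = run_isolated("raise RuntimeError('boom')")
healthy = run_isolated("print('still fine')")

print("crashed child exited cleanly:", crashed == 0)
print("healthy child exited cleanly:", healthy == 0)
```

The failure of one child has no effect on the next child or on the parent, which is the property the Unix example is praising.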

Instead, software is often:

  • buggy (error-prone)
    • doesn’t work how it “should” or hasn’t been tested thoroughly for correctness of the design and implementation
    • “This is in large part because programs are usually too big and too complicated for a single human mind to comprehend in their entirety, and thus it is difficult for their developers to be able to discover and eliminate all the errors, or to even be certain as to what extent of errors exist. This is especially true with regard to subtle errors that only make their presence known in unusual circumstances.” (The Linux Information Project)
  • brittle (fragile)
    • often an issue of scalability or unanticipated workloads (e.g., the Slashdot effect) – will your program be brought to its knees in this case?

Tactics to Deliver Robustness:

  • Use a Good Design
    • Mostly Business Cartoon: “It was my punishment for writing incomprehensible HTML” (pictured: man has feet where his arms should be and a hand where his feet should be).

      Design for maintenance – Code that is easy to read (designed for comprehension) and/or written in a common, mainstream programming language will be easier to maintain if the original experts are no longer available. If your lead designer got hit by a bus tomorrow, would you be able to figure out how the code works?

    • Use Modularity – small modular programs are easier to comprehend and correct than larger ones that do many things. Breaking code into modular chunks makes each one more likely to be stable. Don’t write spaghetti code.
    • Simplicity – being able to visualize and comprehend how code works makes it much easier to identify all the potential situations that might be encountered. This is one of the reasons robustness should be designed and planned into the software from the start rather than retrofitted after the fact.
    • Avoid special cases – whenever possible, write general code that can accommodate a wide range of situations without a lot of special cases. Code that handles special cases is rarely executed and rarely tested, so it is more likely to be buggy.
    • Good business rules – applications need to process transactions correctly not just in terms of how but also in terms of what is expected and allowed. Evaluate the program’s correctness according to what is expected.
    • Choose an appropriate language – decide whether the language the code will be written in is suitable for the problem at hand and for its robustness needs. Some languages are inherently more suited to robust applications than others.
      • Strong typing – A language with strong typing, especially when types are checked statically, can increase robustness by identifying constraint violations at compile time rather than at runtime. Strongly typed languages (especially if they prevent evading the type system) allow for considerably more compile-time checking against accidental mistakes in your code than weakly typed languages.
      • Avoid direct memory access – Bad pointer math or memory arithmetic can crash things. Java throws an exception if you try to access a non-existent array element, whereas languages such as C++ do not check bounds on raw arrays, so a bad index may silently read or corrupt memory.
      • Conciseness – Does the language reduce how much boilerplate code you have to write? The fewer lines of code you have to write to implement a given feature, the fewer places there are for a mistake to be introduced.
      • Is the language itself buggy or thoroughly tested and stable?
      • Does the language provide good testing tools, and does it give testing packages the access to the code that they need?
    • Anticipate and isolate concerns in the original design – reduce risk by anticipating areas that could cause failure and isolating those problems as much as possible.
  • Prevent Cascading Failures – one bad piece of code shouldn’t break everything else.
    • Mostly Business Cartoon: Man on tall ladder adding another tally under preventable accidents, meanwhile unpreventable accidents down low is blank.

      Build in Safeguards

      • Process Isolation – don’t let one application overwrite/corrupt another application or the OS’s memory
      • Permissions – use appropriate permissions for accessing files, directories, etc., to make it more difficult for sloppy or malicious code to affect other parts of the system.
      • Avoid Global Variables – variables should be as localized as possible. The fewer pieces of code that have access to a variable, the less code there is that can incorrectly modify it.
      • Variable Validation before Assignment – use setter methods that validate rather than blindly assigning values.
      • Use defensive programming – check thoroughly that input contains valid data. Be strict about outputs and forgiving about inputs: accept reasonable variations in input, but never pass invalid data along.
    • Design for Fault tolerance – avoiding catastrophic failures
      • Detect and recover from errors
        • In general, an error can be handled by terminating or by resuming.
        • Ignore the operation that generated the error
        • Use a pre-defined response for what should happen in that error case
      • Check for invalid inputs and handle gracefully (don’t crash)
      • Don’t propagate errors – Good exception handling prevents errors from cascading all the way up to crashing your application. Instead, catch and handle errors as close to the source of the problem as possible.
    • Loose Coupling – Reduce dependencies on other things that could break. Use loosely coupled, highly cohesive packages.
  • Thorough Testing
    • Visual Code Review (more eyes are better) – errors are more likely to be found (and corrected) in code that is reviewed by a diverse set of programmers than in code written by a single person under deadline pressure. Being open source where possible can be a boon.
    • Use test-suites and regression tests to identify potential problems.
    • Use static analysis to identify defects at compile time
      • use lint-type tools to identify “stupid but legal” code
  • Expect the Unexpected – Consider the different parts of your system that could cause exceptions, crashes, or other instability, and plan for the what-ifs.
    • Consider Hardware crashes – Handle hardware failures, OS crashes, etc. gracefully. For example, autosave keeps users from losing their work if the app crashes.
    • Redundancy – Avoid single point of failure problems AND handle such failovers gracefully.
    • Check for and handle exceptions
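
To make the “Variable Validation before Assignment” tactic above more concrete, here is a minimal Python sketch of a setter that validates before it assigns. The Thermostat class, its attribute names, and the 5 to 35 degree range are all assumptions invented for this illustration; they are not taken from any particular system.

```python
class Thermostat:
    """Hypothetical class whose setter validates before assigning."""

    MIN_C = 5.0   # assumed lower bound, for illustration only
    MAX_C = 35.0  # assumed upper bound, for illustration only

    def __init__(self, target_c: float = 20.0) -> None:
        # Assigning through the property means the constructor is validated too.
        self.target_c = target_c

    @property
    def target_c(self) -> float:
        return self._target_c

    @target_c.setter
    def target_c(self, value: float) -> None:
        # Reject non-numeric values (bool is excluded even though it is an int).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"target_c must be a number, got {type(value).__name__}")
        # Reject out-of-range values; NaN also fails this comparison.
        if not (self.MIN_C <= value <= self.MAX_C):
            raise ValueError(f"target_c must be between {self.MIN_C} and {self.MAX_C}")
        self._target_c = float(value)


thermostat = Thermostat()
thermostat.target_c = 22
try:
    thermostat.target_c = 100  # out of range: rejected, old value is kept
except ValueError as err:
    print("rejected:", err)
print("current target:", thermostat.target_c)
```

Because the bad value is rejected before assignment, the object never enters an invalid state, so no later code has to cope with one.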

“Exception handling code can be difficult to represent in terms of design and documentation, largely because it generally falls outside normal program flow, and can occur at virtually any point in a program.”
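
One way to keep exception handling manageable is to catch the error close to its source and fall back to a pre-defined response, as suggested in the fault-tolerance tactics above. The sketch below is only an illustration: the function parse_port and the default of 8080 are hypothetical, not part of any real application.

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_PORT = 8080  # hypothetical pre-defined response for bad input


def parse_port(raw, default: int = DEFAULT_PORT) -> int:
    """Convert user-supplied text to a TCP port, resuming with a default on error."""
    try:
        port = int(raw.strip())
    except (ValueError, AttributeError):
        # ValueError: text is not a number; AttributeError: raw is not a string.
        logger.warning("invalid port %r, using default %d", raw, default)
        return default
    if not 1 <= port <= 65535:
        logger.warning("port %d out of range, using default %d", port, default)
        return default
    return port


for candidate in ["443", " 80 ", "abc", "70000", None]:
    print(repr(candidate), "->", parse_port(candidate))
```

The caller never sees an exception from bad input, so the error cannot cascade upward. Whether to resume with a default like this or to terminate is a design decision that depends on how harmful a wrong value would be.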

When to Build Robust Software

Design Goals are not always for Robustness:

Rapid Application Design

Good for:

  • Proof of concept
  • Assessing feasibility
  • Demos/things that aren’t built to last

Typically faster to implement (less worrying about the possible “what ifs” of what could go wrong)

Robust Design

Good for code that needs to:

  • work under all circumstances
  • be stable and not crash
  • handle unexpected cases gracefully

Requires more thorough testing to verify that as many errors as possible are avoided