Resiliency and Operations Lessons from the Apollo Missions to the Moon
Abstract
July 2019 marked 50 years since the first humans set foot on the moon.
Moments before the historic landing of Apollo 11, the spacecraft computer started throwing errors and restarting every few seconds. The astronauts and mission control had to make split-second decisions and trust in the resiliency and reliability of the systems.
There are many lessons to be learned from NASA’s efforts, and in this session I will present a selection of them in the domains of resilience, reliability, performance and availability.
After each mission, all the problems that occurred were collected as Flight Anomalies. They were then analyzed, causes were found and remediation was prepared so the problems would never recur.
I will tell a number of stories about Flight Anomalies from the Apollo era and explain why they remain relevant as learning stories for the modern era of computing.
These anecdotes cover both software issues (the rebooting computer) and hardware issues (the Instrument Unit, the computer brain of the Saturn V rocket, survived a lightning strike during launch!).
Speaker
Robert is a Senior Managing Consultant and member of the IBM Garage Solution Engineering group. Within the worldwide Garage Solution Engineering group, he is part of the Cloud Service Management and Operations (CSMO) team, working in all fields of CSMO and specializing in Site Reliability Engineering (SRE) and Chat Operations (ChatOps).
Robert joined IBM in 2007 and has held various positions at IBM, all in the field of Service Management. In total, he has over twenty years of experience in enterprise systems across multiple domains, spanning development, technical leadership, project management and offering management.
Robert speaks at global conferences for IBM and creates assets that range from internal documentation to published books.