
AH Tech Talk: Google's Cloud Crash Bad News

On Monday, April 11, 2016, at about 7 PM Pacific Time, Google's cloud service, Google Compute Engine, went completely dark. The outage lasted almost twenty minutes and hit every region at once: no Google Compute Engine customer anywhere in the world could reach the service, meaning countless server backends were rendered useless for the duration. Connections dropped, requests timed out and all the other trappings of a backend outage visited Google Compute Engine's customers. This happened despite Google having systems in place to prevent exactly this sort of thing, including safeguards at both the hardware and software level, as well as multiple layers of redundancy. Naturally, the incident put serious egg on Google's face at a time when the company is struggling to gain ground on Microsoft Azure and Amazon Web Services. For an outage of this scale to occur, something major had to have happened, and as it turns out, Google published a lengthy explanation of exactly what that was.

Along with an apology and an assurance that they knew what had happened and that Google Compute Engine was not at risk of another outage, Google published a status report showing exactly how everything went down and the steps taken to fix it and to prevent it, or other crashes, from happening again. In the report, Google walks through how its cloud networking works. Essentially, the services announce their nodes for customers to find through standard internet routing protocols. These announcements are made through multiple clusters of internet addresses, letting customers' machines find the shortest and most reliable path to a Google server. One of these clusters wasn't active; any remote machine that tried to connect to it would simply be rejected and sent looking for a different node. Since that caused some network inefficiency, Google told the control program to remove the cluster from all of the configuration files in the network. The main configuration file was updated, but another on the same machine was not, resulting in an inconsistency. Rather than simply pushing the new configuration or reverting to the last known good configuration as it was programmed to do, the control software hit a bug that caused it to generate a new configuration with no IP clusters at all and push that empty configuration across the entire system. With no clusters being announced, customers' machines couldn't find a place to hook into Google's servers, leaving them unable to connect. Since the announcements were only needed for outside connections, none of Google's other cloud services or mainline services went down; they could still talk to the servers internally. The bug, as it turns out, also slipped past a failsafe in the software that would normally have caught the faulty configuration before it left the first machine.
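To make that failure mode a little more concrete, here is a minimal, purely hypothetical sketch in Python of how a configuration rebuild can come back empty and why a last-known-good fallback matters. None of the names or data structures below come from Google's actual tooling; they are assumptions for illustration only.

```python
# Hypothetical sketch of the failure mode described above; all names and
# data structures are illustrative, not Google's actual systems.

LAST_KNOWN_GOOD = ["203.0.113.0/24", "198.51.100.0/24", "192.0.2.0/24"]

def rebuild_announcements(site_configs):
    """Rebuild the list of IP prefixes to announce from per-site configs.

    Intended behavior: drop only the inactive site and keep announcing the
    rest. The bug described in the post-mortem amounted to the rebuilt list
    coming back empty and still being pushed network-wide.
    """
    prefixes = []
    for site in site_configs:
        if site.get("active"):
            prefixes.append(site["prefix"])

    if not prefixes:
        # The safeguard that was missing in practice: an empty announcement
        # list should never propagate; fall back to the last good config.
        return LAST_KNOWN_GOOD
    return prefixes

# An inconsistent partial update makes every site look inactive, so the
# rebuilt list comes back empty and the fallback kicks in instead.
sites = [{"prefix": p, "active": False} for p in LAST_KNOWN_GOOD]
print(rebuild_announcements(sites))  # falls back rather than going dark
```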

Not long after the bad configuration file started making the rounds, Google began receiving alerts. First, a southeast Asian site dropped out as the total number of announcements fell, leaving it with nothing to piggyback off of. From there, Google started receiving reports of high latency as users were directed to servers further away than their normal connection points. Just under an hour after the configuration that stopped the announcements was first pushed, Google engineers had traced it and knew exactly what had happened. Less than twenty minutes later, all announcing servers had gone quiet, triggering the outage. As soon as they had figured out what happened, Google's engineers threw the switch to revert all servers to the last known good configuration. That rollback took about twenty minutes to reach every server, which is roughly how long the outage lasted. In short, the bug took hold, spread to all announcing servers, was found and squashed, and the servers were brought back online with the last known good configuration, all in less than two hours' time, which is lightning fast for a worldwide server network. From there, all eyes turned to the bug in the configuration control software that started the whole mess.
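The recovery itself boils down to re-pushing the last known good configuration and waiting for it to reach every server. Here is a minimal sketch of that step; the server interface (push_config, current_config), timings and polling model are all assumptions for illustration, not Google's actual rollback machinery.

```python
# Illustrative rollback sketch: push the last known good configuration to
# every announcing server and wait for all of them to confirm it. The real
# rollout described above took roughly twenty minutes to propagate.
import time

def revert_all(servers, last_known_good, poll_interval=1.0, timeout=1200):
    """Push last_known_good to each server, then poll until every server
    reports that config or the timeout (in seconds) expires."""
    for server in servers:
        server.push_config(last_known_good)  # hypothetical interface

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if all(s.current_config() == last_known_good for s in servers):
            return True  # every server is back on the good config
        time.sleep(poll_interval)
    return False  # propagation did not complete in time
```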

The bug in the configuration controller was squashed and the software reworked, eliminating any possibility of the exact same failure happening again. To guard against bugs of a similar nature, Google's engineers had, at the time of the report's publishing, already made 14 sweeping changes to the network. The biggest of those were a new software layer that monitors all servers and watches for any that stop announcing, and a system that double-checks every configuration before it is applied to the machine that generated it, since this bug slipped past the failsafe that normally applies the configuration and then checks that machine for continued function. After a configuration is applied, announcements and network paths are also checked in detail for consistency, even when everything looks relatively normal. In essence, this bug and similar ones capable of taking out the entire network are now safeguarded against, with protection against some forms of cyber attack thrown in as a welcome side effect.
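As a rough sketch of what that kind of pre-application check could look like, the snippet below validates a new announcement config before it ever leaves the machine that generated it, rather than applying it first and verifying afterwards. The function, its parameters and the threshold are hypothetical; this is not Google's actual safeguard.

```python
# Hypothetical pre-application validation: reject a config that announces
# nothing, or that silently drops a large share of current announcements.

def validate_config(new_prefixes, current_prefixes, max_drop_fraction=0.5):
    """Raise before a suspicious config can be pushed anywhere."""
    if not new_prefixes:
        raise ValueError("refusing to push a config with no announcements")

    dropped = set(current_prefixes) - set(new_prefixes)
    if current_prefixes and len(dropped) / len(current_prefixes) > max_drop_fraction:
        raise ValueError(
            f"config drops {len(dropped)} of {len(current_prefixes)} prefixes; "
            "manual review required"
        )
    return True

current = ["203.0.113.0/24", "198.51.100.0/24", "192.0.2.0/24"]
validate_config(["203.0.113.0/24", "198.51.100.0/24"], current)  # one removal passes
# validate_config([], current) would raise before an empty config could spread.
```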

Naturally, this incident dented consumer confidence in both Google and the concept of a cloud backend. To run some damage control, Google included in the diagnostic announcement and apology a promise to give customers a bill credit greater than the amount their contracts specify for outages of this kind. Google also stated in the report that it takes this kind of incident extremely seriously and will keep working to prevent a repeat, with both proactive changes to the network software and preventive maintenance. With the incident safely behind them, Google's cloud kept ticking on, but the fact that a service marketed on reliability and agility went completely dark worldwide for nearly twenty minutes is a serious blow to Google's chances at IaaS supremacy.