Can Cloud Services Be Reliable After Google’s Major Power Outage?

October 25, 2024

On October 24, a significant power outage hit Google Cloud’s europe-west3 region in Frankfurt, Germany, causing a disruption of 12 hours and 39 minutes that began at 02:30 local time and ended at 15:09. The outage affected a wide range of services: virtual machine (VM) creation failed, deletions were delayed, and some instances became inaccessible. The root cause was identified as a power failure and cooling issue that shut down parts of one of the region’s three zones, europe-west3-c. The incident underscores the vulnerabilities that cloud services still face despite the advanced technology behind them.

The effects of the outage were felt across multiple Google Cloud offerings, including Cloud Build, Cloud Developer Tools, Cloud Machine Learning, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Pub/Sub, Google Compute Engine, Google Kubernetes Engine, Persistent Disk, and Vertex AI Batch Prediction. Users experienced significant disruptions: VM creations failed, Google Kubernetes Engine nodes became inaccessible, and Persistent Disk instances were unreachable. Cloud Dataflow users faced delays in scaling workers and stalled job progression. Most Google Cloud Dataproc cluster creation attempts failed, and Cloud Build users encountered long wait times when starting custom worker pools. The interruption affected businesses and developers alike, highlighting the need for robust redundancy and contingency measures.

While Google engineers eventually implemented a fix to restore operations, multi-zonal issues persisted, particularly affecting Vertex AI Batch Prediction. This failure impaired the infrastructure needed for serving predictions, creating more complications for users relying on these services. Communication from Google began 26 minutes after the outage started, but it was almost three hours into the disruption before any workaround was suggested. Users were advised to migrate workloads to other regions or zones and to take regular snapshots of degraded regional persistent disks. The effectiveness of this advice, however, may have been limited by the time required to execute such migrations, particularly in high-demand operational environments.
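The workaround of moving workloads to another zone can be illustrated with a minimal sketch. It is not Google's tooling and calls no Google Cloud API: the function name pick_fallback_zone and the hard-coded set of degraded zones are hypothetical, and the article names only europe-west3-c, so the other two zone names are assumed here. The sketch only shows the selection logic a deployment script might apply before creating new VMs.

```python
from typing import Iterable, Optional, Set

# Zone list for the Frankfurt region. Only europe-west3-c is named in the
# article; the -a and -b names are assumed for illustration.
REGION_ZONES = ["europe-west3-a", "europe-west3-b", "europe-west3-c"]


def pick_fallback_zone(
    preferred: str, zones: Iterable[str], degraded: Set[str]
) -> Optional[str]:
    """Return the preferred zone if it is healthy, otherwise the first
    healthy zone in `zones`, or None when every zone is degraded."""
    if preferred not in degraded:
        return preferred
    for zone in zones:
        if zone not in degraded:
            return zone
    return None


if __name__ == "__main__":
    # Hypothetical health state mirroring the incident: zone c is degraded.
    degraded_zones = {"europe-west3-c"}
    target = pick_fallback_zone("europe-west3-c", REGION_ZONES, degraded_zones)
    print(target)
```

In practice the degraded set would come from a status feed or health checks rather than a constant, and a None result would signal that the workload has to move to a different region altogether.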

A Spotlight on Vulnerabilities and Lessons Learned

On October 24, a major power outage struck Google Cloud’s europe-west3 region in Frankfurt, Germany, disrupting services for 12 hours and 39 minutes from 02:30 to 15:09 local time. This incident primarily affected europe-west3-c, one of the region’s three zones, leading to failures in virtual machine (VM) creation, delays in deletions, and inaccessible instances. The root cause was identified as a power failure and cooling issue that forced partial zone shutdowns, highlighting vulnerabilities in cloud services.

The outage had widespread effects across various Google Cloud services, such as Cloud Build, Cloud Developer Tools, Cloud Machine Learning, Google Compute Engine, Google Kubernetes Engine, Persistent Disk, Vertex AI Batch Prediction, and more. Users experienced significant disruptions: VM creations failed, Kubernetes Engine nodes went down, and Persistent Disk instances were unreachable. Cloud Dataflow users faced worker scaling delays and job progression issues, while Cloud Dataproc cluster creation efforts largely failed. Cloud Build users had long wait times to start custom worker pools, affecting many businesses and developers.

While Google engineers eventually fixed the issues, multi-zonal problems persisted, notably affecting Vertex AI Batch Prediction. Communication from Google arrived 26 minutes after the outage began, but almost three hours passed before a workaround was suggested. Users were advised to migrate workloads to other regions or zones and to take regular snapshots of degraded regional persistent disks, though the time such migrations take may have limited how useful this advice was in high-demand environments.
