Investigating Internet Connectivity Disruption in Houston Datacenter
Incident Report for Green Cloud Defense
Postmortem

Event Start: Thursday, October 3rd, 8:55am Eastern
Event Resolution: Thursday, October 3rd, 9:33am Eastern

Summary

During preparation for IP address space growth in Houston’s IaaS environment, a procedural error was made that led to a disruption in Internet connectivity for partners utilizing virtual routers (e.g., Cisco ASAv).

Timeline of Events

8:00-8:30am - A Green Cloud engineer prepared to add a new external network (in order to provide additional IP address capacity) to the Green Cloud public cloud (IaaS) infrastructure in Houston by following the published Method of Procedure’s (MOP’s) pre-change steps. A VLAN configuration task specified in the MOP was missed during this preparation.
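For context on the missed step, the sketch below illustrates the kind of pre-change check that MOP task represents: confirming that the backing VLAN for a new external network is in place before the NSX change proceeds. The VLAN IDs, network names, and data structures are hypothetical examples for illustration only, not Green Cloud’s actual tooling or configuration.

    # Illustrative sketch only: a generic pre-change check of the kind the missed
    # MOP step represents. All values and structures below are hypothetical;
    # Green Cloud's actual tooling and NSX configuration are not shown here.

    # VLANs confirmed as configured/trunked for the environment (hypothetical values).
    configured_vlans = {101, 102, 203}

    # The planned external network and the VLAN it expects to be backed by
    # (hypothetical values; VLAN 310 is intentionally absent above, so the
    # check fails, mirroring the missed configuration task).
    planned_external_network = {"name": "ext-net-new", "vlan_id": 310}

    def precheck_vlan(network, vlans):
        """Fail fast if the backing VLAN for a new external network is not ready."""
        vlan_id = network["vlan_id"]
        if vlan_id not in vlans:
            raise RuntimeError(
                f"Pre-change check failed: VLAN {vlan_id} for {network['name']} "
                "is not configured; do not proceed with the NSX change."
            )
        print(f"Pre-change check passed: VLAN {vlan_id} is configured for {network['name']}.")

    if __name__ == "__main__":
        precheck_vlan(planned_external_network, configured_vlans)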

8:55am – During the rollout stage of the MOP, a Green Cloud engineer added the new external network to the VMware NSX policy. Green Cloud has an open case with VMware to understand the specific series of events that followed, but symptoms indicate that because the VLAN configuration was not valid for the new external network, the NSX policy was invalidated for all external networks, causing an immediate outage for virtual routers in Houston’s IaaS environment.

9:33am – After multiple Partner reports of connectivity loss and escalation of the master Incident to the Cloud Infrastructure team, engineers identified the invalid policy configuration and reverted the network change. This restored connectivity to all affected virtual routers in Houston.

Root Cause Analysis

Green Cloud is working closely with VMware support to confirm root cause, but based on our investigation it appears that the missed VLAN configuration caused the new external network to be marked as invalid, which in turn invalidated the NSX policies that included this new network. Backing out that configuration change restored service.
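For illustration only, the short sketch below models the behavior described above: a policy treated as valid only if every member external network is valid, so a single network with a missing VLAN configuration invalidates the whole policy, and removing it restores validity. This models the observed symptoms, not VMware’s actual implementation, which is still being confirmed with VMware support; the network names and flags are hypothetical.

    # Illustrative sketch only: models the observed symptoms (one invalid external
    # network invalidating the whole policy), not VMware's actual implementation.

    def policy_is_valid(external_networks):
        """Treat a policy as valid only if every member network is valid."""
        return all(net["valid"] for net in external_networks)

    # Hypothetical state before and after the change.
    existing = [{"name": "ext-net-1", "valid": True}, {"name": "ext-net-2", "valid": True}]
    new_network = {"name": "ext-net-new", "valid": False}  # missing VLAN configuration

    print(policy_is_valid(existing))                  # True  - before the change
    print(policy_is_valid(existing + [new_network]))  # False - outage condition
    print(policy_is_valid(existing))                  # True  - after the revert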

Remediation

Green Cloud will continue to work with VMware to understand the catalyst for the issue and adjust procedural documentation and training where appropriate to ensure this issue does not recur. Additionally, all scheduled and planned maintenance that involves growing IP address space will be escalated for peer review and will be subject to approval through the Change Advisory Board, at least until root cause analysis is completed.

Posted Oct 10, 2019 - 14:11 EDT

Resolved
Green Cloud has monitored this issue all day and has seen no recurrence. Please reach out to our Support team if you have any ongoing impact. We will be providing a post-mortem here on our Status page in the coming days. Let us know if there's anything else we can assist with. We are deeply sorry for the impact today's event caused you.
Posted Oct 03, 2019 - 17:04 EDT
Monitoring
Green Cloud is still investigating the root cause, but all customers have reported full restoration of service. We're continuing to monitor at this time. Please reach out to Support if you have any questions or issues.
Posted Oct 03, 2019 - 10:15 EDT
Update
Green Cloud is continuing to investigate this issue, but we are getting reports of recovery. Please verify your systems are back up and email support@gogreencloud.com if not. We will be keeping this incident open until we've identified the root cause.
Posted Oct 03, 2019 - 09:43 EDT
Investigating
The Network Operations Center is investigating a service disruption affecting IaaS and Private Cloud internet connectivity in our Houston datacenter. More details will be provided in updates to follow within the next 30 minutes.
Posted Oct 03, 2019 - 09:27 EDT
This incident affected: Network (Network - Houston, TX), IaaS (IaaS - Houston, TX), and DaaS (DaaS - Houston, TX).