Event Start: Thursday, October 3rd, 8:55am Eastern
Event Resolution: Thursday, October 3rd, 9:33am Eastern
Summary
During preparation for IP address space growth in Houston’s IaaS environment, a VLAN configuration step was missed. This led to a disruption in Internet connectivity for partners using virtual routers (e.g., Cisco ASAv).
Timeline of Events
8:00-8:30am – A Green Cloud engineer prepared to add a new external network to the Green Cloud public cloud (IaaS) infrastructure in Houston, in order to provide additional IP address capacity, by following the pre-change steps of the published Method of Procedure (MOP). A VLAN configuration task specified in the MOP was missed during this preparation.
8:55am – During the roll-out stage of the MOP, a Green Cloud engineer added the new external network to the VMware NSX policy. Green Cloud has an open case with VMware to understand the specific series of events that followed. Symptoms indicate that, because the VLAN configuration was not valid for the new external network, the NSX policy was invalidated for all external networks, causing an immediate outage for virtual routers in Houston’s IaaS environment.
9:33am – After multiple Partners reported connectivity loss and the master Incident was escalated to the Cloud Infrastructure team, the invalid policy configuration was discovered and the network change was reverted. This action restored connectivity to all affected virtual routers in Houston.
Root Cause Analysis
Green Cloud is working closely with VMware support to confirm the root cause. Our investigation so far indicates that the incorrect VLAN configuration caused the new external network to be marked as invalid, which in turn invalidated the NSX policies that included this new network. Backing out that configuration change restored service.
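The suspected failure mode can be illustrated with a small toy model. This is only a sketch and is not VMware NSX code or its API: the class, field, and function names are hypothetical, and the rule that one invalid member network invalidates the whole policy is the behavior inferred from symptoms, which VMware has not yet confirmed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExternalNetwork:
    name: str
    # Hypothetical flag: True once the VLAN step in the MOP has been completed.
    vlan_configured: bool


def network_is_valid(network: ExternalNetwork) -> bool:
    # Assumption: a network without its VLAN configuration is marked invalid.
    return network.vlan_configured


def policy_is_valid(networks: List[ExternalNetwork]) -> bool:
    # Assumption (unconfirmed): the policy is valid only if every
    # external network it includes is valid.
    return all(network_is_valid(n) for n in networks)


# Placeholder network names, for illustration only.
existing = [ExternalNetwork("ext-a", True), ExternalNetwork("ext-b", True)]
new_network = ExternalNetwork("ext-new", False)  # VLAN step was missed

before = policy_is_valid(existing)                       # policy healthy
after_change = policy_is_valid(existing + [new_network])  # policy invalidated
after_rollback = policy_is_valid(existing)                # change reverted
```

Under these assumptions, adding the single misconfigured network is enough to invalidate the policy for the existing networks as well, and removing it restores the policy, which matches the observed outage and recovery.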
Remediation
Green Cloud will continue to work with VMware to understand what triggered the issue, and will adjust procedural documentation and training where appropriate to prevent a recurrence. Additionally, all scheduled and planned maintenance that involves growing IP address space will be escalated for peer review and will require approval from the Change Advisory Board, at least until the root cause analysis is complete.
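The interim approval rule can be expressed as a simple gate. The sketch below is illustrative only: the `ChangeRequest` fields and the `may_proceed` function are hypothetical names, not part of any Green Cloud tooling, and the sketch does not model whatever rules apply to other kinds of change.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    summary: str
    grows_ip_space: bool         # does the change add IP address capacity?
    peer_reviewed: bool = False  # hypothetical flag set after peer review
    cab_approved: bool = False   # hypothetical flag set after CAB approval


def may_proceed(change: ChangeRequest) -> bool:
    # Changes outside the interim rule are not modeled in this sketch.
    if not change.grows_ip_space:
        return True
    # Interim rule: IP address space growth needs both peer review
    # and Change Advisory Board approval.
    return change.peer_reviewed and change.cab_approved
```

For example, `may_proceed(ChangeRequest("Add external network", grows_ip_space=True, peer_reviewed=True))` evaluates to `False` until `cab_approved` is also set.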