On 23 November 2022, from 05:00 to 11:00 UTC, 11:11 engineers conducted scheduled maintenance in the Atlanta data center to mitigate a possible VMware bug. The suspected bug caused NSX to be out of sync on certain hosts in the IaaS cluster. 11:11 engineers worked with VMware and received a set of maintenance instructions, which included steps to redeploy Distributed Logical Router (DLR) appliances and to re-sync network appliances and firewalls by moving them to a new port group and then back to the original port group.
Starting at 10:20 UTC, customers began experiencing issues accessing their environments. Engineers determined that the incident was due to a loss of VMware NSX SpoofGuard configuration. The NSX SpoofGuard feature is used to prevent IP conflicts by maintaining a database of IP-to-MAC address mappings. When a network appliance or firewall is disconnected from a port group, its mapping is automatically deleted. Because the mappings were no longer available in SpoofGuard, customer environments became inaccessible when the appliances were returned to their original port groups.
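The failure mode described above can be illustrated with a toy model. This is only a sketch, not NSX code: the `ToySpoofGuard` class, the `move_and_return` helper, the port group names, and the MAC and IP addresses are all hypothetical, and keying the table by port group and MAC address is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToySpoofGuard:
    """Hypothetical toy model of an approved IP-to-MAC table (not the NSX API)."""
    # Assumption: entries are keyed by (port_group, mac).
    approved: dict = field(default_factory=dict)

    def approve(self, port_group, mac, ips):
        # Record the set of IPs approved for this MAC on this port group.
        self.approved[(port_group, mac)] = set(ips)

    def disconnect(self, port_group, mac):
        # Modeled on the behavior described above: leaving a port group
        # deletes the mapping for that appliance.
        self.approved.pop((port_group, mac), None)

    def allows(self, port_group, mac, ip):
        # Traffic is allowed only if the IP is approved for this MAC.
        return ip in self.approved.get((port_group, mac), set())


def move_and_return(guard, mac, original, temporary):
    """Move an appliance to a temporary port group and then back."""
    guard.disconnect(original, mac)   # appliance leaves the original port group
    guard.disconnect(temporary, mac)  # appliance leaves the temporary port group
    # Nothing re-approves the mapping when the appliance returns.


MAC = "00:50:56:aa:bb:cc"                 # hypothetical appliance MAC
PRIMARY, SECONDARY = "192.0.2.10", "192.0.2.11"  # documentation-range IPs

guard = ToySpoofGuard()
guard.approve("pg-original", MAC, [PRIMARY, SECONDARY])
before = guard.allows("pg-original", MAC, PRIMARY)

move_and_return(guard, MAC, "pg-original", "pg-temporary")
after = guard.allows("pg-original", MAC, PRIMARY)

print(before, after)
```

In this model the appliance is reachable before the move and unreachable after it, until `approve` is called again with both the primary and secondary IPs.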
The team worked to identify all impacted customers and reapproved their associated IP-to-MAC address mappings. Engineers were able to quickly verify primary IPs through VMware, but secondary IPs required additional effort to identify. This work was completed by 16:18 UTC, and engineers verified that most customers were restored at that point. Only customers with unique secondary IP configurations, or with impact unrelated to SpoofGuard, may still have seen issues.
The incident was left in monitoring over the weekend to allow for customers with edge-case impact to call in for assistance.
Root Cause:
On 30 November 2022, engineers conducted a Root Cause Analysis (RCA) investigation. The purpose of this session was to review the key events and issues associated with this incident, identify potential causes, confirm the root cause, and develop actions to address the root cause(s).
Engineers determined that the outage occurred because the maintenance process provided by VMware caused IP-to-MAC address mappings to be lost in SpoofGuard when network appliances and firewalls were moved from one port group to another and then back.
Action Items: