Networking issue in Atlanta
Incident Report for Green Cloud Defense
Postmortem

On 23 November 2022, from 05:00 to 11:00 UTC, 11:11 engineers conducted scheduled maintenance in the Atlanta data center to mitigate a possible VMware bug that caused NSX to be out of sync on certain hosts in the IaaS cluster. 11:11 engineers worked with VMware and received a set of maintenance instructions, which included steps to redeploy Distributed Logical Router (DLR) appliances and to re-sync network appliances and firewalls by moving them to a new port group and then back to the original port group.

Starting at 10:20 UTC, customers began experiencing issues accessing their environments. Engineers determined that the incident was due to a loss of VMware NSX SpoofGuard configuration. The NSX SpoofGuard feature is used to prevent IP conflicts by maintaining an IP-to-MAC address database. When a network appliance or firewall is disconnected from a port group, its mapping is automatically deleted. Because the mapping was no longer available in SpoofGuard, customer environments became inaccessible when the appliances were returned to their original port groups.
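The failure mode described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the `ToySpoofGuard` class, the port group name, the MAC address, and the IP addresses are all hypothetical placeholders, and nothing here calls or represents the real NSX API. It only models the behavior stated in this report, namely that an approved mapping is deleted on disconnect and is not recreated when the appliance returns.

```python
from dataclasses import dataclass, field


@dataclass
class ToySpoofGuard:
    """Hypothetical stand-in for an approved IP-to-MAC table (not the NSX API)."""

    # Maps (port_group, mac) -> set of approved IP addresses.
    approved: dict = field(default_factory=dict)

    def approve(self, port_group, mac, ips):
        self.approved[(port_group, mac)] = set(ips)

    def disconnect(self, port_group, mac):
        # Behavior described in the report: the mapping is deleted on disconnect.
        self.approved.pop((port_group, mac), None)

    def is_allowed(self, port_group, mac, ip):
        return ip in self.approved.get((port_group, mac), set())


# Placeholder values for illustration only.
PORT_GROUP = "pg-original"
MAC = "00:50:56:aa:bb:cc"
PRIMARY_IP = "203.0.113.10"
SECONDARY_IP = "203.0.113.11"

sg = ToySpoofGuard()
sg.approve(PORT_GROUP, MAC, [PRIMARY_IP, SECONDARY_IP])
allowed_before = sg.is_allowed(PORT_GROUP, MAC, PRIMARY_IP)

# Maintenance step: move the appliance to a new port group, then back.
sg.disconnect(PORT_GROUP, MAC)
# Reconnecting does not re-approve anything in this model.
allowed_after = sg.is_allowed(PORT_GROUP, MAC, PRIMARY_IP)

print(allowed_before, allowed_after)
```

In this model, traffic stays blocked until someone calls `approve` again for every address, which mirrors the manual reapproval work described in this report.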

The team worked to identify all impacted customers and reapproved their associated IP-to-MAC address mappings. Engineers were able to quickly verify primary IPs through VMware, but secondary IPs required additional effort to identify. This work was completed by 16:18 UTC, and engineers verified that most customers had been restored at that point. Only customers with unique secondary IP configurations, or with impact unrelated to SpoofGuard, may still have seen issues.

The incident was left in monitoring over the weekend to allow for customers with edge-case impact to call in for assistance.

Root Cause:

On 30 November 2022, engineers conducted a Root Cause Analysis (RCA) investigation. The purpose of this session was to review the key events and issues associated with this incident, identify potential causes, confirm the root cause, and develop actions to address it.

Engineers determined that the outage occurred because the maintenance process provided by VMware caused IP-to-MAC mappings to be lost in SpoofGuard when the appliances were moved from one port group to another and then back.

Action Items:

  • Engineers are working to develop a process for creating backups of SpoofGuard configuration information.
  • Engineers are continuing to work with VMware on a permanent fix for the potential new bug.
  • Engineers are updating our internal documentation to include this type of failure condition to prevent impact during future maintenance windows.
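
As a rough illustration of the first action item, the sketch below snapshots a set of IP-to-MAC mappings before maintenance and reports which ones are missing afterwards, so they can be reapproved. It is not the process the engineers are developing: the function names, the `{mac: [ips]}` data shape, and the sample addresses are assumptions made for this example, and how the mappings would actually be exported from or restored to NSX is not covered in this report.

```python
import json


def snapshot(mappings):
    """Serialize {mac: [ips]} to a JSON string with deterministic ordering."""
    normalized = {mac: sorted(ips) for mac, ips in mappings.items()}
    return json.dumps(normalized, sort_keys=True)


def missing_after(snapshot_json, current):
    """Return {mac: [ips]} present in the snapshot but absent from `current`."""
    saved = json.loads(snapshot_json)
    missing = {}
    for mac, ips in saved.items():
        lost = sorted(set(ips) - set(current.get(mac, [])))
        if lost:
            missing[mac] = lost
    return missing


# Hypothetical mappings taken before maintenance (placeholder addresses).
pre_maintenance = {
    "00:50:56:aa:bb:cc": ["203.0.113.10", "203.0.113.11"],  # primary + secondary
    "00:50:56:dd:ee:ff": ["198.51.100.5"],
}
backup = snapshot(pre_maintenance)

# Hypothetical state after maintenance: one appliance lost its mappings.
post_maintenance = {
    "00:50:56:dd:ee:ff": ["198.51.100.5"],
}
to_reapprove = missing_after(backup, post_maintenance)
print(to_reapprove)
```

Recording secondary IPs in the snapshot alongside primary IPs is the point of the exercise, since the secondary addresses were the ones that took the most effort to identify during this incident.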
Posted Nov 30, 2022 - 17:57 EST

Resolved
After monitoring the environment over the weekend, 11:11 engineers have confirmed that the issue has been resolved. Customers who experienced known continued impact as a result of the maintenance have been restored.
Posted Nov 28, 2022 - 12:50 EST
Monitoring
11:11 Systems engineers have resolved all reported customer issues related to this incident and there should not be any risk of recurrence. This issue affected third-party firewalls and appliances such as ASAv or SonicWall devices, and we ask that all partners confirm with each customer using any of those devices that they are working properly. We do believe a number of devices with secondary NAT IP addresses are still impacted, and we are working on ways to proactively address those. However, if you know you have a service using a secondary NAT IP on your device and that service is down, please write in ASAP and we can fix it immediately.

We are placing this incident in monitoring status and will leave it that way over the US holiday weekend but please email support@greenclouddefense.com as soon as possible if you believe that you have continued impact.
Posted Nov 23, 2022 - 11:27 EST
Update
We are continuing to work on resolving the SpoofGuard issue and are seeing recovery for customers as we work through all affected devices. Please reach out to support@greenclouddefense.com if you have any questions.
Posted Nov 23, 2022 - 09:45 EST
Identified
Green Cloud has identified an issue with the SpoofGuard functionality of our underlying NSX network platform that has caused third-party customer routers and firewalls in the Atlanta cloud environment to lose connectivity. We are working on resolving this as quickly as possible, but please email support@greenclouddefense.com if you have any questions.
Posted Nov 23, 2022 - 08:12 EST
This incident affected: Network (Network - Atlanta, GA) and IaaS (IaaS - Atlanta, GA).