Investigating Service Disruption in Houston

Incident Report for 11:11 Systems

Postmortem

Official Post-Mortem

Event Start: Wednesday, December 4th, 3:17pm Eastern
Event Resolution: Wednesday, December 4th, 4:45pm Eastern
Green Cloud Incident Number: 917731

Summary

A configuration change made during routine provisioning triggered unexpected behavior in a new vCloud Director version, disrupting Internet connectivity for partners using non-VMware Edge Gateway devices (e.g., Cisco ASAv).

Timeline of Events

3:17pm Eastern: A Green Cloud team member made a configuration change as part of documented, routine provisioning activity on a device used for IP address management in Green Cloud’s vCloud Director environment. This type of change is standard for provisioning, and the process had not changed. The team member made no error.

3:28pm: Green Cloud started receiving initial reports of a network interruption in Houston.

3:32pm: Initial Statuspage notification was sent, and the Cloud Infrastructure and Engineering teams began investigating. At this time, all network peers were online and no network device failures had been found.

4:05pm: Vendor ticket was opened.

4:24pm: After reviewing diagnostic and transaction logs, Green Cloud engineers determined that the IP management appliance was unexpectedly responding on all allocated IP addresses. This is abnormal behavior under current build standards and tested configurations. The team mitigated it by blocking the network ports on this device until the root cause could be determined.

4:38pm: Port blocking was completed and initial signs of resolution were observed.

4:42-4:45pm: Partners confirmed issue resolution and Statuspage notice was sent.

4:45-6:15pm: Green Cloud actively monitored all systems to confirm the event did not recur, and closed the incident after 1.5 hours with no ongoing impact.

Root Cause Analysis

Upgrading vCloud Director from 8.20 to 9.5 introduced a previously unknown behavior change. The configuration change to the IP management appliance in vCloud triggered this behavior, causing the appliance to use reserved IPs and resulting in a large-scale IP conflict on those subnets. Green Cloud had not seen this behavior before the upgrade, and it was not documented. Green Cloud reviewed the release notes for the upgraded version, and they did not mention the change.
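For readers who want to see what this kind of conflict looks like in data, the following is a minimal sketch, not the tooling Green Cloud used. It assumes you have a list of (IP address, MAC address) pairs observed on a subnet, for example from ARP tables, and flags any IP claimed by more than one MAC. The function name, the sample addresses (taken from documentation ranges), and the MAC values are all hypothetical.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Return {ip: sorted MACs} for every IP claimed by more than one MAC.

    `observations` is an iterable of (ip, mac) string pairs. How they are
    collected (ARP table dump, switch export, etc.) is outside this sketch.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalize case so the same MAC is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical sample data: one device (ending in ff:ff:99) answers for
# addresses that already belong to other devices.
sample_observations = [
    ("203.0.113.10", "00:50:56:aa:aa:01"),
    ("203.0.113.10", "00:50:56:ff:ff:99"),
    ("203.0.113.11", "00:50:56:aa:aa:02"),
    ("203.0.113.12", "00:50:56:aa:aa:03"),
    ("203.0.113.12", "00:50:56:FF:FF:99"),
]

conflicts = find_ip_conflicts(sample_observations)
for ip, macs in sorted(conflicts.items()):
    print(ip, macs)
```

A single MAC appearing in many conflicts at once, as in the sample, would be consistent with one device answering for addresses allocated to others.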

Remediation

In the short term, methods of procedure (MOPs) for planned vCloud 9.x upgrades will be updated to include disabling the IP management appliance's internal network connections to prevent IP conflicts. For full remediation, the overall IP management configuration will be reviewed and updated for all data centers to account for this new behavior. Additionally, lab testing for future vCloud Director updates will now include specific IP provisioning tasks to ensure that no future version has such an impact when put into production.
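One way such a lab check could be expressed is sketched below. It is an illustration under stated assumptions, not Green Cloud's actual test procedure: it assumes the lab can record which addresses the appliance answered for after a provisioning task, and it passes only if that set contains nothing beyond the appliance's own addresses. All names and addresses are hypothetical.

```python
import ipaddress


def unexpected_claims(appliance_addresses, answered_addresses):
    """Return the sorted addresses the appliance answered for that are not its own."""
    own = {ipaddress.ip_address(a) for a in appliance_addresses}
    answered = {ipaddress.ip_address(a) for a in answered_addresses}
    return sorted(answered - own)


def provisioning_check_passes(appliance_addresses, answered_addresses):
    """True when the appliance answered only for its own addresses."""
    return not unexpected_claims(appliance_addresses, answered_addresses)


# Hypothetical lab observations (documentation-range addresses, not real data).
appliance_addresses = ["198.51.100.2"]
expected_behavior = ["198.51.100.2"]
conflicting_behavior = ["198.51.100.2", "198.51.100.20", "198.51.100.21"]

print(provisioning_check_passes(appliance_addresses, expected_behavior))
print([str(a) for a in unexpected_claims(appliance_addresses, conflicting_behavior)])
```

Running a check like this after each provisioning task in the lab, on the target vCloud Director version, would surface the behavior before it reaches production.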

Posted Dec 11, 2019 - 15:30 EST

Resolved

Green Cloud has seen no additional issues with partners for the last 1.5 hours, so we will resolve this incident. We are going to continue to investigate with our vendor for root cause and permanent fixes, and we will be publishing a postmortem here on our Status page when complete. If you are experiencing any ongoing impact, please reach out to our Partner Support team at support@gogreencloud.com.
Posted Dec 04, 2019 - 18:15 EST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Dec 04, 2019 - 17:11 EST

Identified

Green Cloud has identified the root cause of the issue and has performed initial mitigation steps. We are seeing recovery at this time. Please check your service to confirm it is working as expected, and if not, please let us know at support@gogreencloud.com.
Posted Dec 04, 2019 - 16:45 EST

Update

Green Cloud is continuing to investigate the networking issue in Houston. As soon as the issue is identified we will let you know. Please email support@gogreencloud.com with any questions.
Posted Dec 04, 2019 - 16:04 EST

Investigating

The Network Operations Center is investigating a service disruption in our Houston datacenter that appears to be peer- or network-related. More details will be provided in updates as soon as we narrow down the impact.
Posted Dec 04, 2019 - 15:32 EST
This incident affected: Network (Network - Houston, TX) and IaaS (IaaS - Houston, TX).