Official Post-Mortem
Event Start: Wednesday, December 4th, 3:17pm Eastern
Event Resolution: Wednesday, December 4th, 4:45pm Eastern
Green Cloud Incident Number: 917731
Summary
A configuration change made during routine provisioning triggered unexpected behavior in a new version of vCloud Director, disrupting Internet connectivity for partners using non-VMware Edge Gateway devices (e.g., Cisco ASAv).
Timeline of Events
3:17pm Eastern: A Green Cloud team member made a configuration change as part of documented, routine provisioning activity on a device used for IP address management in Green Cloud’s vCloud Director environment. This type of change is standard for provisioning, and the process had not changed. No operator error occurred.
3:28pm: Green Cloud started receiving initial reports of a network interruption in Houston.
3:32pm: Initial Statuspage notification was sent, and the Cloud Infrastructure and Engineering teams began investigating. At this time, all network peers were online and no network device failures had been found.
4:05pm: Vendor ticket was opened.
4:24pm: After reviewing diagnostic and transaction logs, Green Cloud engineers determined that the IP management appliance was unexpectedly responding to all allocated IP addresses, which is abnormal behavior under the current build standard and tested configurations. The team mitigated this behavior by blocking the network ports on the device until the root cause could be determined.
4:38pm: Port blocking was completed and initial signs of resolution were observed.
4:42-4:45pm: Partners confirmed that the issue was resolved, and a Statuspage notice was sent.
4:45-6:15pm: Green Cloud actively monitored all systems to confirm that the event did not recur. The incident was closed after 1.5 hours with no further impact.
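The intervals implied by the timeline can be worked out with a short sketch. The timestamps are taken directly from the entries above; the event labels and the helper name `minutes_since_start` are illustrative only and do not come from any Green Cloud tooling.

```python
from datetime import datetime

# Timestamps from the timeline above (Eastern, written on a 24-hour clock).
# The labels are shorthand chosen for this sketch.
EVENTS = [
    ("configuration change made", "15:17"),
    ("first reports received", "15:28"),
    ("Statuspage notification sent", "15:32"),
    ("vendor ticket opened", "16:05"),
    ("appliance behavior identified", "16:24"),
    ("port blocking completed", "16:38"),
    ("resolution confirmed", "16:45"),
]


def minutes_since_start(events):
    """Map each event label to whole minutes elapsed since the first event."""
    start = datetime.strptime(events[0][1], "%H:%M")
    elapsed = {}
    for label, clock in events:
        delta = datetime.strptime(clock, "%H:%M") - start
        elapsed[label] = int(delta.total_seconds() // 60)
    return elapsed


ELAPSED = minutes_since_start(EVENTS)
for label, minutes in ELAPSED.items():
    print(f"{minutes:3d} min  {label}")
```

On these figures, the first reports arrived 11 minutes after the change, the mitigation was in place after 81 minutes, and resolution was confirmed after 88 minutes.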
Root Cause Analysis
Upgrading vCloud Director from 8.20 to 9.5 introduced a previously unknown behavior change. When the configuration change was made to the IP management appliance in vCloud, this new behavior caused the appliance to use reserved IPs, resulting in a large-scale IP conflict on the affected subnets. Green Cloud had not seen this behavior before the upgrade, and it was not documented. Green Cloud reviewed the release notes for the upgraded version, and the behavior change was not noted there.
Remediation
In the short term, methods of procedure (MOPs) for planned vCloud 9.x upgrades will be updated to include disabling the IP management appliance's internal network connections to prevent IP conflicts. For full remediation, the overall IP management configuration will be reviewed and updated for all data centers to account for this new behavior. Additionally, lab testing for future vCloud Director updates will now include specific IP provisioning tasks to ensure that no future version has a similar impact when put into production.
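One way such a lab check could look is sketched below. This is a minimal illustration, not Green Cloud's actual test procedure: it assumes the lab can produce two lists after a provisioning task, the reserved addresses on a subnet and the addresses the IP management appliance is observed to answer for. How those lists are collected is not described in this report. The function name `find_conflicts` and the sample addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical.

```python
import ipaddress


def find_conflicts(reserved, answered):
    """Return the reserved addresses that the appliance also answers for.

    Both arguments are iterables of IP address strings. Addresses are
    parsed so that differently formatted but equal values still match.
    """
    reserved_set = {ipaddress.ip_address(a) for a in reserved}
    answered_set = {ipaddress.ip_address(a) for a in answered}
    return sorted(reserved_set & answered_set)


# Hypothetical lab data; these are documentation-range addresses,
# not addresses involved in the incident.
reserved = ["192.0.2.1", "192.0.2.10"]
answered = ["192.0.2.10", "192.0.2.50"]

conflicts = find_conflicts(reserved, answered)
if conflicts:
    print("Conflict on:", ", ".join(str(a) for a in conflicts))
else:
    print("No conflicts found")
```

A lab run would pass only if the result is empty after each provisioning task exercised against the new version.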