Date of Incident: May 13, 2016
Start of Incident: 1:25am
End of Incident: 9:30am (most issues clearing at 6:00am)
Services Affected: Greenville Datacenter-based IaaS, DRaaS, DaaS, and BaaS
All timestamps are in Eastern Time unless otherwise noted.
In the early morning hours of Friday, May 13, we failed in our goal of providing continuous uptime for our customers when the network in our Greenville, SC data center failed during an announced, planned maintenance window. All maintenance procedures were followed per our change control process, but, as is often the case, the unexpected happened. Avoiding downtime is, ironically, the very reason for maintenance windows and proactive network maintenance. Everyone at Green Cloud takes issues like this to heart, and we will be working closely together to reduce the possibility of customer impact from events like this in the future. We are truly sorry for the impact this had on our customers' operations. Below you will find a description of the event and the steps being taken to improve our performance going forward.
At 1:25am, during scheduled network maintenance in preparation for a software upgrade of our core switches, we experienced a software failure. The secondary core switch in our Greenville, SC datacenter entered a misconfigured state after a planned reboot. This misconfiguration blocked jumbo frame traffic between our VMware compute hosts and our storage platforms, effectively severing communications and taking the clusters offline. The misconfiguration was not easily identified because, although the switch displayed the relevant configuration as active, it had not actually been applied at boot. This state also triggered a configuration mismatch between the redundant switches, causing the primary switch in the pair to stop passing traffic as well. Our Network Operations team performed reboots and fallback attempts per our change control guidelines, but the issue did not clear. At 2:30am vendor support was engaged. At 4:30am the vendor identified an issue similar to the one we were experiencing and had us re-enter the commands enabling jumbo frame support. Upon entering these commands, traffic flows returned.
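The root of this failure was configuration that the switch reported as active but had not actually applied after the reboot. One general way to catch that class of problem is to diff the intended per-interface MTU against the operational MTU after any reboot. The sketch below is a hypothetical illustration of that check; the interface names, MTU values, and data format are assumptions, not Green Cloud's actual tooling:

```python
# Hypothetical sketch: compare the MTU each interface *should* have
# (from the saved configuration) against the MTU it is *actually*
# running with, to catch settings that did not survive a reboot.

def find_mtu_mismatches(intended: dict, operational: dict) -> list:
    """Return (interface, wanted, actual) tuples where the active MTU
    differs from the saved configuration.

    Both arguments map interface name -> MTU in bytes; an interface
    missing from `operational` reports an actual MTU of None.
    """
    mismatches = []
    for iface, want in intended.items():
        have = operational.get(iface)
        if have != want:
            mismatches.append((iface, want, have))
    return mismatches

# Example: jumbo frames (MTU 9216) saved on the storage uplinks, but
# one interface came back up at the 1500-byte default after reboot.
saved = {"Eth1/1": 9216, "Eth1/2": 9216}
active = {"Eth1/1": 9216, "Eth1/2": 1500}
print(find_mtu_mismatches(saved, active))  # [('Eth1/2', 9216, 1500)]
```

Running a comparison like this as a post-reboot verification step surfaces a silently dropped jumbo-frame setting before storage traffic is affected.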
Even though connectivity was re-established, the VMware hosts and all associated virtual machine guests had been isolated from their storage for an extended period. As a result, many of the hosted virtual machines had entered an errored state at the operating system level. Since it is common for Linux-based systems to remount their filesystems read-only during a storage failure of this type, an automated reboot of all of the Linux-based gateways, such as the vShield Edge, ASAv, and CSR1000V platforms, was initiated. This automated reboot cleared many of the connectivity issues for affected customers between 5:30am and 6:00am. Because it was not possible to easily verify the correct operation of the operating systems inside all customer-managed VMs, some customers experienced lingering connectivity issues from the initial clearing until approximately 9:30am, when the last customer issues were resolved.
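On Linux, the read-only fallback described above (for example, ext4's default `errors=remount-ro` behavior) is visible in `/proc/mounts`, so it can be detected programmatically. The following sketch parses `/proc/mounts`-style text for mounts carrying the `ro` option; the sample data is illustrative, not taken from the affected systems:

```python
# Sketch: detect filesystems the kernel has remounted read-only, as
# ext4 does after an I/O error when mounted with errors=remount-ro.
# Each /proc/mounts line is: device mountpoint fstype options dump pass.

def read_only_mounts(mounts_text: str) -> list:
    """Return mount points whose option list includes 'ro'."""
    ro = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            options = fields[3]
            if "ro" in options.split(","):
                ro.append(fields[1])  # the mount point
    return ro

# Illustrative sample: the root filesystem has gone read-only after
# losing its storage; /data is still healthy.
sample = """\
/dev/sda1 / ext4 ro,relatime,errors=remount-ro 0 0
/dev/sdb1 /data ext4 rw,relatime 0 0
"""
print(read_only_mounts(sample))  # ['/']
```

In practice the input would be read from `/proc/mounts` on each guest; a VM reporting a read-only root is a candidate for the kind of automated reboot described above.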
To reiterate, the incremental nature of the return to service meant that some customers were fully available at the initial remediation at 4:30am, while others came back online as they were remedied by automated or reactive actions of our Customer Operations team at various times up to 9:30am.
1) Communication: Many partners and customers have noted that they were not aware of the maintenance activities or the associated outage. We will be actively communicating the process for signing up for notifications, and the location of the status page, to existing partners and customers so that everyone is aware of both in the future. We will also be adding a step to new partner and customer on-boarding to better communicate this process.
2) Recovery automation: We will be working to extend our automated procedures for recovering virtual machines in an unavailable state. Fortunately, much was learned during this event that will improve our performance in the future.
3) Upgrade of affected gear: This maintenance was a precursor to scheduled core switch upgrades, so those upgrades will now be conducted to remediate the issue encountered, as well as any other vendor-corrected bugs and issues.
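The recovery automation described in item 2 above amounts to finding VMs stuck in a bad state and restarting them. The sketch below illustrates that loop under stated assumptions: the `VMClient` class and its `list_vms`/`reboot` methods are hypothetical stand-ins, not Green Cloud's actual platform API.

```python
# Hypothetical sketch of recovery automation: enumerate VMs, reboot
# any in an errored/unavailable state, and report what was touched.
# VMClient is a stand-in for a real virtualization-platform client.

class VMClient:
    """Illustrative stand-in for a virtualization-platform API."""

    def __init__(self, vms: dict):
        self._vms = vms  # name -> state

    def list_vms(self) -> dict:
        return dict(self._vms)

    def reboot(self, name: str) -> None:
        # A real client would issue a hard reset and poll for boot;
        # here we simply mark the VM as recovered for illustration.
        self._vms[name] = "running"

def recover_unavailable(client, bad_states=("errored", "unavailable")):
    """Reboot every VM in a bad state; return the names rebooted."""
    rebooted = []
    for name, state in client.list_vms().items():
        if state in bad_states:
            client.reboot(name)
            rebooted.append(name)
    return rebooted

# Example run: one gateway stuck, one healthy.
client = VMClient({"edge-gw-1": "errored", "csr-2": "running"})
print(recover_unavailable(client))  # ['edge-gw-1']
```

A production version would add retry limits, verification that the guest actually came back healthy, and escalation to an operator when an automated reboot does not clear the state.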