Services Outage in Greenville, SC datacenter
Incident Report for Green Cloud Defense
Postmortem


Service Outage Postmortem

Eric Hester, VP Engineering and Operations

Date of Incident: May 13, 2016
Start of Incident: 1:25am
End of Incident: 9:30am (most issues cleared by 6:00am)
Services Affected: Greenville Datacenter based IaaS, DRaaS, DaaS, and BaaS
All timestamps are in Eastern Time unless otherwise noted.

In the early morning hours of Friday, May 13, we fell short of our goal of providing continuous uptime for our customers when our Greenville, SC data center network failed during an announced and planned maintenance window. All maintenance procedures were followed per our change control process, but, as is often the case, the unexpected happened. Ironically, avoiding downtime is the very reason for maintenance windows and for proactive maintenance of the network. Everyone at Green Cloud takes issues like this to heart, and everyone here will be working closely to reduce the possibility of customer impact from events such as this in the future. We are all truly sorry for the impact this had on our customers' operations. Below you will find a description of the event and the steps being taken to improve our performance in the future.

Incident Timeline

At 1:25am, during scheduled maintenance of our network in preparation for a software upgrade of our core switches, we experienced a software failure. The secondary core switch in our Greenville, SC datacenter entered a misconfigured state after a planned reboot. This misconfiguration blocked jumbo frame traffic between our VMware compute hosts and our storage platforms, effectively severing communications and taking the clusters offline. The misconfiguration was not easily identified because, although the switch displayed the relevant configuration as active, it had not been applied at boot. This state also triggered a configuration mismatch between the redundant switches, causing the primary switch in the pair to stop passing traffic as well. Our Network Operations team performed reboots and fallback attempts per our change control guidelines, but the issue did not clear. At 2:30am vendor support was engaged. At 4:30am the vendor identified an issue similar to the one we were experiencing and had us re-enter the commands enabling jumbo frame support. Upon entering these commands, traffic flows returned.
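
For context on the verification step, the sketch below shows one way to confirm that jumbo frames actually traverse the path between a host and its storage targets. It is a minimal illustration, not the procedure our teams used: the target addresses are placeholders, and it assumes a Linux host with the standard iputils ping (on an ESXi host, "vmkping -d -s 8972 <target>" performs the equivalent test).

    # jumbo_check.py -- minimal sketch: verify that full-size jumbo frames
    # survive the path from this host to each storage target without
    # fragmentation. The addresses below are placeholders, not real storage IPs.
    import subprocess

    STORAGE_TARGETS = ["10.0.10.11", "10.0.10.12"]   # hypothetical storage VLAN IPs
    PAYLOAD = "8972"   # 9000-byte MTU minus 28 bytes of IP + ICMP headers

    def jumbo_ok(target: str) -> bool:
        """Return True if an unfragmented jumbo-sized ping reaches the target."""
        result = subprocess.run(
            ["ping", "-M", "do", "-s", PAYLOAD, "-c", "2", "-W", "2", target],
            capture_output=True,
        )
        return result.returncode == 0

    if __name__ == "__main__":
        for target in STORAGE_TARGETS:
            status = "OK" if jumbo_ok(target) else "FAILED -- check switch MTU config"
            print(f"{target}: jumbo frame path {status}")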

Even though connectivity was re-established, the VMware hosts and all associated virtual machine guests had been isolated from their storage for an extended period of time. As a result, many of the hosted virtual machines had entered an error state at the operating system level. Since it is common for Linux-based systems to remount their filesystems read-only during a storage failure of this type, an automated reboot of all of the Linux-based gateways, such as the vShield Edge, ASAv, and CSR1000V platforms, was initiated. This automated reboot cleared many of the connectivity issues for affected customers between 5:30am and 6:00am. Because it was not possible to easily verify the correct operation of the operating systems inside all customer-managed VMs, some customers experienced lingering connectivity issues from the initial clearing until approximately 9:30am, when the last customer issues were cleared.
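
For illustration, the sketch below shows the kind of check that can drive such an automated reboot: detecting that a guest's root filesystem has been remounted read-only after a storage interruption. It is a simplified example, not Green Cloud's actual automation.

    # ro_check.py -- minimal sketch: detect whether the root filesystem has been
    # remounted read-only (the typical Linux response to the storage interruption
    # described above). Illustrative only.
    def root_is_readonly(mounts_path: str = "/proc/mounts") -> bool:
        """Return True if the root filesystem is currently mounted read-only."""
        readonly = False
        with open(mounts_path) as mounts:
            for line in mounts:
                device, mountpoint, fstype, options = line.split()[:4]
                if mountpoint == "/":
                    # Keep the last entry for "/" in case of stacked mounts.
                    readonly = "ro" in options.split(",")
        return readonly

    if __name__ == "__main__":
        if root_is_readonly():
            print("Root filesystem is read-only; a reboot or remount is required.")
        else:
            print("Root filesystem is writable; no action needed.")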

To reiterate, the incremental nature of the return to service meant that some customers were fully available at the initial remediation at 4:30am, while others were restored by automated or reactive actions of our Customer Operations team at various times up to approximately 9:30am.

Remediation steps

1) Communication: Many partners and customers have noted they were not aware of the maintenance activities or the associated outage. We will be actively communicating the process for signing up for notifications and the location of the status page to existing partners and customers to make sure everyone is aware of both in the future. We will also be adding a step during new partner and customer on-boarding to better communicate the process.

2) Recovery automation: We will be working to extend our automated procedures for recovering virtual machines left in an unavailable state. Fortunately, much was learned from this event that will improve our performance in the future. A sketch of what this kind of automation can look like follows this list.

3) Upgrade of affected gear: This maintenance was a precursor to scheduled core switch upgrades, so those upgrades will now be conducted to remediate the issue encountered as well as any other vendor-corrected bugs and issues.
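
As a concrete illustration of remediation item 2, the sketch below uses the pyVmomi library to find powered-on virtual machines whose guest heartbeat is unhealthy and queue them for a reset. The vCenter address, credentials, and heartbeat-based health test are illustrative assumptions rather than a description of Green Cloud's actual tooling, and the reset call is left commented out so the script runs as a dry run.

    # vm_recovery_sketch.py -- minimal sketch of recovery automation: list
    # powered-on VMs whose guest heartbeat is red or gray and queue them for a
    # reset. The vCenter address and credentials below are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    def unhealthy_vms(content):
        """Yield powered-on VMs whose guest heartbeat indicates trouble."""
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            powered_on = vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOn
            if powered_on and vm.guestHeartbeatStatus in ("red", "gray"):
                yield vm

    if __name__ == "__main__":
        context = ssl._create_unverified_context()   # lab use only
        si = SmartConnect(host="vcenter.example.com",   # placeholder address
                          user="automation", pwd="changeme", sslContext=context)
        try:
            for vm in unhealthy_vms(si.RetrieveContent()):
                print(f"would reset: {vm.name}")
                # vm.ResetVM_Task()   # uncomment to actually issue the reset
        finally:
            Disconnect(si)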

Posted May 16, 2016 - 09:36 EDT

Resolved
The issue has remained resolved since the last update. There may be some lingering issues with individual virtual machines necessitating a reboot of the VM to return it to service, but there are no infrastructure issues preventing service operation. Operations is proactively investigating these issues and resolving them as they are found. If you have any issues or questions, please contact our support center at 1-877-465-1217 or support@gogreencloud.com.
Posted May 13, 2016 - 08:09 EDT
Monitoring
We continue to restore services. Most of the IaaS infrastructure is now available, and Private Cloud and DaaS environments are being evaluated at this time as well. Most services are expected to be online by 6am.
Posted May 13, 2016 - 05:50 EDT
Identified
Root cause has been identified and services are beginning to be restored. We will provide updates as restoration of services continues.
Posted May 13, 2016 - 04:50 EDT
Update
As part of scheduled maintenance activities on our core network switching equipment in our Greenville, SC data center, we experienced a failure preventing the flow of traffic between our VMware clusters and the storage networks. After extensive troubleshooting by our Engineering and Network Operations teams, we have now engaged our vendors to aid in troubleshooting and will provide another update at least every 30 minutes as we progress.
Posted May 13, 2016 - 04:04 EDT
Investigating
Green Cloud is investigating service disruptions for services in our Greenville, SC datacenter related to this morning's scheduled maintenance activities. We'll provide more information as we know more.
Posted May 13, 2016 - 02:32 EDT
This incident affected: Network (Network - Greenville, SC), DaaS (DaaS - Greenville, SC), BaaS (BaaS with Veeam - Greenville, SC), and IaaS (IaaS - Greenville, SC).