Reason for Outage (RFO) – Atlanta IaaS Service Interruption
- Event Date: October 16, 2018
- Event Start Time: 10:54 a.m. Eastern
- Event Resolution Time: 11:58 a.m. Eastern
Summary
At approximately 10:54 a.m. Eastern on Tuesday, October 16, 2018, Green Cloud experienced a loss of network connectivity to IaaS services in Atlanta, GA, due to a power event with an internet carrier. The carrier restored power at 11:58 a.m., and IaaS services were returned to fully operational status.
Event Timeline
- 10:54 a.m. – Green Cloud NOC opened a Priority 1 Incident, and engineers immediately began investigating the loss of connectivity.
- 11:02 a.m. – The major incident response process was invoked: the network status page was updated with a new event, event notifications were delivered, and an internal crash bridge was opened for the outage response team.
- 11:07 a.m. – The outage response team isolated the cause of the connectivity loss to the failure of both links of a redundant interconnect to an internet carrier in the Atlanta data center. Vendor incidents were simultaneously opened with the carrier in question, Internap (INAP), and with QTS, the Atlanta data center vendor.
- 11:15 a.m. – Green Cloud architects began planning to fail over network traffic.
- 11:26 a.m. – INAP indicated that it had dispatched a technician to the QTS location.
- 11:40 a.m. – The dispatched INAP technician found an INAP inventory manager on-site at QTS in the process of decommissioning equipment. INAP determined that power had been erroneously disconnected from the entire rack in which Green Cloud circuits are provisioned.
- 11:45 a.m. – Green Cloud established the scope of the changes required to move network traffic and mitigate the outage.
- 11:58 a.m. – INAP restored power to the affected rack.
- 12:00 p.m. – Green Cloud NOC received a cleared alert notification; engineers confirmed that access to IaaS was restored, and the event status was updated from Investigating to Monitoring.
- 12:45 p.m. – The event status was updated from Monitoring to Resolved.
Root Cause
INAP identified the root cause of the incident as human error by an inventory manager during a decommissioning maintenance task. Network failover between the dual links provisioned for Green Cloud’s network traffic did not occur because both of the redundant network routers in INAP's rack lost power.
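The failure mode described here, two redundant links that share one power source, can be illustrated with a minimal sketch. It assumes a dependency inventory is available for each link; the link names, component names, and the `shared_dependencies` helper are hypothetical and are not taken from Green Cloud's or INAP's tooling.

```python
from typing import Dict, Set


def shared_dependencies(links: Dict[str, Set[str]]) -> Set[str]:
    """Return the components that every link depends on.

    Any component in the result is a single point of failure for the
    whole group of links, however redundant the links look otherwise.
    """
    dependency_sets = list(links.values())
    if not dependency_sets:
        return set()
    return set.intersection(*dependency_sets)


# Hypothetical inventory: each link has its own router, but both routers
# draw power from the same rack.
redundant_links = {
    "link-a": {"carrier-router-1", "carrier-rack-power"},
    "link-b": {"carrier-router-2", "carrier-rack-power"},
}

single_points = shared_dependencies(redundant_links)
print(sorted(single_points))
```

In this hypothetical inventory the only shared component is the rack power feed, which mirrors why link-level redundancy alone did not trigger a failover.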
Remediation
Green Cloud completed maintenance the following evening, Thursday, October 18, at 12:00 a.m. Eastern, moving the ATL data center internet traffic to its own dual service-provider-class routers and implementing multi-carrier connectivity via BGP. This network maintenance moved full control of the failover process to Green Cloud and removed the dependency on a single carrier at that site, regardless of the level (or lack) of the carrier’s power or link redundancy.
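As a rough illustration of why multi-carrier connectivity removes the single-carrier dependency, the sketch below models route selection across two carriers. It is a deliberate simplification and not Green Cloud's configuration: real BGP best-path selection weighs further attributes, such as AS path length, and the carrier names, preference values, and the `select_route` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    carrier: str      # hypothetical carrier label
    local_pref: int   # higher is preferred, as with BGP LOCAL_PREF
    reachable: bool   # whether the session to this carrier is up


def select_route(routes: List[Route]) -> Optional[Route]:
    """Pick the most preferred route among carriers that are still reachable."""
    candidates = [r for r in routes if r.reachable]
    if not candidates:
        return None  # no carrier available: a full outage
    return max(candidates, key=lambda r: r.local_pref)


# Normal operation: both carriers are up, so the preferred one carries traffic.
normal = [Route("carrier-a", 200, True), Route("carrier-b", 100, True)]
primary = select_route(normal)

# Simulated power loss at carrier-a: traffic moves to carrier-b.
degraded = [Route("carrier-a", 200, False), Route("carrier-b", 100, True)]
fallback = select_route(degraded)

print(primary.carrier, "->", fallback.carrier)
```

With a single carrier, the degraded case would leave no candidate route at all. With two, the loss of one carrier changes the selected path without taking the site offline.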