Internet Provider Service Disruption in our Atlanta Datacenter
Incident Report for Green Cloud Defense
Postmortem

Reason for Outage (RFO) – Atlanta IaaS Service Interruption

  • Event Date: October 16, 2018
  • Event Start Time: 10:54 a.m. Eastern
  • Event Resolution Time: 11:58 a.m. Eastern

Summary

At approximately 10:54 a.m. Eastern on Tuesday, October 16, 2018, Green Cloud experienced a loss of network connectivity to IaaS services in Atlanta, GA, due to a power event with an internet carrier. The carrier restored power at 11:58 a.m., and IaaS services were returned to fully operational status.

Event Timeline

  • 10:54 a.m. – Green Cloud NOC opened a Priority 1 Incident, and engineers immediately began investigating the loss of connectivity.
  • 11:02 a.m. – The major incident response process was invoked: the network status page was updated with a new event, event notifications were delivered, and an internal crash bridge was opened for the outage response team.
  • 11:07 a.m. – The outage response team isolated the cause of the connectivity loss to the loss of both links of a redundant interconnect to an internet carrier in the Atlanta data center. Vendor incidents were opened simultaneously with the carrier in question, Internap (INAP), and with QTS, the Atlanta data center vendor.
  • 11:15 a.m. – Green Cloud architects began planning to fail over network traffic.
  • 11:26 a.m. – INAP indicated that it had dispatched a technician to the QTS location.
  • 11:40 a.m. – The dispatched INAP technician found an INAP inventory manager on-site at QTS in the process of decommissioning equipment. INAP determined that power had been erroneously disconnected from the entire rack in which Green Cloud circuits are provisioned.
  • 11:45 a.m. – Green Cloud established the scope of the changes required to move network traffic and mitigate the outage.
  • 11:58 a.m. – INAP restored power to the affected rack.
  • 12:00 p.m. – Green Cloud NOC received a cleared alert notification; engineers confirmed that access to IaaS was restored, and the event status was updated from Investigating to Monitoring.
  • 12:45 p.m. – Event status was updated from Monitoring to Resolved.

Root Cause

INAP identified the root cause of the incident as human error by an inventory manager during a decommissioning maintenance task. Network failover between the dual links provisioned for Green Cloud’s network traffic did not occur because both of the redundant network routers in INAP's rack lost power.

Remediation

Green Cloud completed maintenance the following evening, Thursday, October 18th, at 12:00 a.m. Eastern, to move the ATL data center internet traffic to its own dual service-provider-class routers and to implement multi-carrier connectivity via BGP. This network maintenance gave Green Cloud full control of the failover process and removed the dependency on a single carrier at that site, regardless of the level (or lack) of that carrier’s power or link redundancy.
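
The difference between the two designs can be illustrated with a small model of shared failure domains. The sketch below is illustrative only and is not Green Cloud's or INAP's tooling: the component labels are hypothetical, and it assumes that after the maintenance each carrier path has no dependency in common with the other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One route to the internet and the components it depends on."""
    name: str
    depends_on: frozenset


def has_connectivity(paths, failed):
    """True if at least one path has none of its dependencies failed."""
    return any(not (p.depends_on & failed) for p in paths)


def single_points_of_failure(paths):
    """Components whose failure alone takes down every path."""
    components = set().union(*(p.depends_on for p in paths))
    return sorted(
        c for c in components
        if not has_connectivity(paths, frozenset({c}))
    )


# Hypothetical labels for the original layout: two links, both terminating
# on routers that share power in a single carrier rack.
before = [
    Path("link-a", frozenset({"carrier-router-a", "carrier-rack-power"})),
    Path("link-b", frozenset({"carrier-router-b", "carrier-rack-power"})),
]

# Hypothetical labels for the post-maintenance layout: two carriers, each
# reached through a separate Green Cloud router (assumed independent).
after = [
    Path("carrier-1", frozenset({"gc-router-1", "carrier-1-power"})),
    Path("carrier-2", frozenset({"gc-router-2", "carrier-2-power"})),
]

print(single_points_of_failure(before))  # ['carrier-rack-power']
print(single_points_of_failure(after))   # []
```

In this model the dual links survive the loss of either router, but not the loss of the rack power they both depend on, which matches the failure described under Root Cause.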

Posted Oct 24, 2018 - 14:59 EDT

Resolved
We have not seen any recurrence of this issue for over an hour. The Network Operations Center will continue to monitor our infrastructure for stability and performance. We strive to proactively identify any lingering issues resulting from this incident, but please contact us immediately if your service is not performing up to your expectations.

We will provide an RFO for this event in the coming days, once our vendor and engineering teams have completed their investigation and we have a clear picture of the root cause and future mitigation steps. We will post it here on our Status page and provide it to all partners who requested it via support ticket.

As always, if you need anything else, we can be reached at 877-465-1217 or support@gogreencloud.com. We sincerely apologize for the inconvenience this issue caused you and your customers.
Posted Oct 16, 2018 - 14:29 EDT
Monitoring
It appears that all services have been restored. Green Cloud engineers continue to monitor for lingering issues and will resolve any that come up. We will work with our vendor partner to determine the root cause so that we can provide you and your customers with as much information as possible. Please call us at 877-465-1217 or email support@gogreencloud.com if your service has not been restored.
Posted Oct 16, 2018 - 12:17 EDT
Update
Our carrier appears to be beginning to restore service, and our engineers continue to work with our vendor partners to make sure everything comes back properly. Please call us at 877-465-1217 or email support@gogreencloud.com if you have further questions or there is anything we can assist with.
Posted Oct 16, 2018 - 12:03 EDT
Update
Our carrier in Atlanta has resources en route to the Atlanta datacenter to work on restoring service. We will post more as we learn more, but please call us at 877-465-1217 or email support@gogreencloud.com if you have further questions or there is anything we can assist with.
Posted Oct 16, 2018 - 11:39 EDT
Identified
Green Cloud engineers have identified the issue as a carrier problem in Atlanta. We are currently working with the carrier to resolve this issue ASAP. We can be reached at 877-465-1217 or support@gogreencloud.com. Updates will be provided every 30 minutes until the event is resolved.
Posted Oct 16, 2018 - 11:16 EDT
Investigating
The Network Operations Center is investigating a possible service disruption in Atlanta. More details will be provided in updates to follow within the next 30 minutes.
Posted Oct 16, 2018 - 11:02 EDT
This incident affected: DaaS (DaaS - Atlanta, GA), Network (Network - Atlanta, GA), and IaaS (IaaS - Atlanta, GA).