ATL DaaS Latency
Incident Report for Green Cloud Defense
Postmortem

Summary of Incident

As part of the ongoing migration activities in the Atlanta data center, Green Cloud expanded a SAN’s capacity to accommodate data moved from another storage device that had been scheduled for physical relocation. The volume of the migrated data eventually overloaded the destination SAN’s compute capabilities, even though telemetry indicated the device was well under the vendor’s stated acceptable limits for IOPS, storage utilization, VM count, and performance reserves. The storage vendor’s root cause analysis for the incident identified a design flaw in the algorithms used for this telemetry. Green Cloud temporarily reversed the migration and is augmenting storage capacity with an additional SAN to complete the migration.

Root Cause Analysis

Excerpt from storage vendor’s Incident Summary:

“About 50% of the workflow is composed of <8k block size, which was negatively impacting performance of the VMstore, due to the rate at which the internal queues are serviced for this type of workload. Comparing the current [data] with historical benchmarks demonstrates limitations in the performance reserve calculations of the VMstore. At the time of the observed latencies, the workload present was consuming the full capacity of the VMstore, however the [algorithm] reported 70-80% utilization.

It appears a reading of 65% utilization translates to almost 100% utilization on [this model].”

Remediation

Green Cloud has re-balanced the DaaS workloads in Atlanta supported by this model of SAN to remain well beneath the 65% performance reserves metric per device, and is reviewing the environment in other data centers with the same expectation.

The Network Management System and related processes will also be adjusted to alert when the new threshold is approached or exceeded.
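As an illustration only, a threshold check of this kind could look like the following Python sketch. The 65% ceiling is the only value taken from this report; the device names, the sample readings, and the 10-point early-warning margin are hypothetical, and this is not the actual Network Management System configuration.

```python
from dataclasses import dataclass

# Ceiling from the remediation above: stay beneath 65% on the reported
# performance reserves metric for this SAN model.
REPORTED_CEILING_PCT = 65.0
# Hypothetical early-warning margin; not a value from the incident report.
WARNING_MARGIN_PCT = 10.0


@dataclass
class SanReading:
    name: str                 # hypothetical device label
    reported_util_pct: float  # utilization as reported by device telemetry


def classify(reading: SanReading) -> str:
    """Return 'ok', 'warning', or 'critical' for one telemetry reading."""
    if reading.reported_util_pct >= REPORTED_CEILING_PCT:
        return "critical"
    if reading.reported_util_pct >= REPORTED_CEILING_PCT - WARNING_MARGIN_PCT:
        return "warning"
    return "ok"


# Made-up sample readings, for demonstration only.
readings = [
    SanReading("atl-san-01", 42.0),
    SanReading("atl-san-02", 58.5),
    SanReading("atl-san-03", 72.0),
]
for r in readings:
    print(f"{r.name}: {r.reported_util_pct:.1f}% reported -> {classify(r)}")
```

The check compares against the reported value rather than a corrected one, because the vendor excerpt only relates a 65% reading to near-full utilization and does not give a general conversion.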

The storage vendor has expedited delivery of an additional SAN (a top-tier model with 3x the VM capacity and application density) to accommodate the remaining migrations in the Atlanta data center.

Green Cloud will fully review the utility and warranty of this particular SAN model for all services, and will adjust the capacity management plan and supply chain accordingly.

Finally, Green Cloud is accelerating plans to onboard and test alternative storage vendors.

Posted Mar 21, 2019 - 11:23 EDT

Resolved
Following the rebalance process completed this weekend, Green Cloud network operations center has seen no further recurrences of the latency issue for Atlanta DaaS users. At this time, should you have issues with this service, please report them individually to Support at 877-465-1217 or support@gogreencloud.com.
Posted Mar 18, 2019 - 18:29 EDT
Update
Green Cloud engineers have completed rebalancing workloads across several SANs that support our DaaS services in the Atlanta data center. At this time, the overall latency performance metric has been consistently within acceptable thresholds across the environment. The network operations center will continue to monitor performance and will address recurrences should they arise. If you have any questions or concerns, please contact support at 877-465-1217 or support@gogreencloud.com.
Posted Mar 16, 2019 - 15:04 EDT
Monitoring
After working with our storage vendor's engineering team throughout the evening, Green Cloud is continuing to rebalance the workloads in Atlanta to mitigate the latency issue impacting DaaS users. The network operations center has seen overall performance return to expected levels within the last hour. Additional updates will be made throughout the weekend if thresholds are once again exceeded. If you have any questions or concerns, please contact support at 877-465-1217 or support@gogreencloud.com.
Posted Mar 16, 2019 - 00:15 EDT
Investigating
High latency has returned and is far above the levels expected. We are currently engaging our storage vendor for a solution. If you have any questions or concerns, please contact support at 877-465-1217 or support@gogreencloud.com.
Posted Mar 15, 2019 - 20:58 EDT
Update
Green Cloud has seen the overall latency metrics for Atlanta DaaS subscribers stay within acceptable levels throughout the majority of the day today, and the NOC is continuing to monitor. Additional updates will be made throughout the evening and weekend only if thresholds are once again exceeded and performance degrades further. If you have any questions or concerns, please contact support at 877-465-1217 or support@gogreencloud.com.
Posted Mar 15, 2019 - 17:40 EDT
Update
Green Cloud has seen overall latency metrics maintain acceptable levels over the last two hours, and the network operations center is continuing to monitor the entire environment to address recurrences or additional impacts should they arise. If you have any questions or concerns, please contact support at 877-465-1217 or support@gogreencloud.com.
Posted Mar 15, 2019 - 14:13 EDT
Update
Green Cloud is continuing with the rebalance actions for DaaS storage in our Atlanta data center to help mitigate individual user impact; this is concurrent with the storage vendor's investigation into the root cause. Overall latency metrics have improved over the last two hours but are not yet consistent. The network operations center is continuing to monitor the entire environment to address these recurrences or additional impacts should they arise. If you have any questions or concerns, please contact support at 877-465-1217 or support@gogreencloud.com.
Posted Mar 15, 2019 - 11:56 EDT
Update
This morning, as many desktop users return to the business day, we have seen latency on the impacted storage increase above expected thresholds. We have again escalated the high severity incident with our vendor. The network operations center is continuing to monitor the entire environment to address these recurrences or additional impacts should they arise. If you have any questions or concerns, please contact support at 877-465-1217 or support@gogreencloud.com.
Posted Mar 15, 2019 - 09:56 EDT
Update
Green Cloud engineers have completed the initial rebalance actions to mitigate the storage latency issues reported for DaaS users in the Atlanta data center. Since approximately 7:30pm Eastern time, overall latency metrics have been within expected thresholds across the environment. The network operations center is continuing to monitor and will address recurrences or additional impacts should they arise. We will continue to work with our storage vendors to determine root cause. If you have any questions or concerns, please contact support at 877-465-1217 or support@gogreencloud.com.
Posted Mar 14, 2019 - 22:39 EDT
Update
Green Cloud engineers are working to rebalance workload across several SANs supporting our DaaS services in the Atlanta data center. The expectation is that we will continue with this course of action for the remainder of the day and throughout the evening. As the overall workload decreases, the impact of the storage latency will subside for individual users. Status updates will be provided here at least every 2 hours until the incident is considered fully resolved. If you have any questions or concerns, please contact support at 877-465-1217 or support@gogreencloud.com.
Posted Mar 14, 2019 - 15:56 EDT
Update
Green Cloud management is continuing to escalate the high severity incident with our storage vendor. Meanwhile, engineers are focused on any and all courses of action that may decrease the level of impact to DaaS end users in Atlanta. The network operations center is continuing to monitor the entire environment to address recurrences or additional impacts should they arise. If you have any questions or concerns, please contact support at 877-465-1217 or support@gogreencloud.com.
Posted Mar 14, 2019 - 14:41 EDT
Update
Green Cloud engineers are continuing to escalate the high severity incident with our storage vendor to expedite a resolution to the latency issue impacting DaaS users in Atlanta. The network operations center is continuing to monitor the entire environment to address recurrences or additional impacts should they arise. If you have any questions or concerns, please contact support at 877-465-1217 or support@gogreencloud.com.
Posted Mar 14, 2019 - 13:29 EDT
Monitoring
Green Cloud engineers are working closely with our storage vendor to expedite a resolution to the latency issue impacting DaaS users in Atlanta. The network operations center is continuing to monitor the entire environment to address recurrences or additional impacts should they arise. If you have any questions or concerns, please contact support at 877-465-1217 or support@gogreencloud.com.
Posted Mar 14, 2019 - 12:00 EDT
Identified
Green Cloud engineers have isolated the cause of the ATL DaaS latency issues and have engaged the vendor, opening a high severity incident. We are working to mitigate the issue at this time. Additional status updates will be provided as they become available. If you have any questions or concerns, please contact support at 877-465-1217 or support@gogreencloud.com.
Posted Mar 14, 2019 - 10:57 EDT
Investigating
Green Cloud network operations center is aware of reports of storage latency for our ATL DaaS subscribers at this time. Engineers are currently working to identify the underlying cause and engaging vendors as necessary. Updates will be provided as soon as they are available. Please contact support at 877-465-1217 or support@gogreencloud.com with any questions or concerns.
Posted Mar 14, 2019 - 10:37 EDT
This incident affected: DaaS (DaaS - Atlanta, GA).