Summary of Incident
As part of the ongoing migration activities in the Atlanta data center, Green Cloud expanded a SAN’s capacity to accommodate data moved from another storage device that was scheduled to be physically relocated. The volume of migrated data eventually overloaded the destination SAN’s compute capabilities, even though telemetry indicated the device was well under the vendor’s stated acceptable limits for IOPS, storage utilization, VM count, and performance reserves. The storage vendor’s root cause analysis for the incident indicated a design flaw in the algorithms used for this telemetry. Green Cloud temporarily reversed the migration and is augmenting storage capacity with an additional SAN to complete the migration.
Root Cause Analysis
Excerpt from storage vendor’s Incident Summary:
“About 50% of the workflow is composed of <8k block size, which was negatively impacting performance of the VMstore, due to the rate at which the internal queues are serviced for this type of workload. Comparing the current [data] with historical benchmarks demonstrates limitations in the performance reserve calculations of the VMstore. At the time of the observed latencies, the workload present was consuming the full capacity of the VMstore, however the [algorithm] reported 70-80% utilization.
It appears a reading of 65% utilization translates to almost 100% utilization on [this model].”
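To make the vendor's last point concrete, the following is a minimal sketch of how a reported reading might be rescaled to an estimated true utilization. It assumes the relationship is linear, with a reported 65% corresponding to 100% actual load; the vendor excerpt does not state the shape of the relationship, so the linear model, the function name, and the parameter name are all illustrative assumptions.

```python
def effective_utilization(reported_pct: float,
                          saturation_reading_pct: float = 65.0) -> float:
    """Estimate true utilization from the reported reading.

    ASSUMPTION: linear scaling, where a reported reading equal to
    saturation_reading_pct (65% per the vendor excerpt) means the
    device is fully utilized. The real relationship may differ.
    """
    if reported_pct < 0 or saturation_reading_pct <= 0:
        raise ValueError("readings must be non-negative and the limit positive")
    # Cap at 100%: readings above the saturation point are treated as full load.
    return min(100.0, reported_pct / saturation_reading_pct * 100.0)


if __name__ == "__main__":
    for reading in (32.5, 65.0, 75.0):
        print(f"reported {reading}% -> estimated {effective_utilization(reading)}%")
```

Under this assumed model, the 70-80% readings cited in the excerpt would already sit at or beyond the device's full capacity.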
Remediation
Green Cloud has rebalanced the DaaS workloads in Atlanta supported by this model of SAN so that each device remains well beneath the 65% performance reserves metric, and is reviewing the environments in other data centers against the same expectation.
The Network Management System and related processes will also be adjusted to alert when the new threshold is approached or exceeded.
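The kind of threshold check described above could look roughly like the sketch below. It is not Green Cloud's actual Network Management System configuration: the 65% limit comes from the vendor excerpt, while the 15-point warning margin, the device names, and all identifiers are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

SATURATION_READING_PCT = 65.0  # from the vendor excerpt: ~100% actual load
WARNING_MARGIN_PCT = 15.0      # hypothetical early-warning margin


@dataclass
class Reading:
    device: str          # hypothetical device name
    reported_pct: float  # performance reserves metric as reported by the SAN


def classify(reading: Reading,
             limit: float = SATURATION_READING_PCT,
             margin: float = WARNING_MARGIN_PCT) -> str:
    """Return 'critical', 'warning', or 'ok' for one device reading."""
    if reading.reported_pct >= limit:
        return "critical"  # threshold reached or exceeded
    if reading.reported_pct >= limit - margin:
        return "warning"   # approaching the threshold
    return "ok"


def devices_needing_attention(readings: list[Reading]) -> dict[str, str]:
    """Map each device that is not 'ok' to its alert level."""
    levels = {r.device: classify(r) for r in readings}
    return {device: level for device, level in levels.items() if level != "ok"}


if __name__ == "__main__":
    sample = [Reading("san-a", 40.0), Reading("san-b", 55.0), Reading("san-c", 72.0)]
    print(devices_needing_attention(sample))
```

A warning band below the hard limit is one way to honor the "well beneath 65%" goal, since it flags a device before it reaches the reading the vendor equates with full utilization.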
The storage vendor has expedited delivery of an additional SAN (a top-tier model with 3x the VM capacity and application density) to accommodate the remaining migrations in the Atlanta data center.
Green Cloud will fully review the utility and warranty of this particular SAN model for all services, and will adjust the capacity management plan and supply chain accordingly.
Finally, Green Cloud is accelerating plans to onboard and test alternative storage vendors.