At 3:57pm Eastern, an NVDIMM hardware error was identified on one of the SANs that supports Green Cloud DaaS services in our Atlanta datacenter. This caused increased storage write latency during peak usage times. The hardware error was not reported by the SAN alert engine as expected. Green Cloud immediately opened a Severity 1 case with the SAN vendor. To improve service, the vendor recommended failing over to controller B after a health check of that controller.
The failover was initiated at 4:58pm but did not perform as expected. When the failover process completed, storage latency had increased to a level at which workload file systems were unable to function. Root cause analysis of this impact is still ongoing with our storage vendor.
Continuing to work with our vendor, Green Cloud immediately turned its efforts toward getting controller B operating as expected while dispatching replacement hardware for controller A. Each troubleshooting step required careful consideration because of the time needed to perform reboots and verify the state of core controller components. After extensive troubleshooting, diagnostic utilities, and log analysis proved unsuccessful, a soft reboot of controller A was initiated. The soft reboot did not clear the errors, and a physical hardware reset of controller A was likewise unsuccessful at clearing the errors and restoring operation.
The storage vendor strongly advised that, before replacing controller A, it was best practice to have controller B in a fully functional state so that the new controller could sync its database from the running controller. Skipping this step and bringing two new controllers online in the unit would have been a very risky process. Out of an abundance of caution to ensure customer data integrity, additional troubleshooting and verification steps were taken to investigate the degraded performance and high CPU load on the active controller B before taking more intrusive steps. A physical hardware reset was performed on controller B in an attempt to bring it back into normal operation, but the controller did not come back online, failing to boot into its OS entirely. A second hardware dispatch request, for controller B, was initiated.
At this point, controller A was booted but, due to the NVDIMM errors, could not properly mount the file system, and controller B had failed. Replacing either controller was not deemed safe enough, so troubleshooting efforts turned toward finding a way to boot controller A temporarily, long enough to allow controller B to be replaced. Since the NVDIMM component was preventing a proper boot, efforts focused on finding a way to either safely replace it or bypass the hardware error. The NVDIMM is not considered a field-replaceable part by our storage vendor, but after consultation, additional vendor platform engineers put together a safe procedure for moving the good NVDIMM component from the failed controller B into controller A. After booting successfully, controller A took over as primary and began functioning as intended at 2:05am.
The SAN was then extensively checked for integrity. Once those checks were complete, the first replacement controller was slotted as controller B. Controller B completed its configuration sync and firmware updates, and a standard controller failover was then performed successfully, bringing the unit back into a redundant and performant state.
Green Cloud’s Support and Cloud Infrastructure teams began proactive service recovery and validation efforts to verify the state of affected workloads while the second replacement controller was en route. First, all affected DaaS administration components were rebooted. Then, the individual consoles of the DaaS administration components were opened to perform file system checks and initiate any additional reboots as necessary. Connectivity to the external URLs of individual DaaS tenants was verified where possible. During this troubleshooting, it was noted that any desktop showing a VMware Tools error needed to be rebooted manually to properly register with the DaaS tenant appliances. Reboots of all desktops showing this error were completed by 10:24am. The second replacement controller arrived on-site and replaced the original controller A, with all vendor hardware verification steps complete at 10:35am.
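For illustration only, the following is a minimal sketch of how a remediation pass like the desktop reboots described above could be scripted against vCenter using pyVmomi. It is not the exact tooling Green Cloud used; the hostname, credentials, and selection criteria are placeholder assumptions. The sketch finds powered-on desktops whose VMware Tools are not running and hard-resets them so they can re-register with the tenant appliances.

    # Hypothetical sketch only: placeholder vCenter host and credentials.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # assumes lab-style certificates
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="CHANGE_ME", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        # Enumerate all virtual machines in the inventory.
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            powered_on = vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOn
            tools_down = vm.guest.toolsRunningStatus != "guestToolsRunning"
            if powered_on and tools_down:
                print(f"Resetting {vm.name} (Tools status: {vm.guest.toolsRunningStatus})")
                vm.ResetVM_Task()  # hard reset; a guest-OS reboot requires working Tools
        view.DestroyView()
    finally:
        Disconnect(si)

In practice such a script would be run against only the affected tenant folders or resource pools rather than the entire inventory, and each reset would be confirmed before proceeding.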