At 3:57pm Eastern, an NVDIMM hardware error was identified on one of the SANs that supports Green Cloud DaaS services in our Atlanta datacenter. This caused increased storage write latency during peak usage times. The hardware error was not reported by the SAN alert engine as expected. Green Cloud immediately opened a Severity 1 case with the SAN vendor. To improve service, the vendor recommended failing over to controller B after a health check of that controller.
The failover was initiated at 4:58pm but did not perform as expected. When the failover process completed, storage latency had increased to a level at which workload file systems were unable to function. Root cause analysis of this impact is still ongoing with our storage vendor.
Continuing to work with our vendor, Green Cloud immediately turned its efforts toward getting controller B operating as expected while dispatching replacement hardware for controller A. Each troubleshooting step required careful consideration because of the time needed to perform reboots and verify the state of core controller components. After extensive troubleshooting, diagnostic utilities, and log analysis proved unsuccessful, a soft reboot of controller A was initiated. The soft reboot did not clear the errors, and a physical hardware reset of controller A was likewise unsuccessful at clearing the errors and restoring operation.
The storage vendor strongly advised that, before replacing controller A, it was best practice to have controller B in a fully functional state so that the new controller could sync its database from the running controller. Skipping this step and bringing two new controllers online in the unit would have been a very risky process. Out of an abundance of caution to ensure customer data integrity, additional troubleshooting and verification steps were taken to investigate the degraded performance and high CPU load on the active controller B before taking more intrusive steps. A physical hardware reset was performed on controller B in an attempt to bring it back into normal operation, but the controller did not come back online, failing to boot into its OS entirely. A second hardware dispatch request, for controller B, was initiated.
At this point, controller A was booted but, due to the NVDIMM errors, could not properly mount the file system, and controller B had failed. Replacing either controller was not deemed safe enough, so troubleshooting efforts turned toward finding a way to boot controller A temporarily, long enough to allow controller B to be replaced. Since the NVDIMM component was preventing a proper boot, efforts focused on finding a way to either safely replace it or bypass the hardware error. The NVDIMM is not considered a field-replaceable part by our storage vendor, but after consultation, additional vendor platform engineers put together a safe procedure for moving the good NVDIMM component from the failed controller B into controller A. After booting successfully, controller A took over as primary and began functioning as intended at 2:05am.
The SAN was then extensively checked for integrity. Once those checks were complete, the first replacement controller was slotted as controller B. Controller B completed its configuration sync and firmware updates, and a standard controller failover was then performed successfully, bringing the unit back into a redundant and performant state.
Green Cloud’s Support and Cloud Infrastructure teams began proactive service recovery and validation efforts to verify the state of affected workloads while the second replacement controller was en route. First, all affected DaaS administration components were rebooted. Then, the individual consoles of the DaaS administration components were opened to perform file system checks and initiate any additional reboots as necessary. Connectivity to the external URLs of individual DaaS tenants was verified where possible. During this troubleshooting, it was noted that any desktop showing a VMware Tools error needed to be rebooted manually to properly register with the DaaS tenant appliances. Reboots of all desktops showing this error were completed by 10:24am. The second replacement controller arrived on-site and replaced the original controller A, with all vendor hardware verification steps complete at 10:35am.
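For illustration only, the following is a minimal sketch of how a remediation pass like the desktop reboots described above could be scripted against vCenter using pyVmomi. It is not the exact tooling Green Cloud used; the hostname, credentials, and selection criteria are placeholder assumptions. The sketch finds powered-on desktops whose VMware Tools are not running and hard-resets them so they can re-register with the tenant appliances.

    # Hypothetical sketch only: placeholder vCenter host and credentials.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # assumes lab-style certificates
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="CHANGE_ME", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        # Enumerate all virtual machines in the inventory.
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            powered_on = vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOn
            tools_down = vm.guest.toolsRunningStatus != "guestToolsRunning"
            if powered_on and tools_down:
                print(f"Resetting {vm.name} (Tools status: {vm.guest.toolsRunningStatus})")
                vm.ResetVM_Task()  # hard reset; a guest-OS reboot requires working Tools
        view.DestroyView()
    finally:
        Disconnect(si)

In practice such a script would be run against only the affected tenant folders or resource pools rather than the entire inventory, and each reset would be confirmed before proceeding.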