Overview
At the times listed above, services operating on the Servers Australia legacy cloud platform experienced an extended service interruption, during which they could not be accessed or ‘powered on’ until functionality was restored. While initial investigations suggested that the impact was limited to a single active hypervisor node that had experienced a fault, the cause of this incident has now been linked to a bug related to the removal of an unused backup node from the legacy platform. During the removal of this backup node, two active hypervisor nodes on the platform experienced a kernel panic, taking cloud services on the platform offline. Due to the rare nature of this bug, Servers Australia decided early in the troubleshooting process to involve the vendor of the platform on which the legacy environment operates. The platform vendor assisted the Servers Australia team with manual inspection of affected services, some of which required a repair because of how the platform distributes storage between nodes.
Timeline
10:40 - Hardware Engineers commence the planned shutdown and physical removal of the decommissioned backup nodes whilst onsite.
10:43 - First monitoring reports are received in relation to affected legacy cloud services. A status notification is raised on status.mysau.com.au to notify impacted clients. Internal investigation into the cause of the impact begins.
10:55 - Engineers confirm that, following the unanticipated kernel panic of two active hypervisor nodes, some integrated disk repairs are needed because of how storage is distributed on this platform.
11:23 - Contact is made with the platform vendor for assistance with investigating the root cause of this issue and for restoring functionality for impacted services by repairing integrated disks.
11:50 - As work is coordinated with the platform vendor, impacted services begin to come online. Due to the nature of the incident, the platform vendor must manually run health checks on virtual disks before each service is brought online. Intermittently between 11:50 and 21:57, services are brought online with the assistance of the platform vendor once their disks have been inspected and, if required, repaired.
21:57 - All managed legacy cloud services in our monitoring system report as healthy and online after virtual disk inspection work is completed.
Resolution
Due to the nature of this incident and the unanticipated impact on some virtual disks on the platform, we required the cooperation of the platform vendor to diagnose the cause of the issue further and seek an immediate resolution. As mentioned above, the cause was found to be tied to the planned removal of unused backup hypervisor nodes, which caused an unexpected interruption to active hypervisor nodes. This interruption in turn affected the distributed storage layer of the platform.
Further Action
As this is a legacy environment, Servers Australia has committed to working with all current customers utilising the older platform to offer the opportunity to migrate to our new VMware-powered Cloud Server platform. The benefits of the newer platform are:
We intend to provide our clients with a 90-day migration window in which you can operate the original legacy service alongside the new VMware-powered Cloud Servers, without any additional cost to you during the migration period. This is intended to give you time to try out, test, and migrate to the new platform.