Service interruption to some legacy cloud services
Incident Report for Servers Australia
Postmortem

Overview

At the times listed in this report, services operating on the Servers Australia legacy cloud platform experienced an extended service interruption and could not be accessed or ‘powered on’ until functionality was restored. Initial investigations indicated that the impact was limited to a single active hypervisor node that had experienced a fault. The cause has since been linked to a bug triggered by the removal of an unused backup node from the legacy platform. During the removal of this backup node, two active hypervisor nodes experienced a kernel panic, which took cloud services on the platform offline. Because of the rare nature of this bug, Servers Australia decided early in the troubleshooting process to involve the vendor of the platform on which the legacy environment operates. The platform vendor assisted the Servers Australia team with manual inspection of affected services, some of which required a repair because of how the platform distributes storage between nodes.

Timeline

10:40 - Hardware Engineers commence the planned shutdown and physical removal of the decommissioned backup nodes whilst onsite. 
10:43 - First monitoring reports are received in relation to affected legacy cloud services. A status notification is raised to status.mysau.com.au to notify impacted clients. Internal investigation of the cause of the impact begins. 
10:55 - Engineers confirm that, following the unanticipated kernel panic of two active hypervisor nodes, some integrated disk repairs are needed because of how storage is distributed on this platform.
11:23 - Contact is made with the platform vendor for assistance with investigating the root cause of this issue and for restoring functionality for impacted services by repairing integrated disks. 
11:50 - As work is coordinated with the platform vendor, impacted services begin to come online. Due to the nature of the incident, the platform vendor must manually run health checks on virtual disks before each service is brought online. Intermittently between 11:50 and 21:57, services are brought online with the assistance of the platform vendor once their disks have been inspected and, if required, repaired.
21:57 - All managed legacy cloud services in our monitoring system report as healthy and online after virtual disk inspection works have completed. 

Resolution 

Due to the nature of this incident and the unanticipated impact on some virtual disks on the platform, we required the cooperation of the platform vendor to diagnose the cause of the issue further and seek an immediate resolution. As mentioned above, the cause was found to be tied to the planned removal of unused backup hypervisor nodes, which caused an unexpected interruption to active hypervisor nodes. This interruption in turn affected the distributed storage layer of the platform.

Further Action  

As this is a legacy environment, Servers Australia has committed to working with all current customers utilising the older platform to offer the opportunity to migrate to our new VMware-powered Cloud Server platform. The benefits of the newer platform are:

  • An Enterprise-backed Solution Powered by VMware vSAN Hyperconverged Storage (HCI), vCenter & vCloud Director.
  • HPE Gen10 Enterprise Servers with 12G Storage Performance and 40GbE of Networking Capacity per server.
  • Dual Intel Xeon Gold 6226R Processors with Turbo Boost as high as 3.9GHz.
  • Our VMware vSAN storage platform will now be powered entirely by NVMe Drives, which are 10x faster than SSD technology.
  • Dual Data Centre Availability. 

We intend to provide our clients with a 90-day migration window in which you can operate the original legacy service alongside the new VMware-powered Cloud Servers, without any additional cost to you during the migration period. This is intended to give you time to try out, test, and migrate to the new platform.

Posted May 07, 2021 - 17:28 AEST

Resolved
This incident has been resolved.
Posted May 07, 2021 - 17:24 AEST
Monitoring
Engineers have now finalised the bulk of work with our platform vendor and all managed and monitored services are reporting as online. Further analysis is underway and we intend to provide a report next week detailing the reason for this interruption. If you believe your legacy cloud service is still impacted by this incident, please contact our Support team via the MySAU portal or directly via our contact number on 1300 788 862 and we can investigate this with you further.
Posted Apr 30, 2021 - 21:57 AEST
Update
We have restored access to the vast majority of cloud VMs and are working to bring the remaining services online as soon as possible.
Posted Apr 30, 2021 - 20:43 AEST
Update
The team have finalised another set of services manually, which are now online. We are now working to finalise a small number of clients with affected services, and hope to have these finalised within the next hour.
Posted Apr 30, 2021 - 17:25 AEST
Update
Our Engineers are continuing to work closely with the platform vendor. Unanticipated delays due to the size and complexity of the remaining services have slowed the works, but they are continuing at a steady rate, with services coming back online regularly after their analysis. We greatly appreciate the cooperation of all remaining impacted clients.
Posted Apr 30, 2021 - 16:15 AEST
Update
The platform vendor is continuing to assess individual services to assist with restoration efforts. This has been hampered in some instances by the size of virtual disks on the platform. As each service is reviewed, it is being brought online by the team. We again appreciate your cooperation and understanding during this time.
Posted Apr 30, 2021 - 14:49 AEST
Update
Engineers have isolated the cause of disruption for the remaining offline services, and are still working with the platform provider to finalise the restoration of these services. We thank you for your understanding at this time.
Posted Apr 30, 2021 - 13:31 AEST
Update
Engineers are still working directly with the platform vendor to restore the remaining impacted services, as some services require individual inspection before being brought online.
Posted Apr 30, 2021 - 12:46 AEST
Update
Engineers have begun works to bring online the remaining impacted services tied to this incident. There may be a delay for some of the remaining services, but we intend to have them all brought online as soon as possible.
Posted Apr 30, 2021 - 12:03 AEST
Update
Servers Australia Engineers are working directly with the platform vendor to analyse the situation further. While these works are underway, some cloud services may continue to be impacted. We will provide a further update once these works are complete and the remaining impacted services can be restored.
Posted Apr 30, 2021 - 11:37 AEST
Identified
Engineers have collected further information relating to the cause of disruption and are now assessing the environment. Further work is currently underway to bring the impacted services online as a top priority.
Posted Apr 30, 2021 - 11:22 AEST
Update
Engineers are continuing to actively review affected services and are collecting data on the reason for the error, to assist with restoring connectivity for the impacted services.
Posted Apr 30, 2021 - 11:06 AEST
Investigating
Engineers have received monitoring alerts that a single host in our legacy cloud server platform has reported an error, and applicable servers on this host may experience a service interruption or ongoing degradation. No cloud services on our new cloud platform have been impacted.

A full investigation is underway and further updates will be provided shortly.
Posted Apr 30, 2021 - 10:47 AEST
This incident affected: Services (Cloud Servers).