On 17 March 2021, we experienced an extended network disruption in specific New South Wales facilities affecting clients that do not have dual-site redundant services. Clients that do have dual-site redundancy experienced only a brief disruption.
The issue was caused by a software bug observed in multiple switches within our core network; the bug unexpectedly surfaced after around 500 days of uptime. Regrettably, we were unable to contain this disruption or resolve it as quickly as expected, as multiple factors slowed the path to a resolution.
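The exact trigger inside the firmware is not detailed in this report, so the following is illustrative only: uptime-related firmware faults are often traced to fixed-width internal counters wrapping around, and the short calculation below, assuming a hypothetical 32-bit counter that ticks every 10 ms, shows why such a fault would surface after roughly 497 days, consistent with the "around 500 days" observed here. This is not the confirmed mechanism for this incident.

```python
# Illustrative calculation only: the specific defect behind this incident
# is not detailed in this report. Uptime-related firmware faults are
# often caused by a fixed-width counter wrapping; a hypothetical 32-bit
# counter that ticks every 10 ms wraps close to the ~500 days observed.

TICK_SECONDS = 0.01          # assumed 10 ms tick interval
MAX_TICKS = 2 ** 32          # capacity of an unsigned 32-bit counter
SECONDS_PER_DAY = 86_400

wrap_days = MAX_TICKS * TICK_SECONDS / SECONDS_PER_DAY
print(f"A 32-bit 10 ms counter wraps after {wrap_days:.1f} days of uptime")
# Output: A 32-bit 10 ms counter wraps after 497.1 days of uptime
```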
The impacted switches operate in a stacked configuration and are designed to fail over in the event of an issue. Because the issue was software related, the secondary member exhibited the same bug when the stack failed over to it. As a result, the switching stack refused to pass some traffic for multiple clients.
The switching hardware vendor was contacted for further review. No root cause was identified from the initial investigations, so an in-depth diagnosis by the vendor was required. While the vendor continued diagnosing, we began re-routing traffic around the affected switching stack; this took time for some clients due to their setups or configurations. All of this work required manual intervention from our network engineers: the network is redundant by design, but the firmware bug meant that redundancy was not functioning as expected at the time.
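To give a sense of what this manual re-routing involves, the sketch below uses the Netmiko library to raise the routing cost of a link facing an affected switching stack so that traffic prefers an alternate site, then checks the resulting route. The device type, hostname, credentials, interface and prefix are hypothetical placeholders and do not reflect our actual equipment or the exact changes made during this incident.

```python
# A minimal sketch of a manual traffic diversion, assuming a Cisco-style
# CLI and the Netmiko library. All names below (device type, host,
# interface, prefix) are hypothetical placeholders for illustration;
# they do not describe the equipment or changes used in this incident.
from netmiko import ConnectHandler

EDGE_DEVICE = {
    "device_type": "cisco_ios",       # assumed CLI flavour
    "host": "edge-sy3.example.net",   # hypothetical device
    "username": "netops",
    "password": "********",
}

# Raise the IGP cost of the uplink that faces the faulty stack so that
# traffic prefers the alternate path.
divert_commands = [
    "interface TenGigabitEthernet1/0/1",  # hypothetical uplink
    "ip ospf cost 65000",
]

with ConnectHandler(**EDGE_DEVICE) as conn:
    conn.send_config_set(divert_commands)
    conn.save_config()
    # Confirm the preferred next hop has moved off the affected path.
    print(conn.send_command("show ip route 203.0.113.0 255.255.255.0"))
```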
The vendor determined that the only viable way to temporarily resolve the issue was to reload the switching stack, which would cause a short outage during the reload window. Because traffic was still actively traversing connected equipment, we could not reload the switching stack in Equinix SY1 immediately without causing widespread impact to clients, along with the potential for data corruption or data loss.
Final mitigations were implemented where possible for the remaining impacted clients with specific setups, and a plan was finalised to shift the traffic traversing the faulty switching stack. Once the new paths were in place and traffic had been shifted, a reload was performed at approximately 10:30 pm, as advised by the vendor. Network engineers confirmed services were operational and began reverting the remediation works and re-routing that had been implemented.
After further debugging, the vendor confirmed that the cause of this disruption was a bug within the software used on our switching stacks. The bug caused stack members to kernel panic, leaving the entire stack in a degraded state that could not be resolved by manual failover. Now that the root cause is known, we have already begun an internal plan to migrate critical services away from this infrastructure.
The largest group of impacted clients were using our legacy firewall platform. This platform resides in a single datacentre and required a large amount of manual work to shift traffic. To ensure these services are not impacted again by a single-zone failure, we will be migrating all legacy firewall services to our new Dual Zoned VMware NSX firewalls.
This migration will take place within the next 90 days at no additional or ongoing cost to the relevant clients; migration costs will be waived. The move will provide full datacentre and network redundancy to all clients previously on the legacy firewall platform, and further information will be provided to the affected clients in the coming days.
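As a simple illustration of what dual-zone redundancy provides, the sketch below checks that a service answers through two independent zone endpoints; if one zone is unreachable, the other should still respond. The hostnames are hypothetical placeholders rather than actual platform addresses.

```python
# A minimal sketch of a dual-zone reachability check. The endpoints are
# hypothetical placeholders, not actual platform addresses.
import socket

SERVICE_PORT = 443
ZONE_ENDPOINTS = {
    "zone-a": "fw-zone-a.example.net",  # hypothetical
    "zone-b": "fw-zone-b.example.net",  # hypothetical
}

def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for zone, host in ZONE_ENDPOINTS.items():
    status = "reachable" if reachable(host, SERVICE_PORT) else "UNREACHABLE"
    print(f"{zone}: {host} is {status}")
```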
To safeguard against this happening again, and to provide greater redundancy to any client impacted during this outage, we will be performing the below tasks as a matter of priority:
Updates will be provided to any client that may be affected by the above works, and all work will be performed during scheduled maintenance windows.
We sincerely apologise for any issues caused. We are working hard to provide all clients with a reliable, issue-free service, and to be transparent when an issue does occur.
Below is a complete timeline of events for the incident on 17 March 2021.
05:32 - Initial alerts for interruption sent to on-call engineers and troubleshooting commenced.
05:50 - Engineers identify an issue with a switching stack in SYD2 between active/standby members.
06:15 - Engineers commence a switching stack re-balance, which did not resolve the issue. A reload of both stack members was then issued to bring services back online, causing a brief five-minute service interruption to SYD2-based services while the stack members came back online.
06:20 - Services confirmed operational. Engineers begin monitoring and open an initial investigation into the unexpected event.
08:49 - Initial alerts for interruption sent to engineers.
08:55 - Engineers commence troubleshooting to identify the root cause and determine whether it is related to the earlier incident at Syncom SYD2.
09:15 - Initial investigations isolated the source of the issue to SY1.
09:21 - NOC team commences initial remediation works. Traffic diverted to alternate paths via SY3. Services dependent on SY1 remain affected.
10:07 - Root cause identified as an issue with our core switching stack in SY1.
10:12 - Engineers establish that a number of operational aggregated services connected to the core switching stack would be impacted if drastic remediation were taken. Because of this, the re-balance and reload option used earlier in SYD2 was not immediately possible.
10:30 - Engineers relay information regarding the core switching stack to the network hardware vendor in order to identify a non-disruptive stack re-balancing method and avoid the complete switching stack reload undertaken in SYD2.
10:40 - Engineers begin testing the non-disruptive stack re-balancing method provided by the hardware vendor.
10:44 - The re-balancing method is reviewed and confirmed successful in the testing environment. Engineers prepare to re-balance the SY1 switching stack.
10:51 - Re-balancing of SY1 switching stack is initiated. Engineers begin to monitor.
10:55 - Engineers confirm via monitoring that the re-balancing method was unsuccessful; rollback procedures are initiated.
11:00 - Rollback procedures are completed. Monitoring shows additional services now impacted, and the scope of the impact is analysed. The MySAU Portal team confirms that access to the portal is slowed due to impacted components.
11:05 - Engineers commence diverting traffic from services in SY4 around the affected site.
11:09 - All SY4 traffic confirmed diverted.
11:20 - Engineers begin to analyse switch stack configuration in an attempt to further diagnose the nature of the problem.
11:42 - A forwarding problem affecting the impacted switching stack is confirmed.
11:47 - Further discussion is held with the hardware vendor to assist with isolating the issue.
12:11 - Engineers identify that some traffic is still entering the affected site from external providers; further action is taken to prevent inbound traffic entering via SY1.
12:15 - The inbound traffic re-routing is confirmed to have resolved some issues.
13:05 - Engineers commence diverting traffic from SYD2 via SY3.
13:07 - Engineers complete traffic diversion for SYD2.
13:16 - Emergency spare equipment with compatible firmware from the network hardware vendor is prepared for onsite engineers to install if required.
13:17 - Investigations continue with the network hardware vendor in relation to the switching stack.
13:45 - Technician dispatched to the SY1 facility with SLA hardware as a precaution.
13:46 - Discussions with the network hardware vendor continue; the vendor is actively working on the impacted equipment via remote console.
13:56 - Discussion begins on how to most efficiently mitigate any impact on aggregated services that traverse the SY1 networking infrastructure.
14:31 - Work with the network hardware vendor continues, and internal work begins to reduce congestion and alleviate delays in accessing the MySAU portal.
14:53 - Some connections between SY1 and other sites are restored through manual workarounds and changes.
15:15 - Engineers identify and commence diverting traffic from 5GN SDC via SY3.
15:18 - Engineers complete traffic diversion for 5GN SDC.
15:47 - Manual re-routing is conducted for specific impacted services where possible.
15:54 - Additional services are brought online via manual intervention.
16:11 - The network hardware vendor’s investigation is ongoing via remote console.
16:38 - While awaiting the vendor, teams prepare workflows for potential outcomes that may impact aggregated services and bring more services online manually where possible.
17:44 - The network hardware vendor exhausts all other remediation options and indicates that the only possible resolution is to apply a configuration change and perform a reload of the switching stack, which will impact services that rely on this site.
18:00 - Engineering teams enact the prepared workflow for this scenario and begin further remediation works for aggregated services that will be impacted due to the reload.
18:19 - Mitigation works commence for some client services, including Private Cloud (HADRaaS) and other clients with multi-zone configurations, to ensure no connectivity is lost when the switching stack is reloaded.
18:47 - Remediation works continue, with temporary networking configuration deployed to mitigate the potential impact.
19:58 - Onsite engineers begin physical cabling work to further mitigate the potential impact on aggregated services.
20:46 - Physical cabling is complete, network engineers review and begin configuration changes to prepare for cutover of the aggregated services.
22:13 - A core component of aggregated services completes cutover, providing the green light to commence the reload of the switching stack.
22:27 - Engineers commence a reload of the switching stack in cooperation with the network hardware vendor.
22:35 - The switching stack is successfully reloaded and returned to service. External monitoring reports impacted services online.
22:43 - Engineers begin reverting the remediation works to return services to their optimal paths.
23:18 - All network paths are restored to service and temporary bypasses removed.