Partial Network Disruption - Sydney

Incident Report for Servers Australia

Postmortem

Network Disruption - Sydney, NSW Australia

Overview

On the 17th March 2021, we experienced an extended network disruption in specific New South Wales facilities for some clients that do not have dual-site redundant services. There was also a brief disruption for some clients that do have dual-site redundancy.

The issue was caused by a software bug observed in multiple switches within our core network. The bug appeared unexpectedly after around 500 days of uptime. Regrettably, we were unable to control this disruption or resolve it as quickly as expected, as multiple factors delayed the resolution.
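Because the trigger was tied to uptime rather than to load or a configuration change, an uptime check is one way a condition like this might be flagged ahead of time. The following is a minimal Python sketch for illustration only; it is not tooling described in this report. The device names, the 60-day warning margin, and the exact 500-day threshold are assumptions, since the report states only "around 500 days".

```python
from dataclasses import dataclass

# Assumption: the report says "around 500 days"; the exact trigger point is not stated.
UPTIME_TRIGGER_DAYS = 500
# Assumption: an arbitrary warning margin chosen for illustration.
WARNING_MARGIN_DAYS = 60


@dataclass
class SwitchUptime:
    name: str           # hypothetical device label
    uptime_days: float  # uptime as reported by the device


def switches_at_risk(switches, trigger_days=UPTIME_TRIGGER_DAYS,
                     margin_days=WARNING_MARGIN_DAYS):
    """Return names of switches within margin_days of the trigger, or past it."""
    cutoff = trigger_days - margin_days
    return [s.name for s in switches if s.uptime_days >= cutoff]


# Hypothetical inventory: stack members booted together share the same uptime.
inventory = [
    SwitchUptime("stack-a-member-1", 498.5),
    SwitchUptime("stack-a-member-2", 498.5),
    SwitchUptime("stack-b-member-1", 120.0),
]
print(switches_at_risk(inventory))
```

In this hypothetical inventory both members of the same stack are flagged together, which illustrates why failing over to a secondary member with the same uptime and software would not avoid an uptime-triggered bug.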

The impacted switches operate in a stacked setup and are designed to fail over in the event of an issue. Because the issue was software related, the secondary member exhibited the same bug behaviour when the stack failed over to it. This resulted in the switching stack refusing to pass some traffic for multiple clients.

The switching hardware vendor was contacted for further review. No root cause was identified from initial investigations, so an in-depth diagnosis was required by the vendor. Whilst the vendor was diagnosing further, we began re-routing traffic around the affected switching stack. This took time for some clients due to their setups or configuration. Although the network is redundant by design, the firmware bug prevented that redundancy from functioning as expected, so all of this work required manual intervention from the network engineers.

The vendor determined the only viable solution to temporarily resolve the issue was to reload the switching stack and cause a small outage during the reload window. Due to the nature of traffic that was still actively traversing connected equipment, we could not facilitate an immediate reload of the switching stack in Equinix SY1 without causing an immediate, widespread impact to clients and also the potential for corrupted data or data loss occurring for those clients.

Final mitigation was implemented where possible for remaining impacted customers with specific setups, and a plan was finalised to shift the traffic that was traversing the faulty switch stack. Once the new paths were in place and traffic had been shifted, a reload was performed at approximately 10:30 pm as per the advice from the vendor. Network engineers confirmed services were operational and began reverting remediation works and re-routing that had been implemented.

After further debugging, the vendor has confirmed that the cause of this disruption was a bug within the software used on our switching stacks. This bug caused stack members to kernel panic, resulting in a degraded state for the entire stack that could not be resolved by manual failover. Now that the root cause is known, we have already begun planning internally to migrate critical services away from this infrastructure.

The largest group of impacted clients were utilising our legacy firewall platform. This platform resides in a single datacentre and required a large amount of manual work to shift traffic. To ensure that these services are not impacted again due to a single zone failure, we will be migrating all legacy firewall services over to our new Dual Zone VMware NSX firewalls.
This migration will take place within the next 90 days at no additional or ongoing cost to the relevant clients, and the migration costs will be waived. The move will provide total datacentre and network redundancy to all clients that were previously on the legacy firewall platform. Further information will be sent to the affected clients in the coming days.

To safeguard against this happening again, and to provide greater redundancy to any client that was impacted during this outage, we will be performing the tasks below as a matter of priority:

Build new direct paths for specific services between our Sydney sites to ensure we can failover in the event of a major switch stack failure.
Migrate any critical services away from these switch stacks that were affected.
Implement a roll-out plan to patch the bug on all affected switches.
Consider replacing the vendor with another vendor to avoid any further issues.
Migrate all legacy firewall services to our new Dual Zone VMware platform to ensure that these have redundancy in the event of a datacentre failure.

Updates will be provided to any client that may be affected by the above works, and the work will be performed during scheduled maintenance windows.

We sincerely apologise for any issues caused and are working hard to provide all clients with a reliable and issue-free service, along with being transparent if an issue does occur.

Below is a complete timeline of events from the incident on the 17th March 2021.

Timeline of Events

Syncom SYD2 - 17/03/2021

05:32 - Initial alerts for interruption sent to on-call engineers and troubleshooting commenced.
05:50 - Engineers identify an issue with a switching stack in SYD2 between active/standby members.
06:15 - Engineers commence a switching stack re-balance, which did not resolve the issue. A reload of both stack members was issued to bring services back online. This caused a brief service interruption to SYD2-based services for five minutes while the stack members came back online.
06:20 - Services confirmed operational. Engineers begin monitoring and establish an initial investigation into the unanticipated event that occurred.

Sydney - Multiple Sites - 17/03/2021

08:49 - Initial alerts for interruption sent to engineers.
08:55 - Engineers commence troubleshooting to identify the root cause and see if this is related to the earlier incident at Syncom SYD2.
09:15 - Initial investigations isolated the source of the issue to SY1.
09:21 - NOC team commences initial remediation works. Traffic diverted to alternate paths via SY3. Services dependent on SY1 remain affected.
10:07 - Root cause identified as an issue with our core switching stack in SY1.
10:12 - Engineers establish that a number of operational aggregated services that connect to the core switching stack would be impacted if drastic remediation was taken. Due to this, the same re-balance and reload option previously used in SYD2 was not immediately possible.
10:30 - Engineers relay information regarding the core switching stack to the network hardware vendor to identify a non-disruptive stack re-balancing method. This is sought in an attempt to avoid a complete switching stack reload, as was undertaken in SYD2.
10:40 - Engineers begin testing the non-disruptive stack re-balancing method provided by the hardware vendor.
10:44 - The re-balancing method results are reviewed and confirmed to be successful in the testing environment. Engineers begin to prepare to re-balance the SY1 switching stack.
10:51 - Re-balancing of SY1 switching stack is initiated. Engineers begin to monitor.
10:55 - Engineers confirm via monitoring that the re-balancing was unsuccessful. Rollback procedures are initiated.
11:00 - Rollback procedures are completed. Monitoring continues to show additional services now impacted and the scope of impact is analysed. The MySAU Portal team confirms access to the portal is slowed due to impacted components.
11:05 - Commenced diverting traffic from services in SY4 around the affected site.
11:09 - All SY4 traffic confirmed diverted.
11:20 - Engineers begin to analyse switch stack configuration in an attempt to further diagnose the nature of the problem.
11:42 - A forwarding problem is confirmed that now affects the impacted switching stack.
11:47 - Further discussion with hardware vendors is conducted to assist with isolating the issues.
12:11 - Engineers identify that some traffic is still entering the affected site from external providers. Further action is taken to prevent inbound traffic entering via SY1.
12:15 - Testing confirms that re-routing inbound traffic has resolved some issues.
13:05 - Engineers commence diverting traffic from SYD2 via SY3.
13:07 - Engineers complete traffic diversion for SYD2.
13:16 - Emergency spare equipment is prepared with compatible firmware from the network hardware vendor, in preparation for onsite engineers to install if required.
13:17 - Investigations continue with the network hardware vendor in relation to the switching stack.
13:45 - Technician dispatched to the SY1 facility with SLA hardware as a precaution.
13:46 - Discussions continue with the network hardware vendor, who is actively working on the impacted equipment via remote console.
13:56 - Initial discussions begin on how to most efficiently mitigate any impact on aggregated services that traverse the SY1 networking infrastructure.
14:31 - Work with the network hardware vendor continues. Internal work begins to reduce congestion affecting the MySAU portal and alleviate delays in accessing it.
14:53 - Some connections between SY1 and other sites are restored through manual workarounds and changes.
15:15 - Engineers identify and commence diverting traffic from 5GN SDC via SY3.
15:18 - Engineers complete traffic diversion for 5GN SDC.
15:47 - Manual re-routing is conducted for some specific impacted services, where possible to do so.
15:54 - Additional services are brought online via manual intervention.
16:11 - The network hardware vendor’s investigation is ongoing via remote console.
16:38 - While awaiting the vendor, teams prepare workflows for potential outcomes that may impact aggregated services and bring more services online manually where possible.
17:44 - The network hardware vendor exhausts all other remediation options and indicates that the only resolution possible is to apply a configuration change and perform a reload of the switching stack, which will impact services that rely on this site.
18:00 - Engineering teams enact the prepared workflow for this scenario and begin further remediation works for aggregated services that will be impacted due to the reload.
18:19 - Mitigation works commence for some client services including Private Cloud (HADRaaS) and other clients with multi-zone configurations to ensure no connectivity is lost when the switching stack is reloaded.
18:47 - Remediation works continue, with temporary networking configuration deployed to mitigate the potential impact.
19:58 - Onsite engineers begin physical cabling work to further mitigate the potential impact on aggregated services.
20:46 - Physical cabling is complete, network engineers review and begin configuration changes to prepare for cutover of the aggregated services.
22:13 - A core component of aggregated services completes cutover, providing the green light to commence the reload of the switching stack.
22:27 - Engineers commence a reload of the switching stack in cooperation with the network hardware vendor.
22:35 - The switching stack is successfully reloaded and returned to service. External monitoring reports impacted services online.
22:43 - Engineers initiate reverting remediation works to return services to optimal paths.
23:18 - All network paths are restored to service and temporary bypasses removed.
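For readers who want the elapsed times implied by the timeline, the arithmetic can be sketched in Python as below. This is an illustrative calculation only; the timestamps are copied from the timeline above, while the helper name `elapsed` and the choice of which entries mark the start and end of each phase are assumptions made for this example.

```python
from datetime import datetime


def elapsed(start, end, fmt="%H:%M"):
    """Return (hours, minutes) between two same-day HH:MM timestamps."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    total_minutes = int(delta.total_seconds() // 60)
    return divmod(total_minutes, 60)


# Syncom SYD2: first alert (05:32) to services confirmed operational (06:20).
syd2 = elapsed("05:32", "06:20")

# Sydney, multiple sites: first alert (08:49) to stack reloaded and
# monitoring reporting impacted services online (22:35).
multi_site_online = elapsed("08:49", "22:35")

# Sydney, multiple sites: first alert (08:49) to all network paths
# restored and temporary bypasses removed (23:18).
multi_site_restored = elapsed("08:49", "23:18")

print(syd2)                 # (0, 48)
print(multi_site_online)    # (13, 46)
print(multi_site_restored)  # (14, 29)
```

These figures measure the time between timeline entries, not the disruption experienced by any individual client, since many services were restored earlier through re-routing.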

Posted Mar 23, 2021 - 12:25 AEDT

Resolved

This incident has been resolved.

Posted Mar 22, 2021 - 17:03 AEDT

Monitoring

The reload of the network device has now been completed and engineers are currently reviewing impacted services. Initial monitoring reports are promising. We are working to review all reported disruptions from clients through the MySAU portal at this time.

Our 24/7 team will be continuing to monitor impacted services and can be contacted if you believe you are continuing to experience a service disruption. A full investigation is underway and will be provided as a final update to this status once completed.

Posted Mar 17, 2021 - 22:52 AEDT

Update

Engineers are now in the next stage of remediation works and are preparing to conduct a controlled reload of a network device. Once this is completed, connectivity for disrupted services is expected to be restored.

This is now commencing.

Posted Mar 17, 2021 - 22:27 AEDT

Update

Engineers are continuing remediation works that are well underway. The current estimation for work completion is within another 60 minutes, with applicable updates to be provided.

Posted Mar 17, 2021 - 21:45 AEDT

Update

Engineers have commenced works now. Updates will be provided as soon as possible.

Posted Mar 17, 2021 - 21:20 AEDT

Update

Engineers are finalising the planned remediation works now, with a few minor tasks remaining before the works commence to prevent any unexpected disruptions.

As mentioned in the previous update, an additional notice will be provided once these works are about to commence.

Posted Mar 17, 2021 - 20:28 AEDT

Update

After further consultation with our hardware vendors, engineers are planning to implement further remediation works which are expected to alleviate the remaining service disruption being experienced.

Final preparation is underway and this is expected to be initiated in the next 60 minutes. A further update will be provided in relation to this, prior to commencing.

Posted Mar 17, 2021 - 19:24 AEDT

Update

Hardware vendors are actively assisting the onsite team with reviewing affected network infrastructure and are working to provide the next steps towards a resolution for the remaining impacted services. Service restoration for all impacted clients remains our top priority.

Posted Mar 17, 2021 - 17:34 AEDT

Update

Engineers are currently working further with hardware vendors to investigate the cause of the remaining disruptions, with onsite engineers assisting with additional investigative efforts.

We are actively working to identify and resolve services still impacted and will continue this work into the evening.

Posted Mar 17, 2021 - 16:54 AEDT

Update

Network engineers are continuing their work to rectify disruptions for clients that are still impacted. At this point in time work is being directed to resolve connectivity for services utilising firewall appliances from our SY1 datacentre.

A further update will be provided once this has progressed further.

Posted Mar 17, 2021 - 16:18 AEDT

Update

Engineers have completed another round of changes and additional services are now reporting in with connectivity restored. We're continuing works this afternoon to bring online all remaining disrupted services.

Posted Mar 17, 2021 - 15:34 AEDT

Update

Engineers are working with our hardware vendors to further investigate affected network infrastructure in the SY1 datacentre which is believed to be the cause of disruption for remaining services. As we continue to coordinate onsite works we will provide further updates accordingly.

Posted Mar 17, 2021 - 14:34 AEDT

Update

Engineers are continuing to troubleshoot and review the remaining services that are still reporting network disruptions at this point in time.

Posted Mar 17, 2021 - 13:40 AEDT

Update

Engineers have confirmed from further work that some additional services have been restored by utilising the alternate network path.

We are continuing to review all services impacted by this disruption to restore connectivity.

Posted Mar 17, 2021 - 13:11 AEDT

Update

Engineers are preparing to shift some traffic within our Sydney infrastructure to resolve this disruption. This may cause a path change for some services with active networking along with TCP resets due to the path change.

An update will be provided again once this has been completed.

Posted Mar 17, 2021 - 12:48 AEDT

Update

Engineers have begun to implement additional changes that are seeing some affected services from within our Sydney region restored. We are continuing to work on restoring service connectivity for all impacted clients as our top priority.

Posted Mar 17, 2021 - 12:25 AEDT

Update

Engineers have completed initial works which have not proven successful in resolving experienced network disruptions. Engineers are continuing to work on a resolution as a matter of urgency.

Posted Mar 17, 2021 - 12:06 AEDT

Update

This maintenance window has been extended due to the complexity of the maintenance works required. At this time all troubleshooting effort is directed at restoring connectivity for affected services, which is our utmost priority.

We continue to appreciate your cooperation during these emergency works.

Posted Mar 17, 2021 - 11:48 AEDT

Update

Engineers are continuing to conduct emergency works to rectify service disruptions being experienced by some clients in our Sydney locations. No ETA for work completion is available at this point in time.

Further updates will continue to be provided on a regular basis.

Posted Mar 17, 2021 - 11:31 AEDT

Update

Engineers are currently still working inside of this emergency maintenance window to rectify service disruptions being experienced. We appreciate your patience during this time.

Posted Mar 17, 2021 - 11:14 AEDT

Update

Engineers are now carrying out works within the emergency maintenance window. During this time some services may be impacted for several minutes.

An update will be provided once complete.

Posted Mar 17, 2021 - 10:58 AEDT

Update

Engineers are preparing to commence emergency maintenance works to resolve the network disruption being experienced by some clients. The emergency maintenance may result in a continuous disruption of several minutes while the work is underway.

An update will be provided once this maintenance is completed.

Posted Mar 17, 2021 - 10:48 AEDT

Update

Engineers believe they have identified the root cause of the disruption currently being experienced and are working to organise an emergency fix to restore full connectivity for the remaining services. A further update will be provided as soon as possible.

Posted Mar 17, 2021 - 10:20 AEDT

Update

Engineers are still conducting further troubleshooting to identify the root cause and restore connectivity for the remaining services that are experiencing a disruption.

Posted Mar 17, 2021 - 09:41 AEDT

Identified

Engineers have implemented a minor change to restore connectivity to affected services, and monitoring has confirmed that network connectivity is restored for some services. Engineers are manually checking the remaining services.

Further analysis will be performed to identify the source of the disruption this morning.

Posted Mar 17, 2021 - 09:21 AEDT

Investigating

Engineers are reviewing further reports of network disruption that is impacting services in our SY4 locations. Further updates will be provided shortly.

Posted Mar 17, 2021 - 09:01 AEDT

Monitoring

Engineers have had to perform a controlled network event to restore connectivity to some services this morning. During the connectivity resynchronisation, services in our SYD2 location may have experienced a disruption of up to a few minutes. Further investigation is underway to review the cause of this disruption.

Posted Mar 17, 2021 - 06:29 AEDT

Investigating

Engineers are reviewing monitoring alerts in relation to a potential network disruption experienced by a small portion of clients within the Sydney region this morning.

Further updates on the matter will be supplied when available.

Posted Mar 17, 2021 - 05:49 AEDT

This incident affected: Regions (Sydney) and Services (Network).