What happened
On 30 March 2019, New Zealand's domestic air traffic services experienced a 47-minute disruption. The outage originated at the Christchurch air traffic management centre, where controllers lost processed radar surveillance data on their screens. The system automatically transitioned to a degraded mode, and in some sectors, aircraft positions were displayed based on predicted flight plan data rather than live radar.
Simultaneously, the primary communication channels between controllers and pilots were interrupted, forcing the use of backup radio systems. The disruption also briefly affected controllers in Auckland. Forty-one aircraft were in the air at the time. Despite the loss of primary surveillance and communication tools, all landed safely, though two aircraft chose to return to their departure points.
The investigation
The investigation focused on why a single component failure caused a widespread loss of essential services. Investigators determined that the event began with the failure of a capacitor within an uninterruptible power supply (UPS) unit. This failure released conductive debris inside the unit, causing a short circuit that tripped a circuit breaker and cut power to the connected loads.
The network equipment was designed to draw power from two redundant sources (System A and System B), but the investigation found that the critical MPLS network equipment had been incorrectly wired. Both power supply connections for this equipment were plugged into the same UPS (System A) rather than being split between the two independent systems. As a result, when that UPS failed, the equipment lost both of its power feeds and the network went down.
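The wiring fault described above is the kind of error a simple inventory audit can flag. The following is a minimal sketch only, assuming a hypothetical record of which power source feeds each supply input of each device. The device names, the "UPS-A" and "UPS-B" labels, and the data layout are illustrative assumptions and are not taken from the Airways installation or the investigation report.

```python
# Hypothetical inventory: device name -> power source feeding each supply input.
# All names here are illustrative assumptions, not real equipment records.
POWER_FEEDS = {
    "mpls-router": ["UPS-A", "UPS-A"],     # miswired: both inputs on System A
    "example-server": ["UPS-A", "UPS-B"],  # correctly split across both systems
}


def find_single_source_devices(feeds):
    """Return devices whose supply inputs do not span two distinct sources."""
    return sorted(
        device for device, sources in feeds.items()
        if len(set(sources)) < 2
    )


def devices_lost_on_failure(feeds, failed_source):
    """Return devices left with no live input if failed_source goes down."""
    return sorted(
        device for device, sources in feeds.items()
        if all(source == failed_source for source in sources)
    )


if __name__ == "__main__":
    # Devices with no effective power redundancy.
    print(find_single_source_devices(POWER_FEEDS))
    # Devices that would go dark if the System A UPS failed.
    print(devices_lost_on_failure(POWER_FEEDS, "UPS-A"))
```

An audit like this only checks records. It would not catch a cable that is plugged in differently from what the inventory says, which is why a physical verification remains necessary.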
Findings
- The outage was triggered by a capacitor failure in a UPS unit.
- A short circuit caused by debris from the failed capacitor led to the loss of power to the primary network equipment.
- Essential network equipment was improperly connected to a single power source, negating the intended redundancy.
- Maintenance procedures designed to identify such wiring errors had been deferred by management.
Safety action
Following the incident, Airways implemented several changes to prevent a recurrence, including moving network equipment to a new facility with physically separated power rooms. The organization also adopted a new color-coding standard for all UPS cables to make wiring errors easier to detect. Additionally, Airways updated its maintenance protocols to ensure regular power outage checks are performed to verify the integrity of dual power connections.