On the afternoon of Tuesday, Jan 23, at 12:01 EST, our on-call support and network operations teams were notified of a network outage by our monitoring system.
As a result of the outage, incoming traffic to certain IPs in the 69.90 sub-range was not routed to the correct cloud servers.
Several services were affected until our team resolved the issue and restored connectivity at 14:46 EST.
Third-party personnel at our Toronto data centre were briefly involved in this process.
The resolution involved a forced restart of the affected router and restoration of its network settings.
Throughout the incident, our support team kept affected clients updated on progress via support tickets.
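As an illustration of how this symptom is visible externally, the sketch below shows the kind of reachability sweep a monitoring system might run against the affected range. The sample addresses and the Linux-style ping flags are assumptions for illustration, not our actual monitoring configuration.

    #!/usr/bin/env python3
    """Minimal reachability sweep: probe a set of addresses and report
    which ones fail to answer. The addresses below are hypothetical
    samples from the affected 69.90 sub-range, not the real IPs."""

    import subprocess

    AFFECTED_IPS = ["69.90.10.11", "69.90.10.12", "69.90.10.13"]  # hypothetical

    def is_reachable(ip: str, timeout_s: int = 2) -> bool:
        """Send one ICMP echo request (Linux ping flags); True on reply."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), ip],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    if __name__ == "__main__":
        for ip in AFFECTED_IPS:
            print(f"{ip}: {'ok' if is_reachable(ip) else 'UNREACHABLE'}")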
Problem and Remediation
At 08:15 EST, we observed a switch reboot within our cloud storage network, which caused a number of cloud servers to report storage failures. The event was reported by both our internal and our third-party monitoring services. Our on-site team noticed the issue immediately and carried out remediation and server reboots over the following 90 minutes.
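For context, a check along the following lines, writing and reading back a small canary file, is one way monitoring can surface storage failures of the kind the servers reported. The mount path here is a hypothetical placeholder, not our actual layout.

    #!/usr/bin/env python3
    """Storage health probe sketch: round-trip a small canary file
    through a storage mount and flag any I/O failure. The mount path
    is a hypothetical placeholder."""

    import os
    import sys
    import uuid

    MOUNT_POINT = "/mnt/cloud-storage"  # hypothetical

    def storage_healthy(mount: str) -> bool:
        """Write, fsync, read back, and delete a canary file."""
        canary = os.path.join(mount, f".healthcheck-{uuid.uuid4().hex}")
        payload = b"canary"
        try:
            with open(canary, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())
            with open(canary, "rb") as f:
                ok = f.read() == payload
            os.remove(canary)
            return ok
        except OSError:
            return False

    if __name__ == "__main__":
        if not storage_healthy(MOUNT_POINT):
            print(f"storage check FAILED for {MOUNT_POINT}", file=sys.stderr)
            sys.exit(1)
        print(f"storage check ok for {MOUNT_POINT}")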
Root Cause
Further investigation showed that this issue was caused by a power interruption within the building's control systems, combined with the failure of a local redundant power supply due to overload.
Preventative Action
We will replace the faulty local UPS, provision additional UPS redundancy, and coordinate with building control on future power switchovers.
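One way the redundant UPS units could be watched going forward is a periodic status poll. The sketch below assumes the units are exposed through Network UPS Tools (NUT) and queried with its upsc client; the unit names are hypothetical, and this is not a description of our actual tooling.

    #!/usr/bin/env python3
    """UPS status poll sketch: query each UPS via NUT's upsc client and
    flag on-battery or overload conditions. Assumes NUT is installed;
    unit names are hypothetical."""

    import subprocess

    UPS_UNITS = ["ups-a@localhost", "ups-b@localhost"]  # hypothetical names

    def ups_status(unit: str) -> str:
        """Return the ups.status value, e.g. 'OL' (online),
        'OB' (on battery), 'OVER' (overload)."""
        out = subprocess.run(
            ["upsc", unit, "ups.status"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()

    if __name__ == "__main__":
        for unit in UPS_UNITS:
            status = ups_status(unit)
            flag = "ALERT" if ("OB" in status or "OVER" in status) else "ok"
            print(f"{unit}: {status} [{flag}]")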