June 11, 2021.
Starting at 20:44Z we experienced connectivity issues: the system was up and running but not reachable from the outside.
Our development and DevOps teams identified an issue with our cloud service provider in the EU-Central-1 Region, where the access point of FL3XX is hosted. This led to several simultaneous failures in the load-balancing system that distributes traffic across our cluster. Although that system is redundant, the simultaneous failures degraded connectivity.
Our provider's root cause analysis found that the issue was a failure of a control system which disabled multiple air handlers in the affected Availability Zone. These air handlers move cool air to the servers and equipment, and when they were disabled, ambient temperatures began to rise. Servers and networking equipment in the affected Availability Zone began to power off when unsafe temperatures were reached. Unfortunately, because this issue impacted several redundant network switches, a large number of EC2 instances in this single Availability Zone lost network connectivity. While our operators would normally have been able to restore cooling before impact, a fire suppression system activated inside a section of the affected Availability Zone.
When this system activates, the data center is evacuated and sealed, and a chemical is dispersed to remove oxygen from the air to extinguish any fire. In order to recover the impacted instances and network equipment, we needed to wait until the fire department was able to inspect the facility. After the fire department determined that there was no fire in the data center and it was safe to return, the building needed to be re-oxygenated before it was safe for engineers to enter the facility and restore the affected networking gear and servers. The fire suppression system that activated remains disabled. This system is designed to require smoke to activate and should not have discharged.
This system will remain inactive until we are able to determine what triggered it improperly. In the meantime, alternate fire suppression measures are being used to protect the data center. Once cooling was restored and the servers and network equipment were re-powered, affected instances recovered quickly.
Before the provider resolved the issue, the FL3XX team was able to launch additional off-site resources to restore access to our servers. The system was fully inspected to guarantee stability and performance. No data loss was detected.
The incident was closed and the system was fully live at 22:31Z.
At FL3XX we treat incidents like these very seriously. While incidents are never 100% avoidable, each one helps us improve our infrastructure and procedures, and the lessons we take from this incident provide valuable insights.
Our internal monitoring, emergency alerting and recovery procedures worked well.
The chain of events leading to multiple failures in the load-balancing system is extremely rare, but it highlights a shortcoming in our architecture.
To mitigate provider-side issues like this one, FL3XX will work hard not only to add more redundancy but also to split redundant servers across different regions.
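The idea behind splitting redundant servers across regions is that a client (or a DNS-level health check) can fall back to a secondary region when the primary becomes unreachable. Below is a minimal Python sketch of ordered regional failover; the endpoint URLs and the injected health check are purely illustrative assumptions, not FL3XX's actual topology or code.

```python
# Candidate endpoints, ordered by preference (primary region first).
# Both URLs are hypothetical placeholders.
ENDPOINTS = [
    "https://eu-central-1.example.com",  # hypothetical primary region
    "https://eu-west-1.example.com",     # hypothetical off-site fallback
]

def pick_endpoint(endpoints, is_healthy):
    """Return the first endpoint whose health check passes.

    `is_healthy` is injected so a caller could supply a real HTTP
    probe; injecting it also keeps this sketch self-contained.
    """
    for url in endpoints:
        if is_healthy(url):
            return url
    raise RuntimeError("no healthy endpoint in any region")

# Example: simulate the primary region being unreachable.
down = {"https://eu-central-1.example.com"}
chosen = pick_endpoint(ENDPOINTS, lambda url: url not in down)
print(chosen)  # falls back to the eu-west-1 endpoint
```

In practice this selection logic usually lives in managed DNS failover or a global load balancer rather than in application code, but the principle is the same: health-check each region and route around the failed one.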
On behalf of the entire team, we apologize for having disrupted your operations today. We know how much each of our customers relies on the 24/7 availability of FL3XX, and we’ll work hard to restore the trust you put in us.