Written on March 23, 2021.
Let’s start with the facts. Last week the FL3XX platform became unresponsive at times on three separate days. Namely, it happened on:
Strictly speaking, the system was still responding, but so slowly that unless you literally went for a coffee after, say, clicking a button, you would not see any reaction to your input.
How could such a thing go undetected until it affected most of our users? The explanation is simple, and we have learned a lot from it. We implemented some improvements to our infrastructure and, like everything else, they went through the various test phases. Everything was fine and working as expected. No issues, no flags anywhere. Then we released the new version to production and it worked fine for some time. When the problems hit, at the times above, it was peak hours. All of a sudden the load on the system was far higher, and the database delayed a couple of responses simply because it could not answer any faster than its normal speed. As those responses queued and more requests kept coming, the database slowed even more while trying to cope with the increased demand. This is a runaway issue that in a matter of minutes becomes unsolvable, even if traffic should suddenly diminish. The traffic jam is just too big to clear itself.
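To make the runaway effect concrete, here is a minimal sketch in Python. It is a toy model, not our production system: the function name `simulate_backlog`, the capacity of 100 requests per minute, and the slowdown formula are all illustrative assumptions, not measurements from the FL3XX platform. The only idea it captures is that a database under a growing backlog completes fewer requests per minute, so a long enough burst leaves a queue that keeps growing even after traffic returns to normal.

```python
def simulate_backlog(arrivals_per_min, base_capacity=100.0, slowdown_per_queued=0.002):
    """Return the number of queued requests at the end of each minute.

    Hypothetical model: every queued request slightly reduces how many
    requests the database can complete per minute (contention). All
    parameters are made-up values for illustration only.
    """
    backlog = 0.0
    history = []
    for arrivals in arrivals_per_min:
        pending = backlog + arrivals
        # Effective capacity shrinks as the backlog grows.
        capacity = base_capacity / (1.0 + slowdown_per_queued * backlog)
        served = min(pending, capacity)
        backlog = pending - served
        history.append(backlog)
    return history


# Normal traffic of 90 requests/min, with a burst of 130 requests/min.
short_burst = simulate_backlog([90] * 5 + [130] * 1 + [90] * 30)
long_burst = simulate_backlog([90] * 5 + [130] * 10 + [90] * 30)

print("backlog 30 min after a 1-minute burst: ", round(short_burst[-1]))
print("backlog 30 min after a 10-minute burst:", round(long_burst[-1]))
```

In this toy model the one-minute burst drains on its own, while the ten-minute burst pushes the backlog past the point where normal traffic alone already exceeds the reduced capacity, so the queue keeps growing after the burst is over.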
We now understand where we failed. Our tests put traffic load on the various rigs that go through the testing process, and these loads are based on our peak load. Ok, great. But our load increases every month, as we add more customers and our customers add more aircraft or more users to the platform, or start using more features and integrations. The increase is not steady and, as with every short-term graph, it is highly volatile. Over the longer term, however, we see an upward trend. We are now accounting for this trend with a formal traffic forecast.
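As a rough illustration of what sizing a load test from a forecast rather than from the current peak could look like, here is a minimal sketch. The function `load_test_target`, the three-month horizon, and the 1.5x headroom are hypothetical choices for the example, and the growth estimate is deliberately crude (it only compares the first and last month); a real forecast would need to handle the volatility mentioned above.

```python
def load_test_target(monthly_peaks, months_ahead=3, headroom=1.5):
    """Suggest a load-test target from past monthly peak loads.

    Illustrative assumptions: compound monthly growth estimated from the
    first and last observation, projected forward from the highest peak
    seen so far, then multiplied by a safety margin.
    """
    if len(monthly_peaks) < 2 or min(monthly_peaks) <= 0:
        raise ValueError("need at least two positive monthly peaks")
    periods = len(monthly_peaks) - 1
    growth = (monthly_peaks[-1] / monthly_peaks[0]) ** (1.0 / periods)
    # Never plan for shrinking traffic, even if the recent months dipped.
    growth = max(growth, 1.0)
    projected_peak = max(monthly_peaks) * growth ** months_ahead
    return projected_peak * headroom


# Made-up peak loads (requests per minute) for the last six months.
peaks = [800, 760, 900, 870, 1010, 1100]
print(round(load_test_target(peaks)))
```

The point of the sketch is only that the test load tracks where traffic is heading, with margin, instead of where it was when the tests were written.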
We had a similar surprise last year. When the Covid lockdowns ended after the first wave, we witnessed a sharp uptick in traffic as everyone started flying again. In the meantime we had added many customers to the platform, so over one week we saw usage of the platform jump by 2.5x. Thankfully, we had a project to upgrade the entire platform ready to implement. We had not executed it because traffic was subdued during Covid and we were very far from overheating the systems. So we implemented the project very quickly, and nobody noticed much other than the system being slightly slow here and there.
To complete the story, we implemented a large set of under-the-hood improvements to our infrastructure. We rarely talk about these because they are not really visible. The FL3XX platform is now 10x more complex than it was just a year ago and now counts no fewer than 40 different service providers to ensure the speed and reliability that you expect anywhere on earth, 8766 hours a year. These changes were meant to improve reliability and speed, though they obviously failed on the latter!
Finally, a word of apology to you, our customers and users. For a number of tasks, FL3XX is a mission-critical tool. Should the platform suddenly fail, there are no alternatives for several of its data components and features. We are acutely aware of that and we take it very seriously. Our entire team was focused on these issues when they arose last week, and we are asking lots of questions to ensure this problem does not happen again.
From a technical perspective, we managed honorably, with only 3 times 2 hours of disruption. From a business perspective, that is 6 hours too many. But as we innovate continuously across the entire platform, and do so at a price point an order of magnitude lower than other systems of comparable complexity, I feel it’s a very good performance nonetheless. Not everything is solved and there might be some slight slowness over the coming days as we plow through all the details, but there should be no more runaway issues with heavy disruption. (picture: One of our dashboards monitoring the system)
I hope this makes the nature of the problem clearer. While it does not mend any disruption you might have suffered, at least you now have an idea of how it came about.