We would like to share more details about the events that occurred with Phrase between 03:01 PM CET and 04:58 PM CET on December 14th, 2023, which led to a partial outage of all TMS (EU) components, and about what Phrase engineers are doing to prevent these issues from happening again.
2:55 PM CET: Initiated a procedure to update the Phrase NextMT configuration.
3:01 PM CET: The first application server failed to start properly.
3:05 PM CET: The second application server failed to start properly.
3:09 PM CET: The third application server failed to start properly, while the remaining 85% of servers started without problems.
3:35 PM CET: A support engineer alerted a TMS engineer about multiple customer reports regarding system availability issues; however, the on-duty engineer was not immediately contacted.
4:25 PM CET: Contacted the on-duty engineer.
4:54 PM CET: Identified three failed application servers through monitoring data.
4:58 PM CET: Removed the failed servers from the load balancer, resolving issues for all customers.
5:16 PM CET: Restarted and reintegrated the previously failed application servers into the load balancer.
Due to human error, the TMS application's new configuration was not delivered in a safe manner, causing 15% of the servers to start in an unhealthy state. To the load balancer, however, these servers appeared healthy, so it continued to route requests to them. Removing the failed servers from the load balancer and restarting them resolved the issue.
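The gap between "the process is running" and "the application can serve requests" can be illustrated with a minimal sketch of a readiness check. This is a generic illustration, not a description of Phrase's actual load balancer or health check configuration; the function names `readiness` and `select_backends`, the check names, and the server names are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[], bool]


def readiness(checks: Dict[str, Check]) -> Tuple[int, Dict[str, bool]]:
    """Run every named check and return an HTTP-style status plus details.

    A check that raises is treated as failed, so a server that could not
    load its configuration reports 503 instead of looking healthy.
    """
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


def select_backends(servers: Dict[str, Dict[str, Check]]) -> List[str]:
    """Return the servers a load balancer should keep in rotation
    (hypothetical model: only servers whose readiness status is 200)."""
    return [name for name, checks in servers.items()
            if readiness(checks)[0] == 200]
```

Under these assumptions, a server whose process is up but whose configuration check fails would return 503 and drop out of `select_backends`, rather than continuing to receive traffic. A check that only confirms the process is listening would not make this distinction.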
First, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.