Degraded Performance of all Phrase TMS (EU) components between 03:01 PM and 04:58 PM CET

Incident Report for Phrase

Postmortem

Introduction

We would like to share more details about the events that occurred with Phrase between 03:01 PM CET and 04:58 PM CET on December 14th, 2023, which led to a partial outage of all TMS (EU) components, and about what Phrase engineers are doing to prevent these issues from happening again.

Timeline

2:55 PM CET: Initiated a procedure to update the Phrase NextMT configuration.

3:01 PM CET: The first application server failed to start properly.

3:05 PM CET: The second application server failed to start properly.

3:09 PM CET: The third application server failed to start properly, while the remaining 85% of servers started without problems.

3:35 PM CET: A support engineer alerted a TMS engineer about multiple customer reports regarding system availability issues; however, the on-duty engineer was not immediately contacted.

4:25 PM CET: Contacted the on-duty engineer.

4:54 PM CET: Identified three failed application servers through monitoring data.

4:58 PM CET: Removed the failed servers from the load balancer, resolving issues for all customers.

5:16 PM CET: Restarted and reintegrated the previously failed application servers into the load balancer.

Root Cause

Due to human error, the TMS application's new configuration was not delivered in a safe manner, causing 15% of the servers to start in an unhealthy state. To the load balancer, however, these servers appeared healthy, so it continued routing requests to them. Removing the failed servers from the load balancer and restarting them resolved the issue.
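To illustrate the gap described above, the following is a minimal sketch of a readiness check that reports a server as unhealthy when its configuration did not load, even though the process itself is running. It is not Phrase's implementation: the function name `readiness`, the check names, and the use of HTTP status codes 200 and 503 are assumptions made for illustration only.

```python
from typing import Callable, Dict, Tuple

# Hypothetical readiness probes; each returns True when its dependency is usable.
Check = Callable[[], bool]


def readiness(checks: Dict[str, Check]) -> Tuple[int, Dict[str, bool]]:
    """Return (HTTP status, per-check results) for a health endpoint.

    The status is 200 only if every check passes and 503 otherwise, so a
    load balancer probing this endpoint can stop routing to the server.
    """
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A probe that raises counts as a failed check.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


# Example: the process is up, but the new configuration failed to load.
status, details = readiness({
    "process_up": lambda: True,
    "config_loaded": lambda: False,  # simulated failure, assumed for the example
})
```

A probe that only confirms the process is listening would pass in this situation, while a check that also covers configuration state would fail and signal the load balancer to skip the server.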

Actions to Prevent Recurrence

The configuration delivery process has been updated to default to a safe mode.
The application monitoring has been enhanced to detect such application failures automatically.
The load balancer configuration has been improved to identify this kind of failed server automatically.
Internal training has been organized to ensure on-duty engineers are promptly contacted and informed.
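
The report does not describe how the load balancer configuration was changed. As one possible reading of automatic detection of failed servers, the sketch below removes a backend from rotation after several consecutive failed health probes and returns it after several consecutive successes. The class names, server names, and the thresholds `fall` and `rise` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BackendState:
    in_rotation: bool = True
    consecutive_failures: int = 0
    consecutive_successes: int = 0


@dataclass
class BackendPool:
    fall: int = 3  # failed probes in a row before ejection (assumed value)
    rise: int = 2  # passed probes in a row before reinstatement (assumed value)
    backends: Dict[str, BackendState] = field(default_factory=dict)

    def record_probe(self, server: str, ok: bool) -> None:
        """Update a backend's counters and rotation status from one probe result."""
        state = self.backends.setdefault(server, BackendState())
        if ok:
            state.consecutive_successes += 1
            state.consecutive_failures = 0
            if not state.in_rotation and state.consecutive_successes >= self.rise:
                state.in_rotation = True
        else:
            state.consecutive_failures += 1
            state.consecutive_successes = 0
            if state.in_rotation and state.consecutive_failures >= self.fall:
                state.in_rotation = False

    def routable(self) -> List[str]:
        """Return the names of backends that should currently receive traffic."""
        return sorted(n for n, s in self.backends.items() if s.in_rotation)


pool = BackendPool()
pool.record_probe("app-01", True)
for _ in range(3):
    pool.record_probe("app-02", False)  # app-02 is ejected after the third failure
```

With a mechanism of this kind, the manual steps in the timeline (removing the failed servers, then reintegrating them after a restart) would happen without operator intervention, provided the health probe itself reflects the application's real state.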

Conclusion

First, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and determine how to make changes that improve our services and processes.

Posted Dec 19, 2023 - 11:57 CET

Resolved

Our engineers identified the root cause of the degraded performance and the incident is resolved.

Posted Dec 14, 2023 - 17:17 CET

Investigating

Our engineering team is investigating the degraded performance of all the Phrase TMS (EU) components.

Posted Dec 14, 2023 - 16:31 CET

This incident affected: Phrase TMS (EU) (Analytics, API, CAT web editor, File processing, Machine translation, Project management, Term base, Translation memory).