Performance Disruption of Phrase TMS (EU) Term base component between 1:00 and 3:10 PM CEST
Incident Report for Phrase
Postmortem

Introduction

We would like to share more details about the events that occurred with Phrase between 01:00 PM CEST and 03:10 PM CEST on December 6, 2023, which led to a major outage of the Term base service, and about what Phrase engineers are doing to prevent these issues from happening again.

Timeline

01:00 PM CEST: We commenced Elasticsearch maintenance that required restarting the Term base instance node. During the restart of the first node, we noticed degraded performance.

01:20 PM CEST: Cluster rebalancing took place after the restart of the cluster node, and once it completed the cluster appeared stable. We investigated the root cause of the degraded performance.

01:40 PM CEST: In the meantime, due to a resource constraint, the cluster began overutilizing the master node. As a result, the cluster was unable to complete the shard rebalancing.

02:45 PM CEST: We stabilized the cluster by moving shards to less utilized nodes and stopping the automated shard rebalancing. Performance returned to normal.
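
For readers who operate their own Elasticsearch clusters, the two stabilization steps above (stopping automated rebalancing and moving shards by hand) can be sketched roughly as follows. This is only an illustration and not the procedure Phrase engineers ran: it assumes the standard Elasticsearch cluster settings and cluster reroute APIs, and the index name and node names are hypothetical placeholders. The sketch only builds the request bodies and does not contact a cluster.

```python
import json


def disable_rebalance_request():
    """Build a cluster settings request that turns off automatic shard rebalancing."""
    return {
        "method": "PUT",
        "path": "/_cluster/settings",
        # "none" stops the cluster from rebalancing shards on its own.
        "body": {"persistent": {"cluster.routing.rebalance.enable": "none"}},
    }


def move_shard_request(index, shard, from_node, to_node):
    """Build a cluster reroute request that moves one shard to another node."""
    return {
        "method": "POST",
        "path": "/_cluster/reroute",
        "body": {
            "commands": [
                {
                    "move": {
                        "index": index,
                        "shard": shard,
                        "from_node": from_node,
                        "to_node": to_node,
                    }
                }
            ]
        },
    }


if __name__ == "__main__":
    # Hypothetical index and node names, used for illustration only.
    requests_to_send = [
        disable_rebalance_request(),
        move_shard_request("termbase-example", 0, "busy-node", "idle-node"),
    ]
    for req in requests_to_send:
        print(req["method"], req["path"])
        print(json.dumps(req["body"], indent=2))
```

Once the cluster has enough capacity again, automated rebalancing can be re-enabled by setting `cluster.routing.rebalance.enable` back to `all`.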

Root Cause

The cluster did not have sufficient resources available to complete the reallocation of the Elasticsearch cluster shards.

Actions to Prevent Recurrence

Extend cluster resources - we added nodes to the cluster and optimized the memory configuration of the nodes. This was completed at 08:00 PM on December 6, 2023.

Conclusion

Firstly, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.

Posted Dec 07, 2023 - 11:37 CET

Resolved
Phrase TMS (EU) experienced intermittent issues affecting the Term base component. Phrase engineers investigated the issues and were able to resolve them.
Posted Dec 06, 2023 - 13:00 CET