Degraded Performance of all Phrase Strings (EU) Components except OTA between November 12, 2024 9:20 AM CET and November 12, 2024 11:00 AM CET
Incident Report for Phrase
Postmortem

Root Cause Analysis

November 12, 2024

Introduction

We would like to share more details about the events that occurred with Phrase between 9:20 AM CET and 11:00 AM CET on November 12, 2024, which led to a gradual outage of all Phrase Strings (EU) components except OTA, and about what Phrase engineers are doing to prevent these issues from recurring.

Timeline

9:22 AM CET: We received a latency warning from our monitoring tool regarding our background job queues.

9:30 AM CET: We identified that a large number of enqueued webhook delivery jobs were causing high memory usage in our Redis instance. Redis eventually became unresponsive, which affected the processing of other background jobs.

9:40 AM CET: The root cause was identified as a large number of misconfigured and duplicated webhooks that were triggered by high activity.

9:55 AM CET: We began cleaning up the queue by identifying the duplicated webhook delivery jobs.

10:10 AM CET: The cleanup was completed and webhook delivery returned to normal.

10:15 AM CET: We re-triggered the background jobs for translation statistics and search indexing, which had been affected by the outage.

11:00 AM CET: Systems stabilized.

1:20 PM CET: Processing of the re-triggered background jobs completed and the incident was declared resolved.
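
The queue cleanup described in the 9:55 AM CET entry can be illustrated with a minimal sketch. The report does not say how duplicate delivery jobs were identified, so the job fields (url, event, resource_id) and the rule that two jobs with the same values are duplicates are assumptions made purely for illustration, and the queue is modelled as a plain list rather than a Redis structure.

```python
from typing import Dict, Iterable, List, Tuple


def job_key(job: Dict) -> Tuple:
    """Hypothetical identity of a delivery job: same target URL, event, and resource."""
    return (job["url"], job["event"], job["resource_id"])


def dedupe_jobs(jobs: Iterable[Dict]) -> Tuple[List[Dict], int]:
    """Keep the first job for each key and count how many later copies were dropped."""
    seen = set()
    kept = []
    dropped = 0
    for job in jobs:
        key = job_key(job)
        if key in seen:
            # An identical delivery is already queued, so this copy is redundant.
            dropped += 1
            continue
        seen.add(key)
        kept.append(job)
    return kept, dropped
```

Calling dedupe_jobs on a snapshot of the queue returns the jobs worth keeping, in their original order, together with the number of redundant copies that were skipped.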

Root Cause

The root cause of this incident was identified as a large number of misconfigured and duplicated webhooks triggered by high user activity. This resulted in a high load on backend services which affected the processing of several background jobs.

Actions to Prevent Recurrence

  • Introduce hard limits and uniqueness checks at the project level to prevent duplicate webhook configurations.
  • Increase resource allocation for Redis to ensure stability in background job processing.
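
As a rough illustration of the first action item, the sketch below rejects duplicate webhook configurations and enforces a per-project cap. The class name, the exception names, the limit of 3, and the rule that a webhook is identified by its normalized URL plus its set of events are all assumptions; the report does not describe the actual implementation.

```python
from urllib.parse import urlsplit, urlunsplit

MAX_WEBHOOKS_PER_PROJECT = 3  # hypothetical limit, chosen only for the example


class DuplicateWebhookError(ValueError):
    """Raised when an identical webhook is already configured for the project."""


class WebhookLimitError(ValueError):
    """Raised when the project already holds the maximum number of webhooks."""


def normalize_url(url: str) -> str:
    """Trim whitespace and lower-case the host so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit(parts._replace(netloc=parts.netloc.lower()))


class ProjectWebhooks:
    """In-memory stand-in for one project's webhook configuration."""

    def __init__(self, limit: int = MAX_WEBHOOKS_PER_PROJECT):
        self._limit = limit
        self._configs = set()

    def add(self, url: str, events) -> None:
        key = (normalize_url(url), frozenset(events))
        if key in self._configs:
            raise DuplicateWebhookError(f"webhook already configured: {key[0]}")
        if len(self._configs) >= self._limit:
            raise WebhookLimitError(f"project limit of {self._limit} webhooks reached")
        self._configs.add(key)

    def __len__(self) -> int:
        return len(self._configs)
```

Checking for duplicates before checking the limit means that re-submitting an existing webhook on a full project reports the duplicate rather than the cap, which is arguably the more useful error for the user.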
Posted Nov 14, 2024 - 10:43 CET

Resolved
The incident has been resolved.
Posted Nov 12, 2024 - 13:23 CET
Update
Our engineers are continuing to monitor performance. All components except the Translation center are now operational.
Posted Nov 12, 2024 - 11:44 CET
Monitoring
Our engineers implemented a fix and are monitoring the results.
Posted Nov 12, 2024 - 10:14 CET
Identified
Our engineers have identified the root cause of the degraded performance of all Phrase Strings (EU) components except OTA and are working on a fix.
Posted Nov 12, 2024 - 09:57 CET
This incident affected: Phrase Strings (EU) (Translation center, Repo sync, Email delivery, Ordering, In-context editor, API).