November 12, 2024
We would like to share more details about the events that occurred with Phrase between 9:20 AM CET and 11:00 AM CET on November 12, 2024. These events led to a gradual outage of all Phrase Strings (EU) components except OTA. We also want to share what Phrase engineers are doing to prevent these issues from recurring.
9:22 AM CET: We received a latency warning from our monitoring tool regarding our background job queues.
9:30 AM CET: We identified that a large number of enqueued webhook delivery jobs were causing high memory usage in our Redis instance. Redis eventually became unresponsive, which affected the processing of other background jobs.
9:40 AM CET: The root cause was identified as a large number of misconfigured and duplicated webhooks that were triggered by high activity.
9:55 AM CET: We began cleaning up the queue by identifying the duplicated webhook delivery jobs.
10:10 AM CET: The cleanup was completed and webhook delivery returned to normal.
10:15 AM CET: We re-triggered the background jobs for translation statistics and search indexing, which were affected by the outage.
11:00 AM CET: Systems stabilized.
1:20 PM CET: Processing of the re-triggered background jobs completed and the incident was declared resolved.
The root cause of this incident was identified as a large number of misconfigured and duplicated webhooks triggered by high user activity. This resulted in a high load on backend services, which affected the processing of several background jobs.
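To illustrate the cleanup step from the timeline (identifying duplicated webhook delivery jobs in the queue), here is a minimal Python sketch. It is not the tooling Phrase used. The job shape and the field names `id`, `url`, `event`, and `payload_id` are hypothetical, and the sketch assumes two jobs are duplicates when they target the same URL for the same event and payload.

```python
from typing import Iterable


def find_duplicate_jobs(jobs: Iterable[dict]) -> list:
    """Return the ids of jobs that repeat an earlier delivery.

    Assumption (hypothetical schema): each job is a dict with the keys
    "id", "url", "event", and "payload_id". The first job seen for a given
    (url, event, payload_id) combination is kept; later ones are reported.
    """
    seen = set()
    duplicates = []
    for job in jobs:
        key = (job["url"], job["event"], job["payload_id"])
        if key in seen:
            duplicates.append(job["id"])
        else:
            seen.add(key)
    return duplicates


# Example queue snapshot with one repeated delivery (illustrative data only).
queue_snapshot = [
    {"id": "j1", "url": "https://example.com/hook", "event": "translations:update", "payload_id": 1},
    {"id": "j2", "url": "https://example.com/hook", "event": "translations:update", "payload_id": 1},
    {"id": "j3", "url": "https://example.com/hook", "event": "translations:update", "payload_id": 2},
]
duplicate_ids = find_duplicate_jobs(queue_snapshot)
```

Keeping the first occurrence and reporting only the repeats means each intended delivery still happens once, while the redundant jobs can be removed from the queue.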