We would like to share more details about the events that occurred with Phrase between 04:50 PM CEST on November 13, 2023, and 04:00 PM CEST on November 14, 2023, which led to a partial outage of some services in the Phrase Strings application, and about what Phrase engineers are doing to prevent these issues from happening again.
On November 13, the addition of new middleware to the job processing framework of the Strings application caused an unexpected outage of background job processing, which delayed the completion of some actions.
Notification and webhook delivery were also impacted during the outage.
12:14 PM CEST: The problem is identified after reports of parts of the application behaving unexpectedly.
01:41 PM CEST: The pull request that introduced this change is reverted, and most of the affected parts of the application return to their normal behavior.
05:00 PM CEST: Most of the critical background jobs that were missed during the outage are manually triggered.
The root cause was narrowed down to new middleware introduced into the background job processing framework, which was meant to provide better internal visibility into job execution. Although checks are in place to ensure that every change to the application’s codebase meets a predefined set of requirements before it is rolled out, in this case the issue only became visible once the change reached the production environment.
To prevent any further disruptions of this kind, additional checks and monitoring have been put in place.
First, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.