Jobs Not Being Processed
Incident Report for Phrase
Postmortem

Introduction

We would like to share more details about the events that occurred with Phrase between 04:50 PM CEST on November 13, 2023 and 04:00 PM CEST on November 14, 2023. These events led to a partial outage of some services in the Phrase Strings application. This report also describes what Phrase engineers are doing to prevent such issues from happening again.

On November 13, the addition of new middleware to the job processing framework of the Strings application caused an unexpected outage of background job processing, which delayed the completion of some actions:

  • Creation of OTA releases
  • Merging of branches
  • Comment notifications

During the outage, the delivery of notifications and webhooks was also impacted.

Timeline (14th November, 2023)

12:14 PM CEST: The problem is identified after reports of parts of the application behaving unexpectedly.

01:41 PM CEST: The pull request that introduced this change is reverted and most of the affected parts of the application return to their normal behavior.

05:00 PM CEST: Most of the critical background jobs that were missed during the outage are manually triggered.

Root Cause

The root cause was narrowed down to new middleware introduced into the background job processing framework, which was meant to provide better internal visibility into job execution. Although checks are in place to ensure that every change to the application’s codebase fulfills a predefined set of requirements before it is rolled out, in this case the issue only became visible once the change reached the production environment.
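As an illustration only, the following minimal Python sketch shows one way a middleware in a job processing chain can produce these symptoms (jobs enqueued but never performed, with no errors or retries). It is not the actual Phrase code, and it does not claim to reproduce the specific defect. All names here (`build_chain`, `visibility_ok`, `visibility_broken`, `run`) are hypothetical. The sketch assumes a chain in which each middleware is responsible for calling the next handler.

```python
from typing import Callable, Dict, List

Job = Dict[str, object]
Handler = Callable[[Job], None]
Middleware = Callable[[Job, Handler], None]


def build_chain(middlewares: List[Middleware], perform: Handler) -> Handler:
    """Wrap `perform` so that each middleware runs around it, outermost first."""
    handler = perform
    for mw in reversed(middlewares):
        # Bind mw and the current handler explicitly to avoid late-binding bugs.
        handler = (lambda m, nxt: lambda job: m(job, nxt))(mw, handler)
    return handler


def visibility_ok(job: Job, nxt: Handler) -> None:
    """Hypothetical visibility middleware that records events around the job."""
    job.setdefault("trace", []).append("start")
    nxt(job)  # hand the job on to the rest of the chain
    job["trace"].append("finish")


def visibility_broken(job: Job, nxt: Handler) -> None:
    """Same middleware, but it never calls `nxt`, so the job is silently skipped."""
    job.setdefault("trace", []).append("start")
    # Missing: nxt(job). No exception is raised, so nothing is retried.
    job["trace"].append("finish")


def run(middlewares: List[Middleware], jobs: List[Job]) -> List[object]:
    """Run every job through the chain and return the ids actually performed."""
    performed: List[object] = []
    chain = build_chain(middlewares, lambda job: performed.append(job["id"]))
    for job in jobs:
        chain(job)
    return performed
```

In this sketch, `run([visibility_broken], jobs)` returns an empty list without raising. This is why a failure of this kind can pass checks that only look for errors.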

Actions to Prevent Recurrence

To prevent any further disruptions of this kind, additional checks and monitoring have been put in place.
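The specific checks and monitoring are not described in this report. As a hedged sketch of one possible form such monitoring could take, the Python snippet below flags a stall when jobs keep being enqueued while none are processed. The `QueueSample` type, the `is_stalled` function, and the `min_new_jobs` threshold are assumptions made for illustration, not part of the Phrase tooling.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    """Cumulative counters read from the job system at one point in time."""
    enqueued_total: int
    processed_total: int


def is_stalled(previous: QueueSample, current: QueueSample,
               min_new_jobs: int = 1) -> bool:
    """Return True if new jobs arrived between samples but none were processed."""
    new_enqueued = current.enqueued_total - previous.enqueued_total
    new_processed = current.processed_total - previous.processed_total
    return new_enqueued >= min_new_jobs and new_processed == 0
```

A check like this does not depend on errors being raised, so it could catch a silent stop in processing that error-based alerting would miss.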

Conclusion

First and foremost, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.

Posted Nov 17, 2023 - 11:39 CET

Resolved
One of our middleware components stopped performing jobs after a PR was merged. Jobs that were being enqueued were not being processed, and no errors or retries were triggered during this time.
Posted Nov 13, 2023 - 16:15 CET