August 13, 2025
We would like to share more details about the events that occurred with Phrase between August 12, 2025 05:45 PM CEST and August 13, 2025 10:40 AM CEST, which led to degraded performance of the workflow engine component of Phrase Orchestrator (EU DC), and about what Phrase engineers are doing to prevent these issues from recurring.
August 12, 2025 05:45 PM CEST: Workflow execution throughput began to decrease, so workflows started more slowly than usual. The scheduling of workflows for execution was unaffected.
August 12, 2025 07:53 PM CEST: First external report of the issue received.
August 12, 2025 08:31 PM CEST: Workflow processing significantly impacted.
August 13, 2025 10:40 AM CEST: Issue identified and fix deployed. Pending workflows resumed execution, working off the queue.
August 13, 2025 07:38 PM CEST: All delayed workflows completed. Normal operations restored.
A change was made to the Orchestrator’s message handling logic. It introduced a mechanism in which the workflow engine had to explicitly acknowledge receipt of each message before new messages were submitted for execution. The change was intended to prevent the workflow engine from being overloaded when workflow triggers fire in rapid succession.
Due to a bug in this new acknowledgment logic, some acknowledgments were not properly registered under production traffic. As a result, the system incorrectly assumed that the engine was at full capacity and gradually reduced the number of messages it sent, even though the engine had spare capacity.
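To illustrate the failure mode described above, the following is a minimal sketch of an acknowledgment-gated dispatcher. It is not Phrase's implementation: the class `AckGatedDispatcher`, the `simulate` helper, the `max_in_flight` limit, and the "every Nth acknowledgment is lost" rule are all hypothetical, chosen only to show how unregistered acknowledgments can make a sender believe the engine is full.

```python
from collections import deque


class AckGatedDispatcher:
    """Hypothetical sender: dispatches only while unacknowledged messages stay below a limit."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight: set[int] = set()   # sent but not yet acknowledged
        self.queue: deque[int] = deque()   # scheduled, waiting to be sent

    def schedule(self, message_id: int) -> None:
        self.queue.append(message_id)

    def dispatch(self) -> list[int]:
        """Send queued messages until the in-flight limit is reached."""
        sent = []
        while self.queue and len(self.in_flight) < self.max_in_flight:
            message_id = self.queue.popleft()
            self.in_flight.add(message_id)
            sent.append(message_id)
        return sent

    def acknowledge(self, message_id: int) -> None:
        self.in_flight.discard(message_id)


def simulate(rounds: int, max_in_flight: int, lost_ack_every: int) -> list[int]:
    """Return messages sent per round; every Nth ack is lost (0 means none are lost)."""
    dispatcher = AckGatedDispatcher(max_in_flight)
    next_id = 0
    acks_seen = 0
    sent_per_round = []
    for _ in range(rounds):
        # Keep the queue supplied so throughput is limited only by the ack gate.
        for _ in range(max_in_flight):
            dispatcher.schedule(next_id)
            next_id += 1
        sent = dispatcher.dispatch()
        sent_per_round.append(len(sent))
        for message_id in sent:
            # The engine finishes every message, but some acks never register.
            acks_seen += 1
            if lost_ack_every and acks_seen % lost_ack_every == 0:
                continue
            dispatcher.acknowledge(message_id)
    return sent_per_round


healthy = simulate(rounds=5, max_in_flight=4, lost_ack_every=0)
degraded = simulate(rounds=8, max_in_flight=4, lost_ack_every=3)
print(healthy)   # [4, 4, 4, 4, 4]
print(degraded)  # [4, 3, 2, 1, 1, 1, 0, 0]
```

In this toy model the engine completes every message it receives, yet each lost acknowledgment permanently occupies one in-flight slot. Throughput therefore decays gradually and eventually stops, while the queue of scheduled messages keeps growing.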
The system did not automatically notify the team about the issue, since the engine was not overloaded at the infrastructure level - the condition that would normally trigger an alert and page the on-call engineer.
Importantly: