Degraded Performance of Phrase Orchestrator (EU) Workflow Engine component between August 12, 2025 05:45 PM CEST and August 13, 2025 10:40 AM CEST

Incident Report for Phrase

Postmortem

Root Cause Analysis

August 13, 2025

Introduction

We would like to share more details about the events that occurred with Phrase between August 12, 2025 05:45 PM CEST and August 13, 2025 10:40 AM CEST, which led to degraded performance of the workflow engine component of Phrase Orchestrator (EU DC), and to explain what Phrase engineers are doing to prevent these issues from recurring.

Timeline

August 12, 2025 05:45 PM CEST: Workflow execution throughput began to decrease, so workflows started more slowly than usual. Workflow scheduling was unaffected.

August 12, 2025 07:53 PM CEST: First external report of the issue received.

August 12, 2025 08:31 PM CEST: Workflow processing significantly impacted.

August 13, 2025 10:40 AM CEST: Issue identified and fix deployed. Pending workflows resumed execution, working off the queue.

August 13, 2025 07:38 PM CEST: All delayed workflows completed. Normal operations restored.

Root Cause

A change was made to the Orchestrator’s message-handling logic. This change introduced a mechanism requiring the workflow engine to explicitly acknowledge receipt of each message before new messages were submitted for execution. It was originally introduced to prevent the workflow engine from being overloaded during bursts of rapid workflow trigger executions.

Due to a bug in this new acknowledgment logic, some acknowledgments were not properly registered under production traffic. As a result, the system incorrectly assumed that the engine was at full capacity and gradually reduced the number of messages sent, even though the engine had spare capacity.
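To make the failure mode concrete, here is a minimal sketch of window-based flow control with explicit acknowledgments. This is an illustration only, not Phrase's actual code; all names and numbers are hypothetical. It shows how unregistered acknowledgments leak "in-flight" slots until the sender wrongly believes the engine is full.

```python
class AckWindowSender:
    """Sends messages only while unacknowledged messages stay under a window.

    Hypothetical analogue of the acknowledgment mechanism described above.
    """

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.in_flight = 0  # messages sent but not yet acknowledged

    def can_send(self) -> bool:
        return self.in_flight < self.window_size

    def send(self) -> bool:
        if not self.can_send():
            return False  # sender believes the engine is at capacity
        self.in_flight += 1
        return True

    def on_ack(self, registered: bool = True) -> None:
        # Bug analogue: if an acknowledgment arrives but is never registered,
        # the in-flight counter never decreases, permanently "occupying" a slot.
        if registered:
            self.in_flight -= 1


sender = AckWindowSender(window_size=3)

# The engine acknowledges every message, but one ack is not registered:
for lost in (False, True, False, False, False):
    sender.send()
    sender.on_ack(registered=not lost)

# One slot is now leaked even though the engine is idle; enough leaked
# slots and can_send() returns False forever, throttling throughput.
print(sender.in_flight)
```

Each unregistered acknowledgment permanently shrinks the usable window, which is why throughput degraded gradually rather than failing all at once.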

The system did not automatically notify the team about the issue, because the engine was never overloaded at the infrastructure level, which is the condition that would normally trigger an alert and page the on-call engineer.

Importantly:

  • No data or messages were lost.
  • Once the rollback was completed, all queued workflows were processed successfully, though with significant delays.

Actions to prevent recurrence and improve time to resolution

  1. Improve detection and alerting for slow message processing, even when the engine is still responsive. Currently, the system alerts only when the workflow engine infrastructure is actually overloaded.
  2. Investigate a more resilient acknowledgment system. The acknowledgment system was implemented to prevent workflow engine overload in race-condition scenarios; however, a revised system must ensure that acknowledgments are properly registered in all cases.
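The first action item can be sketched as a throughput-based check that fires even when infrastructure metrics look healthy. This is a hypothetical illustration of the idea, not Phrase's monitoring code; the function name, baseline, and threshold are assumptions.

```python
def should_alert(messages_per_min: float, baseline_per_min: float,
                 threshold: float = 0.5) -> bool:
    """Fire an alert when message throughput drops below a fraction of its
    recent baseline, regardless of infrastructure load."""
    return messages_per_min < baseline_per_min * threshold


# The check looks only at throughput, so it would have caught this incident,
# where the engine itself was never overloaded:
should_alert(40.0, 100.0)   # throughput at 40% of baseline: alert
should_alert(80.0, 100.0)   # healthy throughput: no alert
```

A check of this shape complements, rather than replaces, the existing infrastructure-overload alerts.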
Posted Aug 18, 2025 - 13:07 CEST

Resolved

This incident has been resolved. Previously untriggered Orchestrator workflows in the queue have been processed and executed as expected.
Posted Aug 14, 2025 - 10:50 CEST

Monitoring

The engineers have resolved the issue causing Orchestrator workflows to remain “stuck”.
Previously untriggered workflows are now beginning to reprocess and should execute as expected.

Please note that while workflows are no longer stuck, the processing queue will take some time to work through the backlog. As a result, your workflow may not run immediately and could take several hours to complete. Thank you very much for your patience.
Posted Aug 13, 2025 - 12:01 CEST

Investigating

On August 12, 2025 5:45 PM CEST we began experiencing delays in the execution of Orchestrator workflows hosted in the EU Data Center. Our engineers are currently investigating the root cause. We apologize for any inconvenience this may have caused.
Posted Aug 13, 2025 - 10:16 CEST
This incident affected: Phrase Orchestrator (EU) (Legacy Workflow Engine).