Degraded Performance of Phrase Orchestrator (EU) Next-Gen Workflow Engine between February 05, 05:38 PM CET and February 05, 08:13 PM CET

Incident Report for Phrase

Postmortem

Introduction

We would like to share more details about the workflow disruption that occurred on February 5, 2026, between 15:01 CET and 19:58 CET, which led to workflows being held in an “Executing” state and delays in workflow processing within Phrase Orchestrator. This affected workflows running on the new engine in the EU data center.

During this time, newly triggered workflows were not progressing as expected. Below we describe what happened, the root cause, and the steps we are taking to prevent similar incidents in the future.

Timeline

Feb 5, 2026

  • 16:40 – A Phrase employee reported that workflows for an organization were stuck in “Executing” status.
  • 17:05 – Phrase engineers began investigating and observed that the component acting as the entry point for all workflows was unstable.
  • 17:10 – To reduce the load, the engineers decreased the number of parallel workflow tasks.
  • 17:41 – The analysis identified degraded communication with the database.
  • 19:28 – After applying multiple load-relief and scalability measures, engineers determined that the issue was caused solely by a small number of extraordinarily large workflows.
  • 19:58 – After ruling out negative side effects, the workflows responsible for the excessive load were cancelled.
  • 20:01 – Workflows resumed normal execution.
  • 20:12 – Previously stuck workflow executions for the affected organizations were marked as “Failed” to ensure system consistency.

Root Cause

The incident was triggered by the execution of a very large workflow with a complex dependency structure. A workflow of this design results in rapid expansion of tasks when processing large payloads.

On February 5, a workflow of this nature was triggered multiple times. Due to the complex dependency structure, this resulted in the creation of more than 140,000 tasks over a short period of time.
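The exact shape of the affected workflow is not public, but the multiplicative effect described above can be illustrated with a toy model. The payload size, stage count, and trigger count below are hypothetical, chosen only to reach the reported order of magnitude:

```python
# Toy model (not Phrase's actual engine): when each dependent stage of a
# workflow fans out one task per payload item, and the workflow is
# triggered repeatedly, total task counts multiply quickly.

def task_count(payload_items: int, fanout_stages: int, triggers: int) -> int:
    """Total tasks created when each of `fanout_stages` stages spawns one
    task per payload item, across `triggers` workflow triggers."""
    return triggers * fanout_stages * payload_items

# Illustrative only: 7 triggers of a 4-stage workflow over a 5,000-item
# payload already exceed 140,000 tasks.
print(task_count(5_000, 4, 7))  # 140000
```

The point of the model is that none of the individual numbers look alarming on their own; it is their product that overwhelms the scheduler.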

Most of these tasks executed requests against a single API, where the API rate limit was reached. This caused a large number of tasks to retry repeatedly.
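The postmortem does not describe how the retries were scheduled, but the standard way to keep rate-limited retries from amplifying load is capped exponential backoff with jitter, honoring any server-provided retry delay. A minimal sketch, with hypothetical names and a dict standing in for an HTTP response:

```python
import random
import time

def call_with_backoff(call, max_attempts=5, base_delay=0.5, cap=30.0):
    """Retry a rate-limited call with capped exponential backoff and full
    jitter, rather than retrying immediately (which amplifies load when
    thousands of tasks hit the same rate limit at once)."""
    for attempt in range(max_attempts):
        response = call()
        if response.get("status") != 429:        # not rate-limited: done
            return response
        retry_after = response.get("retry_after")
        # Prefer the server's hint; otherwise back off exponentially.
        delay = retry_after if retry_after else min(cap, base_delay * 2 ** attempt)
        time.sleep(random.uniform(0, delay))     # full jitter spreads retries out
    raise RuntimeError("rate limit still exceeded after retries")
```

Jitter matters here: without it, all tasks that failed together retry together, recreating the same spike against the API on every attempt.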

At the same time, the workflow engine had to evaluate the state of many dependent tasks within the workflow graph. The system executes complex database queries to determine the state of scheduled workflow jobs and their dependencies. With tens of thousands of jobs and large dependency trees, these queries became increasingly slow.
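The actual database queries are not shown in the report, but the cost pattern can be sketched with an in-memory analogue. In the naive approach below (hypothetical, not Phrase's implementation), every evaluation rescans the dependencies of every pending job, so work grows with both the number of jobs and the size of their dependency sets:

```python
def ready_jobs(deps: dict[str, set[str]], done: set[str]) -> list[str]:
    """Naive readiness check: for every pending job, rescan all of its
    dependencies to see whether they have completed. With tens of
    thousands of jobs and large dependency trees, this per-evaluation
    rescan is the in-memory analogue of the slow database queries.
    (An incremental design keeps a counter of unmet dependencies per job
    instead of rescanning.)"""
    return [job for job, parents in deps.items()
            if job not in done and parents <= done]

# Tiny example graph: b depends on a; c depends on a and b.
deps = {"a": set(), "b": {"a"}, "c": {"a", "b"}}
print(ready_jobs(deps, done=set()))    # ['a']
print(ready_jobs(deps, done={"a"}))    # ['b']
```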

The combination of …

  • A very high number of generated jobs
  • Frequent retries caused by API rate limiting
  • Complex dependency evaluation for large workflows

… led to long-running database queries. These slow queries exhausted the available database connections and caused repeated crashes and restarts of multiple workflow engine components. One of the affected components was the entry point for all workflows; its instability temporarily prevented other workflows from progressing.

Once the tasks related to the complex workflows were cancelled, the database load immediately decreased and normal processing resumed.

Actions to Prevent Recurrence

We are taking the following steps to prevent similar incidents in the future:

  • Improved Alerting: We are improving our alerting mechanisms to ensure we are notified more quickly when workflows stop progressing. This also includes improved visibility around system exceptions.
  • Enhanced Monitoring: We are expanding our monitoring around workflow executions to more quickly identify large workloads that may clog the system.
  • Workflow Architecture Redesign: We are redesigning how workflows with complex dependency structures are handled. Complex workflow segments will be encapsulated into separate entities, reducing dependency tree complexity and improving overall processing efficiency.
  • Dedicated Database Connection: We are separating the workflow engine from the main application at the database connection level. The engine will use a dedicated connection with appropriate capacity, improving flexibility and ensuring better isolation between components.
  • Improved API Rate Limit Handling: We are improving how we call external APIs and how we react when rate limits are reached.
  • Faster Mitigation: We are establishing tooling to more quickly resolve workflow executions that are no longer progressing as expected.
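The "Dedicated Database Connection" item above is a standard isolation pattern: give each component its own bounded connection budget so a surge in one component cannot exhaust the connections every component shares. A minimal sketch of that budget, with hypothetical pool names and a string standing in for a real connection:

```python
import threading
from contextlib import contextmanager

class BoundedPool:
    """Per-component connection budget: a semaphore caps how many
    connections this component may hold at once, so a surge in one
    component cannot starve the others."""
    def __init__(self, name: str, max_connections: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_connections)

    @contextmanager
    def connection(self, timeout: float = 5.0):
        # Fail fast with an error instead of waiting indefinitely for a
        # slot that a stuck component may never release.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError(f"{self.name}: connection budget exhausted")
        try:
            yield f"conn:{self.name}"   # stand-in for a real DB connection
        finally:
            self._slots.release()

# Separate budgets: the workflow engine cannot consume the main
# application's connections, and vice versa.
engine_pool = BoundedPool("workflow-engine", max_connections=20)
app_pool = BoundedPool("main-app", max_connections=50)
```

With this shape, a runaway workflow can exhaust only its own pool; the main application keeps serving traffic while the engine degrades in isolation.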

We sincerely apologize for the disruption caused by this incident. We are committed to improving the resilience and predictability of our workflow processing system and appreciate the feedback and patience of our customers.

Posted Feb 19, 2026 - 10:33 CET

Resolved

This incident has been resolved.
Posted Feb 05, 2026 - 20:47 CET

Update

We are continuing to monitor for any further issues.
Posted Feb 05, 2026 - 20:20 CET

Monitoring

A fix has been implemented and the system is currently stable. We are continuing to monitor the situation.
Posted Feb 05, 2026 - 20:20 CET

Identified

The issue has been identified and a fix is being implemented.
Posted Feb 05, 2026 - 19:59 CET

Update

We are continuing to investigate this issue.
Posted Feb 05, 2026 - 18:23 CET

Investigating

Engineering has identified an issue with Orchestrator where the new Workflow Engine is currently not executing workflows. The problem is under investigation.
Posted Feb 05, 2026 - 17:56 CET
This incident affected: Phrase Orchestrator (EU) (Next-Gen Workflow Engine).