Introduction
We would like to share more details about the workflow disruption that occurred on February 5, 2026, between 15:01 CET and 19:58 CET, which caused workflows to be held in an “Executing” state and delayed workflow processing within Phrase Orchestrator. This affected workflows running on the new engine in the EU data center.
During this time, newly triggered workflows were not progressing as expected. Below we describe what happened, the root cause, and the steps we are taking to prevent similar incidents in the future.
Timeline
Feb 5, 2026
- 16:40 – A Phrase employee reported that workflows for an organization were stuck in “Executing” status.
- 17:05 – Phrase engineers began investigating and observed that the component serving as the entry point for all workflows was unstable.
- 17:10 – To reduce load, the engineers decreased the number of parallel workflow tasks.
- 17:41 – The analysis identified degraded communication with the database.
- 19:28 – After applying multiple load-relief and scalability measures, engineers determined that the issue was caused solely by a small number of extraordinarily large workflows.
- 19:58 – After ruling out negative side effects, the workflows responsible for the excessive load were cancelled.
- 20:01 – Workflows resumed normal execution.
- 20:12 – Previously stuck workflow executions for the affected organizations were marked as “Failed” to ensure system consistency.
Root Cause
The incident was triggered by the execution of a very large workflow with a complex dependency structure. A workflow of this design results in rapid expansion of tasks when processing large payloads.
On February 5, a workflow of this nature was triggered multiple times. Due to the complex dependency structure, this resulted in the creation of more than 140,000 tasks over a short period of time.
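To illustrate why a complex dependency structure expands so quickly, the following sketch (with purely hypothetical stage counts and fan-out factors, not the actual workflow's shape) shows how per-stage fan-out multiplies the task count for a large payload:

```python
# Hypothetical illustration: each stage spawns one task per item produced
# by the previous stage, so task counts grow multiplicatively.
def task_count(payload_items: int, fan_out_per_stage: list[int]) -> int:
    """Total tasks created when every stage fans out per input item."""
    total = 0
    items = payload_items
    for fan_out in fan_out_per_stage:
        items *= fan_out   # each existing item spawns fan_out new tasks
        total += items
    return total

# e.g. 500 payload items through three stages fanning out 5x, 8x, 7x:
print(task_count(500, [5, 8, 7]))  # 162500 tasks from a single trigger
```

With even modest fan-out factors, a handful of triggers of such a workflow can produce task volumes in the six figures.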
Most of these tasks executed requests against a single API, where the API rate limit was reached. This caused a large number of tasks to retry repeatedly.
At the same time, the workflow engine had to evaluate the state of many dependent tasks within the workflow graph. The system executes complex database queries to determine the state of scheduled workflow jobs and their dependencies. With tens of thousands of jobs and large dependency trees, these queries became increasingly slow.
The combination of …
- A very high number of generated jobs
- Frequent retries caused by API rate limiting
- Complex dependency evaluation for large workflows
… led to long-running database queries. These slow queries exhausted available database connections and caused repeated crashes and restarts of multiple workflow engine components. One of the affected components was the entry point for all workflows; its instability temporarily prevented other workflows from progressing.
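The dependency-evaluation cost can be pictured with a simplified model (not the actual engine's query logic): a task is runnable only once every upstream dependency has completed, and checking this for tens of thousands of tasks repeatedly touches the same shared ancestors in the graph.

```python
# Simplified model of dependency-state evaluation. The graph below is
# hypothetical; the engine performs an equivalent check via database
# queries, which is what became expensive at tens of thousands of jobs.
deps = {  # task -> upstream tasks it depends on
    "publish": ["translate", "review"],
    "translate": ["ingest"],
    "review": ["ingest"],
    "ingest": [],
}
completed = {"ingest", "translate"}

def is_runnable(task: str) -> bool:
    # Runnable only when every direct dependency has completed.
    return all(d in completed for d in deps[task])

print(is_runnable("review"))   # True: its only dependency is done
print(is_runnable("publish"))  # False: "review" has not completed yet
```

Performed as a database query per scheduled job, this evaluation scales with both the number of jobs and the depth of the dependency tree, which is why the large workflows degraded the whole engine.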
Once the tasks related to the complex workflows were cancelled, the database load immediately decreased and normal processing resumed.
Actions to Prevent Recurrence
We are taking the following steps to prevent similar incidents in the future:
- Improved Alerting: We are improving our alerting mechanisms to ensure we are notified more quickly when workflows stop progressing. This also includes improved visibility around system exceptions.
- Enhanced Monitoring: We are expanding our monitoring around workflow executions to more quickly identify large workloads that may clog the system.
- Workflow Architecture Redesign: We are redesigning how workflows with complex dependency structures are handled. Complex workflow segments will be encapsulated into separate entities, reducing dependency tree complexity and improving overall processing efficiency.
- Dedicated Database Connection: We are separating the workflow engine from the main application at the database connection level. The engine will use a dedicated connection with appropriate capacity, improving flexibility and ensuring better isolation between components.
- Improved API Rate Limit Handling: We are improving how we call external APIs and how we respond when rate limits are reached.
- Faster Mitigation: We are establishing tooling to more quickly resolve workflow executions that are no longer progressing as expected.
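As one example of the improved rate-limit handling, a common pattern is exponential backoff with jitter, so throttled tasks spread out their retries instead of hammering an already-limited API in lockstep. The sketch below is an assumption about the kind of handling being introduced, not the shipped implementation; `RateLimitError` is a hypothetical exception standing in for an HTTP 429 response.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 'Too Many Requests' response."""

def call_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Retry a rate-limited call with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Exponential backoff capped at max_delay, randomized so that
            # many concurrent tasks do not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

Spreading retries this way keeps a burst of throttled tasks from turning a rate limit into a sustained retry storm.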
We sincerely apologize for the disruption caused by this incident. We are committed to improving the resilience and predictability of our workflow processing system and appreciate the feedback and patience of our customers.