Performance Disruption of Most of the Phrase Strings (EU) Components between 3:06 and 3:30 PM CEST
Incident Report for Phrase
Postmortem

Introduction

We would like to share more details about the events that occurred with Phrase between 03:02 PM CEST and 03:32 PM CEST on April 9, 2024, which led to a performance disruption of most Phrase Strings (EU) components, and about what Phrase engineers are doing to prevent these issues from recurring.

Timeline

11:13 AM CEST on April 5, 2024: Network configuration updated to enable peering between Phrase Strings EU and the Analytics component.

03:02 PM CEST on April 9, 2024: The Analytics operations team updates the network configuration to propagate routes between Phrase Strings EU and the Analytics component.

03:06 PM CEST on April 9, 2024: Critical alerts are triggered because Phrase Strings EU is unavailable. On-duty engineers start to investigate.

03:22 PM CEST on April 9, 2024: Root cause identified as related to the preceding network changes. Reconstruction of the original route table begins.

03:32 PM CEST on April 9, 2024: Route table reconstructed with routes enabling all traffic as before the disruption. Systems are immediately restored to the correct state. The engineering team continues to monitor the situation.

Root Cause

The root cause of the incident was the removal of routes and the dissociation of route tables between the Phrase Strings EU application and the database layer. Because the two are deployed in separate networks, traffic between them depends on these routes. The change was part of a new feature rollout carried out by a team that maintains other components.
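To illustrate the mechanism described above, the following is a minimal sketch of a route lookup before and after a route is removed. It is not Phrase's actual configuration: the CIDR ranges, target names, the database host address, and the helper functions `lookup` and `removed_routes` are all hypothetical and chosen only for illustration.

```python
import ipaddress


def lookup(route_table, destination):
    """Return the target of the most specific route matching destination, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table.items()
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # no route: traffic to this destination cannot be delivered
    # Longest prefix (most specific route) wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


def removed_routes(before, after):
    """List destination CIDRs present before a change but missing after it."""
    return sorted(set(before) - set(after))


# Hypothetical route tables for the application network (assumed values).
before = {
    "10.0.0.0/16": "local",
    "10.1.0.0/16": "peering-to-database",
}
after = {
    "10.0.0.0/16": "local",
    "10.2.0.0/16": "peering-to-analytics",
}
DATABASE_HOST = "10.1.4.20"  # hypothetical address in the database network

print(lookup(before, DATABASE_HOST))  # peering-to-database
print(lookup(after, DATABASE_HOST))   # None
print(removed_routes(before, after))  # ['10.1.0.0/16']
```

In this sketch, a comparison such as `removed_routes` run before a change is applied would surface the dropped database route for review.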

The standard procedure, which includes review of the network change and notification of the affected teams, was not fully followed. As a result, the incident response team learned of the change only during the incident itself, which increased the investigation time.

Actions to Prevent Recurrence

Network change procedures will be enhanced to improve communication throughout the process. Network maintenance teams will receive specialized training to ensure compliance with these updated protocols.

Conclusion

First of all, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.

Posted Apr 22, 2024 - 07:20 CEST

Resolved
The incident has been resolved.
Posted Apr 09, 2024 - 15:43 CEST
Update
We are continuing to monitor for any further issues.
Posted Apr 09, 2024 - 15:42 CEST
Monitoring
The fix has been implemented and we are monitoring the results.
Posted Apr 09, 2024 - 15:34 CEST
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 09, 2024 - 15:33 CEST
Investigating
We are investigating the issue.
Posted Apr 09, 2024 - 15:14 CEST
This incident affected: Phrase Strings (EU) (Translation center, GitLab sync, Bitbucket sync, OTA, Email delivery, Ordering, In-context editor, API).