CAT Editor: Segment persistence issue

Incident Report for Phrase

Postmortem

Introduction

We would like to share more details about the event that occurred with Phrase at 1:15 PM CEST on April 8th, 2025, and its effects, which gradually diminished until 12:11 AM CEST on April 9th, 2025. During this period, the functionality of the CAT editor was impacted, causing issues with the persistence of some users' segment adjustments. The issue affected only users who were actively using the editor at the time of deployment and did not act on the notification to refresh their browser. Users who opened the editor after the deployment were not impacted at all. In this report, we outline the timeline and the root causes, and detail the measures we are taking to prevent this and similar issues from recurring.

Timeline

  • Apr 8, 2025 @ 13:22 CET - Go-live of a new CAT editor version. The persistence part of this version was unintentionally incompatible with the old client, which active sessions were still using.
  • Apr 8, 2025 @ 14:00 CET - As users switched jobs or reloaded the page, the number of affected users dropped by 50%.
  • Apr 8, 2025 @ 14:29 CET - The Phrase Support team received the first report indicating a potential issue; however, reports were arriving in very low numbers.
  • Apr 8, 2025 @ 15:47 CET - As additional reports came in, the issue was escalated to the development team with higher priority, and the team started to investigate the root cause.
  • Apr 8, 2025 @ 16:05 CET - The root cause was identified, and the team assessed the current impact on customers and possible mitigating actions; the support team started to advise users to reload the page.
  • Apr 8, 2025 @ 17:00 CET - The issue had been resolved for 90% of affected users.
  • Apr 8, 2025 @ 19:15 CET - Further hot-fixes were put on hold to avoid additional disruption, given the low remaining volume (4%) of affected sessions.
  • Apr 9, 2025 @ 00:11 CET - The last instance of a failed segment save was observed.

Root Cause

The editor team normally follows a ‘non-breaking change’ deployment approach, which avoids any kind of interruption for users who are currently working and thus enables zero-downtime deployments. However, the new version introduced a semantic breaking change in one of the APIs. As a result, segment modifications - such as translating, editing pre-translated content, confirming, locking, and similar actions - were not properly persisted when they were sent from the old client version.

This breaking change was also not caught by the automated testing suite, as the syntactic API contract was met. Because the failure occurred only under specific semantic conditions, it did not trigger general error messages: changes appeared successful in the UI but were not actually persisted. As a result, the development team had no visibility into the problem: no runtime errors, logs, or monitoring alerts were triggered in the staging environment, and no issues were reported by users during the 30-hour canary phase.
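
To illustrate in general terms how a change can satisfy a syntactic contract and still break persistence for an older client, here is a minimal Python sketch. It does not show the actual Phrase API or the actual change, neither of which is described in this report. The payload fields, the `SegmentStore` class, and the reinterpretation of `revision` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional, Tuple


def is_syntactically_valid(payload: dict) -> bool:
    """Shape-only check, similar in spirit to a syntactic contract test."""
    return (
        isinstance(payload.get("segment_id"), str)
        and isinstance(payload.get("text"), str)
        and isinstance(payload.get("revision"), int)
    )


class SegmentStore:
    """Hypothetical in-memory stand-in for a persistence backend."""

    def __init__(self) -> None:
        # segment_id -> (text, revision)
        self._data: Dict[str, Tuple[str, int]] = {}

    def read(self, segment_id: str) -> Optional[str]:
        entry = self._data.get(segment_id)
        return entry[0] if entry else None

    def save_v1(self, payload: dict) -> dict:
        # Assumed OLD semantics: "revision" is the revision the client last saw.
        _, current = self._data.get(payload["segment_id"], ("", 0))
        if payload["revision"] == current:
            self._data[payload["segment_id"]] = (payload["text"], current + 1)
        return {"ok": True}

    def save_v2(self, payload: dict) -> dict:
        # Assumed NEW semantics: "revision" is the revision being written.
        _, current = self._data.get(payload["segment_id"], ("", 0))
        if payload["revision"] == current + 1:
            self._data[payload["segment_id"]] = (payload["text"], payload["revision"])
        # Unexpected revisions are dropped, yet the reply still reports success.
        return {"ok": True}


def old_client_round_trip(save, store: SegmentStore) -> bool:
    """Send an old-client payload, then read back to verify it was persisted."""
    payload = {"segment_id": "s1", "text": "Hallo Welt", "revision": 0}
    assert is_syntactically_valid(payload)  # the shape check passes either way
    response = save(payload)
    return bool(response["ok"]) and store.read("s1") == "Hallo Welt"
```

In this sketch, `old_client_round_trip` succeeds against `save_v1` and fails against `save_v2`, even though the payload is valid and both handlers report success. A read-after-write check of this kind is one way such a silent failure could be surfaced; whether it matches the test step planned by the team is not stated here.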

Users who opened the CAT editor after the deployment - or refreshed their session in response to the new version notification - experienced no disruption. Overall, according to our investigation, a very small percentage of all Phrase customers and users were affected.

Actions to Prevent Recurrence

  • The CAT editor team will implement additional monitoring and alerting to identify discrepancies between inbound and outbound data modifications.
  • An additional automated test step will be introduced to strengthen quality and to detect potential compatibility issues during the development process, especially those that do not break the API contract.
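
As one possible reading of the monitoring action above, the following Python sketch compares the number of save requests received with the number actually persisted, grouped by client version, and flags versions where the gap exceeds a threshold. The class name, method names, version labels, and the 1% threshold are assumptions made for illustration and do not describe Phrase's actual monitoring.

```python
from collections import Counter
from typing import Dict


class PersistenceMonitor:
    """Hypothetical counter-based check for saves that were received but not persisted."""

    def __init__(self, max_loss_ratio: float = 0.01) -> None:
        self.received: Counter = Counter()   # save requests accepted, per client version
        self.persisted: Counter = Counter()  # writes confirmed by storage, per client version
        self.max_loss_ratio = max_loss_ratio  # assumed alert threshold

    def record_received(self, client_version: str) -> None:
        self.received[client_version] += 1

    def record_persisted(self, client_version: str) -> None:
        self.persisted[client_version] += 1

    def discrepancies(self) -> Dict[str, float]:
        """Return client versions whose share of unpersisted saves exceeds the threshold."""
        flagged: Dict[str, float] = {}
        for version, received in self.received.items():
            lost = received - self.persisted[version]
            ratio = lost / received
            if ratio > self.max_loss_ratio:
                flagged[version] = ratio
        return flagged
```

For example, if 100 saves are recorded as received from an `"old"` client version with none persisted, while every save from a `"new"` version is persisted, `discrepancies()` reports only `"old"`, with a loss ratio of 1.0. Grouping by client version is a design choice of this sketch: it would make a problem limited to older sessions stand out even when the overall loss rate is small.
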
Posted Apr 11, 2025 - 18:01 CEST

Resolved

A change in the new CAT editor version impacted the persistence of segment adjustments for users who had the editor open during the deployment and did not refresh the page.
Posted Apr 08, 2025 - 13:00 CEST