Outage of the connector service and APC component for Phrase TMS (EU) and Phrase TMS (US) between March 15 08:00 PM CET and March 17 04:49 AM CET
Incident Report for Phrase
Postmortem

Introduction

We would like to share more details about the events that occurred at Phrase between 8:00 PM CET on March 15, 2024 and 4:49 AM CET on March 17, 2024, which led to a gradual outage of the connector service, APCs and CJs, and about what Phrase engineers are doing to prevent these issues from happening again.

Timeline

8:00 PM CET on March 15, 2024: An incorrect configuration was deployed to production.

7:30 PM CET on March 16, 2024: First connector instance became unavailable.

12:00 AM CET on March 17, 2024: Second connector instance became unavailable.

2:35 AM CET on March 17, 2024: Incident created.

4:50 AM CET on March 17, 2024: Mount target created manually; connector instances became available.

10:34 PM CET on March 18, 2024: Permanent long-term solution prepared and applied.

Root Cause

The root cause of the incident was the inadvertent removal of the mount target for the Amazon Elastic File System (EFS) volume during the cleanup that followed the migration from Amazon EC2 to Amazon EKS. The existing EFS module determined the mount target subnet from the subnet of the provisioned EC2 instance. When the EC2 instances were removed during the cleanup, that subnet information was no longer available, and the mount target was removed unintentionally. The EFS volume and its access point remained intact but could not be reached without the mount target, which caused the connector service outage.
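
The dependency described above can be illustrated with a small model. This is a minimal sketch, not the actual EFS module: the names Instance, desired_mount_target_subnets and plan_removals, as well as the instance and subnet identifiers, are hypothetical. It only shows how a desired state derived from EC2 instances becomes empty once those instances are removed, so that an existing mount target is planned for removal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical stand-in for a provisioned EC2 instance."""
    instance_id: str
    subnet_id: str


def desired_mount_target_subnets(instances):
    """Derive mount target subnets from the instances (the fragile coupling)."""
    return {instance.subnet_id for instance in instances}


def plan_removals(existing_subnets, desired_subnets):
    """Mount targets that exist but are no longer desired get removed."""
    return set(existing_subnets) - set(desired_subnets)


# One mount target exists in a placeholder subnet.
existing = {"subnet-a"}

# Before the cleanup an EC2 instance still pins the subnet.
before_cleanup = [Instance("i-connector-1", "subnet-a")]
# After the cleanup no EC2 instance is left.
after_cleanup = []

removed_before = plan_removals(existing, desired_mount_target_subnets(before_cleanup))
removed_after = plan_removals(existing, desired_mount_target_subnets(after_cleanup))

print("removed before cleanup:", sorted(removed_before))
print("removed after cleanup:", sorted(removed_after))
```

In this model the file system itself is never touched. Only the mount target disappears, which matches the observation that the volume and access point stayed intact but unreachable.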

Actions to Prevent Recurrence

Alter the existing EFS module so that mount target resources are kept even when no EC2 instance is provisioned.
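
As a complement to the module change, a periodic check could confirm that the expected mount targets still exist. The following is a minimal sketch under stated assumptions and not something described in this report: it expects a boto3 EFS client to be passed in, and the function names, the file system ID and the subnet IDs are placeholders chosen for illustration.

```python
def available_mount_target_subnets(efs_client, file_system_id):
    """Return the subnet IDs that have a mount target in the 'available' state."""
    subnets = set()
    marker = None
    while True:
        kwargs = {"FileSystemId": file_system_id}
        if marker:
            kwargs["Marker"] = marker
        response = efs_client.describe_mount_targets(**kwargs)
        for mount_target in response.get("MountTargets", []):
            if mount_target.get("LifeCycleState") == "available":
                subnets.add(mount_target["SubnetId"])
        # Follow pagination until no further marker is returned.
        marker = response.get("NextMarker")
        if not marker:
            return subnets


def missing_mount_target_subnets(efs_client, file_system_id, required_subnet_ids):
    """Return required subnets that currently lack an available mount target."""
    available = available_mount_target_subnets(efs_client, file_system_id)
    return set(required_subnet_ids) - available
```

With boto3 installed, the client would typically be created with boto3.client("efs"), and a non-empty result from missing_mount_target_subnets could raise an alert. The key design point is that the required subnets are listed explicitly instead of being derived from running instances.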

Conclusion

Firstly, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.

Posted Mar 25, 2024 - 13:29 CET

Resolved
Phrase TMS (EU) and Phrase TMS (US) suffered a gradual outage of the connector service and APC between March 15 08:00 PM CET and March 17 04:49 AM CET
Posted Mar 17, 2024 - 13:21 CET