Degraded Performance of File Processing component in Phrase TMS (EU) between 09:58 and 10:26 AM CEST

Incident Report for Phrase

Postmortem

Introduction

We would like to share more details about the events that occurred with Phrase between 09:50 AM CEST and 10:26 AM CEST on April 02, 2024, which led to degraded performance of the file processing component, and about what Phrase engineers are doing to prevent these issues from recurring.

Timeline

09:09 AM CEST: A new version of the service managing asynchronous calls is deployed. It now uses a dedicated file storage service that scans input files for malware.

09:50 AM CEST: The number of input files grows and engineers notice the increasing error rate.

09:59 AM CEST: The first low-priority alert about the increased error rate is triggered.

10:00 AM CEST: The first high-priority alert about the increased error rate is triggered.

10:12 AM CEST: The engineers identify a lack of resources in the malware scanning component and start scaling it up.

10:26 AM CEST: The capacity of the scanning component is increased and the incident is fully resolved.

Root Cause

The component that scans input files for malware did not have enough resources, especially RAM, to handle the peak load.

Actions to Prevent Recurrence

Any new service will be rolled out more gradually in the production environment while we observe the load and the overall system behavior.
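
For illustration only, a gradual rollout of this kind can be sketched as a loop that raises the share of traffic in steps and stops as soon as the observed error rate crosses a threshold. This is a minimal sketch and not a description of Phrase's actual tooling: the function names, the traffic steps, the 1% threshold, and the simulated error rates are all hypothetical.

```python
from typing import Callable, Sequence


def staged_rollout(
    steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> int:
    """Return the highest traffic percentage that stayed within the error budget.

    `error_rate_at` is a hypothetical hook that routes the given percentage
    of traffic to the new service and reports the observed error rate.
    """
    reached = 0
    for percent in steps:
        if error_rate_at(percent) > max_error_rate:
            # Stop here; the caller keeps the last healthy percentage.
            break
        reached = percent
    return reached


def simulated_error_rate(percent: int) -> float:
    # Hypothetical numbers: the new component copes with up to 25% of
    # traffic and starts failing once it receives more than that.
    return 0.001 if percent <= 25 else 0.20


final = staged_rollout([5, 25, 50, 100], simulated_error_rate, max_error_rate=0.01)
print(final)  # 25
```

In this simulated run the rollout stops at 25% of traffic, so a capacity problem would surface while most requests still use the previous version.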

Conclusion

First, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.

Posted Apr 03, 2024 - 15:05 CEST

Resolved

Between 09:58 AM and 10:26 AM CEST we experienced degraded performance of the File Processing component in Phrase TMS (EU). The issue prevented users from creating jobs. It has been resolved, and the service has been working as expected again since 10:26 AM CEST.

Posted Apr 02, 2024 - 09:58 CEST