We would like to share more details about the events that occurred with Phrase between 09:50 AM CEST and 10:26 AM CEST on April 02, 2024, which led to degraded performance of the file processing component, and what Phrase engineers are doing to prevent these issues from recurring.
09:09 AM CEST: A new version of the service managing asynchronous calls is deployed. It uses a dedicated file storage service that now scans input files for malware.
09:50 AM CEST: The number of input files grows and engineers notice the increasing error rate.
09:59 AM CEST: The first low-priority alert about the increased error rate is triggered.
10:00 AM CEST: The first high-priority alert about the increased error rate is triggered.
10:12 AM CEST: Engineers identify a lack of resources in the malware scanning component and start scaling it up.
10:26 AM CEST: The capacity of the scanning component is increased and the incident is fully resolved.
The component that scans input files for malware did not have enough resources, especially RAM, to handle the peak load.
Any new service will be rolled out to the production environment more gradually, while we observe the load and the overall system behavior.
First, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and determine what changes will improve our services and processes.