Performance Disruption of Phrase Identity Management (EU) Component between July 23, 2024 09:07 AM CEST and July 23, 2024 09:51 AM CEST
Incident Report for Phrase
Postmortem

Root Cause Analysis

July 23, 2024

Introduction

We would like to share more details about the events that occurred with Phrase between July 23, 2024 09:07 AM CEST and July 23, 2024 09:51 AM CEST, which led to a performance disruption of the IDM (EU) component, and about what Phrase engineers are doing to prevent these issues from recurring.

Timeline

23/7/2024 09:06 AM CEST: Staged deployment to PROD US completed without any issues and was validated as fully functional

23/7/2024 09:07 AM CEST: Staged deployment to PROD EU started

23/7/2024 09:10 AM CEST: The K8S platform canceled the deployment due to a timeout, leaving the Production DB only partially migrated

23/7/2024 09:12 AM CEST: Health checks indicated issues in Production EU

23/7/2024 09:13 AM CEST: The team responsible for the application started fixing the deployment

23/7/2024 09:28 AM CEST: Half of the production nodes were online but faced significant load, causing them to restart

23/7/2024 09:51 AM CEST: Production EU was fully functional again

Root Cause

The issue was caused by database migrations in Production EU (adding indexes to various tables) that took longer than expected. Because the team uses Liquibase, the canceled deployment left the Liquibase lock set to ‘locked’, preventing automatic retries.

In the meantime, the existing pods started to fail due to the unfinished DB migration and the resulting slightly different DB model. This state triggered a pod restart, which also failed due to the Liquibase lock.

As a countermeasure, the lock was manually released, allowing new deployments to finish.
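
The lock behavior described above can be illustrated with a minimal sketch. Liquibase records its lock in a DATABASECHANGELOGLOCK table with the columns ID, LOCKED, LOCKGRANTED and LOCKEDBY. The code below simulates that table in an in-memory SQLite database, which is purely an assumption for illustration: it is not the production database, and the helper names (create_lock_table, try_acquire, release_lock) are hypothetical. It is not a description of how Phrase engineers released the lock. Liquibase also provides its own release-locks command for this purpose.

```python
import sqlite3


def create_lock_table(conn):
    """Create a table shaped like Liquibase's DATABASECHANGELOGLOCK (simulation only)."""
    conn.execute(
        "CREATE TABLE DATABASECHANGELOGLOCK ("
        "ID INTEGER PRIMARY KEY, "
        "LOCKED BOOLEAN NOT NULL, "
        "LOCKGRANTED TIMESTAMP, "
        "LOCKEDBY TEXT)"
    )
    conn.execute(
        "INSERT INTO DATABASECHANGELOGLOCK (ID, LOCKED, LOCKGRANTED, LOCKEDBY) "
        "VALUES (1, 0, NULL, NULL)"
    )
    conn.commit()


def try_acquire(conn, owner):
    """Take the lock only if it is currently free; return True on success."""
    cur = conn.execute(
        "UPDATE DATABASECHANGELOGLOCK "
        "SET LOCKED = 1, LOCKGRANTED = CURRENT_TIMESTAMP, LOCKEDBY = ? "
        "WHERE ID = 1 AND LOCKED = 0",
        (owner,),
    )
    conn.commit()
    return cur.rowcount == 1


def release_lock(conn):
    """Manually clear the lock, as an operator would after a canceled run."""
    conn.execute(
        "UPDATE DATABASECHANGELOGLOCK "
        "SET LOCKED = 0, LOCKGRANTED = NULL, LOCKEDBY = NULL "
        "WHERE ID = 1"
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_lock_table(conn)

    # The first deployment takes the lock and is then canceled without releasing it.
    print(try_acquire(conn, "deployment-1"))

    # A restarting pod cannot acquire the lock, so its migration attempt fails.
    print(try_acquire(conn, "pod-restart"))

    # After a manual release, the next deployment can proceed.
    release_lock(conn)
    print(try_acquire(conn, "deployment-2"))
```

The sketch shows why a canceled run blocks every later attempt: the lock row stays set until something explicitly clears it, regardless of whether the process that took it is still alive.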

Actions to Prevent Recurrence

Indexes, unique constraints, and other potentially time-consuming database schema changes are now excluded from automatic DB migrations.
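
One way such an exclusion could be enforced is a check that scans a changelog before deployment and flags change types that tend to run long. The following is a minimal sketch under stated assumptions: it assumes changelogs in Liquibase's XML format, the set of flagged change types (createIndex, addUniqueConstraint) is an illustrative choice, and the function name find_slow_changes and the sample changelog are hypothetical. The report does not describe how Phrase implements the exclusion.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Liquibase changelog files.
NS = "{http://www.liquibase.org/xml/ns/dbchangelog}"

# Change types treated as potentially time-consuming (an assumption for this sketch).
SLOW_CHANGE_TYPES = {"createIndex", "addUniqueConstraint"}


def find_slow_changes(changelog_xml):
    """Return (changeSet id, change type) pairs for flagged change types."""
    root = ET.fromstring(changelog_xml)
    flagged = []
    for change_set in root.iter(NS + "changeSet"):
        # Inspect only the direct children of each changeSet.
        for change in change_set:
            change_type = change.tag.replace(NS, "")
            if change_type in SLOW_CHANGE_TYPES:
                flagged.append((change_set.get("id"), change_type))
    return flagged


# Hypothetical changelog: one quick change and one index creation.
SAMPLE_CHANGELOG = """<databaseChangeLog xmlns="http://www.liquibase.org/xml/ns/dbchangelog">
  <changeSet id="1" author="example">
    <addColumn tableName="users">
      <column name="nickname" type="varchar(64)"/>
    </addColumn>
  </changeSet>
  <changeSet id="2" author="example">
    <createIndex tableName="users" indexName="idx_users_email">
      <column name="email"/>
    </createIndex>
  </changeSet>
</databaseChangeLog>"""

if __name__ == "__main__":
    for change_set_id, change_type in find_slow_changes(SAMPLE_CHANGELOG):
        print(f"changeSet {change_set_id}: {change_type} should run outside automatic migrations")
```

A check like this could fail a build when flagged changes appear in the automatic changelog, so that they are applied separately rather than during a time-limited deployment.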

Posted Jul 24, 2024 - 12:25 CEST

Resolved
Our engineers resolved this issue. Users can access the Phrase Platform (EU) at eu.phrase.com.
Posted Jul 23, 2024 - 09:54 CEST
Investigating
Our engineers are currently investigating an issue with the Phrase Identity Management (EU) component.
Posted Jul 23, 2024 - 09:32 CEST
This incident affected: Identity management - IDM (EU).