July 23, 2024
We would like to share more details about the events that occurred with Phrase between July 23, 2024 09:07 AM CEST and July 23, 2024 09:51 AM CEST, which led to a performance disruption of the IDM (EU) component, and what Phrase engineers are doing to prevent these issues from reoccurring.
23/7/2024 09:06 AM CEST: Staged deployment to PROD US completed without any issues and was validated as fully functional
23/7/2024 09:07 AM CEST: Staged deployment to PROD EU started
23/7/2024 09:10 AM CEST: K8S platform canceled the deployment due to a timeout, leaving the Production DB not fully migrated
23/7/2024 09:12 AM CEST: Health checks indicated issues in Production EU
23/7/2024 09:13 AM CEST: The team responsible for the application started fixing the deployment
23/7/2024 09:28 AM CEST: Half of the production nodes were online but faced significant load, causing them to restart
23/7/2024 09:51 AM CEST: Production EU was fully functional
The issue was caused by database migrations in Production EU (adding indexes to various tables) that took longer than expected. Because the team uses Liquibase for migrations, the canceled deployment left the Liquibase lock set to ‘locked’, which prevented the automatic retry.
In the meantime, the existing pods started to fail due to the unfinished DB migration and the resulting slightly different DB model. This state triggered pod restarts, which also failed because of the Liquibase lock.
As a countermeasure, the lock was manually released, allowing new deployments to finish.
To prevent recurrence, indexes, unique constraints, and other potentially time-consuming database schema changes are now excluded from automatic DB migrations.
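One way such an exclusion could be expressed is sketched below. This is an assumption for illustration and not a description of how Phrase implements the policy: the set SLOW_CHANGE_TYPES and the function split_changes are hypothetical, and the list covers only the two change kinds the report names, written as Liquibase change type names.

```python
# Hypothetical policy list: change types treated as potentially slow.
# The report names indexes and unique constraints; anything else is an assumption.
SLOW_CHANGE_TYPES = {"createIndex", "addUniqueConstraint"}


def split_changes(change_types):
    """Split change types into those run automatically and those run manually."""
    automatic = [c for c in change_types if c not in SLOW_CHANGE_TYPES]
    manual = [c for c in change_types if c in SLOW_CHANGE_TYPES]
    return automatic, manual


automatic, manual = split_changes(
    ["addColumn", "createIndex", "addUniqueConstraint", "createTable"]
)
print(automatic)  # ['addColumn', 'createTable']
print(manual)     # ['createIndex', 'addUniqueConstraint']
```

Changes in the manual group would then be applied outside the deployment pipeline, where a long runtime cannot trigger a deployment timeout.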