--- title: "GitLab.com database incident" author: GitLab author_twitter: gitlab categories: engineering image_title: description: Yesterday we had a serious incident with one of our databases. We lost six hours of database data (issues, merge requests, users, comments, snippets, etc.) for GitLab.com. --- Update: please see [our postmortem for this incident](/blog/2017/02/10/postmortem-of-database-outage-of-january-31/) Yesterday we had a serious incident with one of our databases. We lost six hours of database data (issues, merge requests, users, comments, snippets, etc.) for GitLab.com. Git/wiki repositories and self-hosted installations were not affected. Losing production data is unacceptable and in a few days we'll publish a post on why this happened and a list of measures we will implement to prevent it happening again. _**Update 6:14pm UTC: GitLab.com is back online**_ As of time of writing, we’re restoring data from a six-hour-old backup of our database. This means that any data between 5:20pm UTC and 11:25pm UTC from the database (projects, issues, merge requests, users, comments, snippets, etc.) is lost by the time GitLab.com is live again. **Git data (repositories and wikis) and self-hosted instances of GitLab are not affected.** Read below for a brief summary of the events. You’re also welcome to view [our active postmortem doc](https://docs.google.com/document/d/1GCK53YDcBWQveod9kfzW-VCxIABGiryG7_z_6jHdVik/pub). ## First incident At 2017/01/31 6pm UTC, we detected that spammers were hammering the database by creating snippets, making it unstable. We then started troubleshooting to understand what the problem was and how to fight it.  At 2017/01/31 9pm UTC, this escalated, causing a lockup on writes on the database, which caused some downtime.  ### Actions taken - We blocked the spammers based on IP address - We removed a user for using a repository as some form of CDN, resulting in 47 000 IPs signing in using the same account (causing high DB load) - We removed users for spamming (by creating snippets) ## Second incident At 2017/01/31 10pm UTC, we got paged because DB Replication lagged too far behind, effectively stopping. This happened because there was a spike in writes that were not processed ontime by the secondary database.   ### Actions taken - Attempt to fix `db2`, it’s lagging behind by about 4 GB at this point - `db2.cluster` refuses to replicate, `/var/opt/gitlab/postgresql/data` is wiped to ensure a clean replication - `db2.cluster` refuses to connect to `db1`, complaining about `max_wal_senders` being too low. This setting is used to limit the number of `WAL (= replication)` clients - _Team-member-1_ adjusts `max_wal_senders` to `32` on `db1`, restarts PostgreSQL - PostgreSQL complains about too many semaphores being open, refusing to start - _Team-member-1_ adjusts `max_connections` to `2000` from `8000`, PostgreSQL starts again (despite `8000` having been used for almost a year) - `db2.cluster` still refuses to replicate, though it no longer complains about connections; instead it just hangs there not doing anything - At this point frustration begins to kick in. Earlier this night _team-member-1_ explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t due to the replication problems popping up all of a sudden. 
## Third incident

At 2017/01/31 11pm-ish UTC, _team-member-1_ thinks that perhaps `pg_basebackup` is refusing to work because the PostgreSQL data directory is present (despite being empty), and decides to remove the directory. After a second or two he notices he ran it on `db1.cluster.gitlab.com` instead of `db2.cluster.gitlab.com`.

At 2017/01/31 11:27pm UTC, _team-member-1_ terminates the removal, but it's too late. Of around 300 GB, only about 4.5 GB is left. We had to bring GitLab.com down and shared this information on Twitter:
> We are performing emergency database maintenance, https://t.co/r11UmmDLDE will be taken offline
>
> — GitLab.com Status (@gitlabstatus) January 31, 2017

> We accidentally deleted production data and might have to restore from backup. Google Doc with live notes https://t.co/EVRbHzYlk8
>
> — GitLab.com Status (@gitlabstatus) February 1, 2017