---
layout: markdown_page
title: "Category Strategy - Disaster Recovery"
---

- TOC
{:toc}

## 🚨 Disaster Recovery

### Introduction and how you can help

* [Overall Strategy](/direction/geo)
* [Roadmap for Disaster Recovery](https://gitlab.com/groups/gitlab-org/-/roadmap?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=group%3A%3Ageo&label_name[]=geo%3A%3Aactive&label_name[]=Category%3ADisaster%20Recovery)
* [Maturity: <%= data.categories["disaster_recovery"].maturity.capitalize %>](/direction/maturity)
* [Documentation](https://docs.gitlab.com/ee/administration/geo/disaster_recovery/)
* [Viable Maturity epic](https://gitlab.com/groups/gitlab-org/-/epics/1507)
* [All Epics](https://gitlab.com/groups/gitlab-org/-/epics?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=group%3A%3Ageo&label_name[]=Category%3ADisaster%20Recovery)

GitLab installations hold business-critical information and data. The Disaster Recovery (DR) category helps our customers fulfill their business continuity plans by creating processes that allow the recovery of a GitLab instance following a natural or human-created disaster. Disaster Recovery complements GitLab's [High Availability solution](https://about.gitlab.com/solutions/high-availability/) and utilizes [Geo nodes](https://docs.gitlab.com/ee/administration/geo/replication/) to enable a failover in a disaster situation.

We want disaster recovery to be robust and easy to use for [systems administrators](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#sidney-systems-administrator) - especially in a potentially stressful recovery situation.

Please reach out to Fabian Zimmer, Product Manager for the Geo group ([Email](mailto:fzimmer@gitlab.com)), if you'd like to provide feedback or ask any questions related to this product category.

This strategy is a work in progress, and everyone can contribute:

- Please comment and contribute in the linked [issues](https://gitlab.com/groups/gitlab-org/-/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=group%3A%3Ageo&label_name[]=Category%3ADisaster%20Recovery) and [epics](https://gitlab.com/groups/gitlab-org/-/epics?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=group%3A%3Ageo&label_name[]=Category%3ADisaster%20Recovery) on this page. Sharing your feedback directly on GitLab.com is the best way to contribute to our strategy and vision.

### Current state

⚠️ Currently, there are [some limitations](https://docs.gitlab.com/ee/administration/geo/replication/index.html#current-limitations) on what data is replicated. Please make sure to check the documentation!

Setting up a disaster recovery solution for GitLab requires significant investment and is cumbersome in more complex setups, such as high availability configurations. [Geo doesn't replicate all parts of GitLab yet](https://gitlab.com/groups/gitlab-org/-/epics/893), which means that users need to be aware of what is automatically covered by replication via a [Geo node](https://docs.gitlab.com/ee/administration/geo/replication/) and what parts need to be backed up separately.

### Where we are headed

In the future, our users should be able to use a GitLab Disaster Recovery solution that fits within their business continuity plan. Users should be able to choose which Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are acceptable to them, and GitLab's DR solution should provide configurations that fit those requirements.
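To make RPO concrete: the database replication lag on a Geo secondary is an upper bound on how much recent data a failover would lose. The snippet below is a rough, non-authoritative sketch of reading that lag from the Geo Nodes status API; the instance URL and token are placeholders, and the exact response fields (such as `db_replication_lag_seconds`) should be verified against the API documentation for your GitLab version.

```python
# A rough sketch (not an official tool): read the database replication lag for
# each Geo node from the Geo Nodes status API and treat it as an RPO estimate.
# GITLAB_URL, the token, and the exact field names are assumptions to verify
# against the API documentation for your GitLab version.
import os

import requests

GITLAB_URL = os.environ.get("GITLAB_URL", "https://gitlab.example.com")
TOKEN = os.environ["GITLAB_ADMIN_TOKEN"]  # admin-scoped personal access token


def replication_lag_per_node():
    """Return a mapping of Geo node ID to database replication lag in seconds."""
    response = requests.get(
        f"{GITLAB_URL}/api/v4/geo_nodes/status",
        headers={"PRIVATE-TOKEN": TOKEN},
        timeout=30,
    )
    response.raise_for_status()
    return {
        node["geo_node_id"]: node.get("db_replication_lag_seconds")
        for node in response.json()
    }


if __name__ == "__main__":
    for node_id, lag in replication_lag_per_node().items():
        # A lag of N seconds means a failover right now would lose at most
        # roughly the last N seconds of database writes.
        print(f"Geo node {node_id}: replication lag = {lag} seconds")
```

A check along these lines could also feed the monitoring described below, so that an administrator always knows whether the secondary is close enough to the primary to meet the agreed RPO.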
A systems administrator should be able to confidently set up a DR solution even when the setup is complex, as is the case for high availability. In case of an actual disaster, a systems administrator should be able to follow a simple and clear set of instructions that allows them to recover a working GitLab installation. To ensure that DR works, planned failovers should be tested frequently.

We envision that GitLab's Disaster Recovery processes and solution should:

* cover different scenarios based on acceptable Recovery Time Objective (RTO) and Recovery Point Objective (RPO). There is always a trade-off between the complexity of the required setup and the RTO/RPO it can deliver; GitLab's DR strategies should make this trade-off explicit to users.
* clearly define which data is replicated and why it is relevant for customers.
* by default allow the recovery of *all* customer-relevant data that was available on the production instance. Users should not need to think about caveats or exclusions.
* be as simple to execute as possible. All instructions should fit on one laptop screen (fewer than 10 steps) and be linear and easy to follow.
* allow for planned failover testing that ensures DR is fully functional.
* integrate into a more holistic approach that includes High Availability and Geo-distributed configurations.
* be complemented by monitoring that can detect a potential disaster.
* be actively used on GitLab.com to ensure that all best practices are followed and that we dogfood our own solutions.
* scale from small installations with hundreds of users to extremely large installations with millions of users.

### Target audience and experience

#### [Sidney - (Systems Administrator)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#sidney-systems-administrator)

* 🙂 **Minimal** - Sidney can manually configure a DR solution using Geo nodes. More complex configurations, such as HA, are supported but are highly manual to set up. Some data may not be replicated. Failovers are manual.
* 😊 **Viable** - Sidney can follow a set of clearly defined procedures for planned failovers. DR is available for single-node configurations, and HA configurations are fully supported. All data is replicated.
* 😁 **Complete** - Sidney can choose between different configurations that clearly link back to suggested RTO and RPO requirements. Configuration is simple and all solutions are constantly monitored. A dashboard informs users of the current status. A recovery process takes fewer than 10 steps.
* 😍 **Lovable** - Automatic failovers are supported.

For more information on how we use personas and roles at GitLab, please [click here](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/).

### What's Next & Why

Please note that we are working in parallel on accelerating [how new data types can be added to Geo nodes](https://about.gitlab.com/direction/geo/geo_replication/#building-a-self-service-geo-framework); this impacts DR but is part of the [Geo-replication category](https://about.gitlab.com/direction/geo/geo_replication/).

#### Enable Geo on GitLab.com

GitLab.com is by far the largest GitLab instance and is used by GitLab to [dogfood GitLab itself](https://about.gitlab.com/handbook/engineering/index.html#dogfooding). Currently, GitLab.com does not use GitLab Geo for DR purposes. This has many disadvantages, and the Geo Team is working with Infrastructure to enable Geo on GitLab.com. We are currently [pursuing enabling it on staging first](https://gitlab.com/groups/gitlab-org/-/epics/1908).

#### Improving the planned failover process

We want DR processes to be simpler and believe that [improving the planned failover process](https://gitlab.com/groups/gitlab-org/-/epics/2148) is the best place to start. DR procedures should be tested regularly, and we are aiming to provide better support for this.

A simple example of a planned failover process would be (a rough sketch of this sequence follows the list):

* Activate maintenance mode (on all nodes)
* Wait for primary and secondary to be fully in sync
* Pause any further replication
* Promote secondary to primary
* Re-point DNS
* End maintenance mode
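To illustrate how few moving parts such a failover should eventually involve, here is a minimal, non-authoritative sketch of the sequence above. Only `gitlab-ctl promote-to-primary-node` is taken from the current disaster recovery documentation for single-node secondaries; the maintenance-mode, sync-wait, replication-pause, and DNS steps are deliberately left as placeholders, because those are exactly the pieces the epics in the following sections propose to make first-class.

```python
# A rough, illustrative sketch of the planned failover sequence above; it is
# not an official or supported procedure. Hostnames are placeholders. Only
# `gitlab-ctl promote-to-primary-node` comes from the current disaster
# recovery documentation; the stubbed steps are the parts that the
# maintenance-mode and pause/resume epics below aim to make easy.
import subprocess


def run_on(host: str, command: str) -> None:
    """Run a command on a node over SSH (simplified, no retries or auditing)."""
    subprocess.run(["ssh", host, command], check=True)


def enable_maintenance_mode(host: str) -> None:
    # Placeholder: GitLab has no built-in maintenance mode yet (see the epic
    # below); today writes are usually blocked at the load balancer or firewall.
    print(f"TODO: put {host} into maintenance / read-only mode")


def wait_until_in_sync(secondary: str) -> None:
    # Placeholder: e.g. poll the Geo status API until replication lag and the
    # sync queues reach zero.
    print(f"TODO: wait until {secondary} reports zero replication lag")


def pause_replication(secondary: str) -> None:
    # Placeholder: pausing/resuming database replication is currently a manual
    # PostgreSQL-level task; an epic below proposes making this a simple step.
    print(f"TODO: pause database replication on {secondary}")


def repoint_dns(secondary: str) -> None:
    # Placeholder: DNS changes are environment-specific.
    print(f"TODO: point the primary DNS entry at {secondary}")


def planned_failover(primary: str, secondary: str) -> None:
    enable_maintenance_mode(primary)
    wait_until_in_sync(secondary)
    pause_replication(secondary)
    # Promote the secondary so it starts acting as the new primary.
    run_on(secondary, "gitlab-ctl promote-to-primary-node")
    repoint_dns(secondary)
    # End maintenance mode once the new primary is serving traffic.
    print("TODO: end maintenance mode")


if __name__ == "__main__":
    planned_failover("gitlab-primary.example.com", "gitlab-secondary.example.com")
```

Even in sketch form, the sequence shows why a built-in maintenance mode and easy pausing and resuming of replication (both discussed below) matter: today those steps have no single documented command.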
#### Add a GitLab maintenance mode

As stated above, part of a planned failover process is usually putting your instance into [a maintenance mode](https://gitlab.com/groups/gitlab-org/-/epics/2149). This would block any write operations and would allow a primary and secondary to be fully in sync before making the switch. Additionally, a maintenance period may be useful in other situations, e.g. during upgrades or other infrastructure changes.

#### Replication should be easy to pause and resume

Right now, DR depends on PostgreSQL streaming replication via a Geo node. It should be easy [to pause and resume the database replication](https://gitlab.com/groups/gitlab-org/-/epics/2159) during a planned failover or upgrade event.

### What is not planned right now

We currently don't plan to replace PostgreSQL with a different database, e.g. CockroachDB.

### Maturity plan

This category is currently at the <%= data.categories["disaster_recovery"].maturity %> maturity level, and our next maturity target is viable (see our [definitions of maturity levels](/direction/maturity)).

In order to move this category from <%= data.categories["disaster_recovery"].maturity %> to viable, the main initiatives are to create a simplified disaster recovery process, to enable DR via Geo on GitLab.com, and to add a maintenance mode. You can track the work in the [viable maturity epic](https://gitlab.com/groups/gitlab-org/-/epics/1507).

### Competitive landscape

We have to understand the current DR landscape better, and we are actively engaging with customers to understand what features are required to move the DR category forward.

### Analyst landscape

We do need to interact more closely with analysts to understand the landscape better.

### Top customer success/sales issue(s)

* https://gitlab.com/groups/gitlab-org/-/epics/893

### Top user issues

* [Category issues listed by popularity](https://gitlab.com/groups/gitlab-org/-/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=group%3A%3Ageo&label_name[]=Category%3ADisaster%20Recovery)

### Top internal customer issues/epics

* [Geo for DR on GitLab.com](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/12)

### Top strategy item(s)

- [Improving the planned failover process](https://gitlab.com/groups/gitlab-org/-/epics/2148)
- [Create a maintenance / read-only mode](https://gitlab.com/groups/gitlab-org/-/epics/2149)