---
layout: handbook-page-toc
title: "Design: Storage Nodes"
---

## On this page
{:.no_toc .hidden-md .hidden-lg}

- TOC
{:toc .hidden-md .hidden-lg}

Issue: [`infra/5088`](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5088)

# Idea/Problem Statement

We currently have a fleet of 24 storage nodes where Git data is stored, which requires access to a POSIX-compliant file system. Each of these nodes has about 16 TB of storage and runs an ext4 file system. While we use mirroring to address potential faults in the storage subsystem of these nodes (e.g., underlying device failures), we do not have a strategy to protect against other failure modes (e.g., file system corruption, involuntary deletions), nor a disaster recovery strategy (e.g., for complete loss of data, though Geo does cover this). We also lack the ability to use the data in these systems for testing in a safe manner, which is desirable in order to validate GitLab against its large data set.

# Design

GitLab's storage architecture for Git data implements a simple and boring approach by using standalone storage nodes running the Gitaly service. At a very basic level, the application uses a project lookup table to determine which storage node contains a given project. This approach is highly performant (especially once NFS was removed), avoids some of the inherent complexities of running a distributed file system, and keeps the design and its associated components simple and manageable.

The proposal entails switching the storage nodes to the ZFS file system, a mature file system and logical volume manager with snapshot, clone, and asynchronous replication capabilities. ZFS is designed with a focus on data integrity and management simplicity. The adoption of ZFS also affords us independence from specific cloud provider features, which is a significant factor for self-managed installations.

## Addressing Requirements

#### File System Corruption

ZFS implements data-protection features such as end-to-end checksums, data replication, and transactional updates, which ensure data integrity. While the file system on disk is always consistent, ZFS also regularly scrubs disks in search of inconsistencies, most of which are silently repaired using its checksums. ZFS removes the need to run `fsck` against a file system, which on large file systems can take a considerable amount of time.

#### Unwanted Deletions

Regularly and frequently scheduled snapshots can be used to protect against unwanted deletions. Depending on the circumstances of the deletion, data can be recovered selectively by manually copying it from a snapshot to the live file system, or the live file system can be rolled back to a specific snapshot. Snapshots are instantaneous, and prior experience has shown that ZFS can hold hundreds or thousands of snapshots in a single file system (available storage permitting). While cloud providers offer snapshot capabilities, these work below the file system and thus cannot guarantee the consistency of the file system when snapshots are taken (which could be worked around with tooling).

#### Disaster Recovery

ZFS has built-in asynchronous remote-replication capabilities that use rolling snapshots to efficiently send deltas between snapshots. These capabilities can be used to replicate the contents of the storage nodes elsewhere. The initial copy is, of course, expensive, as the file system has to be transferred in its entirety; subsequent iterations are significantly faster.
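As a rough sketch of what this replication cycle could look like (the pool, dataset, and host names below are placeholders, not our actual layout):

```shell
# Initial full copy: snapshot the shard's file system and send it to a
# standby node (pool "tank", dataset "gitaly-shard-01", and host
# "storage-dr-01" are hypothetical names).
zfs snapshot tank/gitaly-shard-01@repl-1
zfs send tank/gitaly-shard-01@repl-1 | ssh storage-dr-01 zfs receive -F tank/gitaly-shard-01

# Subsequent runs send only the delta between the last replicated snapshot
# and a new one, which is significantly cheaper than the initial transfer.
zfs snapshot tank/gitaly-shard-01@repl-2
zfs send -i @repl-1 tank/gitaly-shard-01@repl-2 | ssh storage-dr-01 zfs receive tank/gitaly-shard-01
```

In practice the snapshot rotation and transfers would be driven by scheduled tooling rather than run by hand.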
Geo provides this functionality, but the adoption of ZFS means we can benefit from the experience, and its availability on the storage nodes themselves can be useful in extreme cases.

#### Testing

There is a strong desire to be able to test with production data. ZFS clones can be used to provide this data for testing in a safe fashion as follows: a snapshot of a file system is created, and from this snapshot a clone is instantiated. The live file system and the clone share the data captured in the snapshot. Changes made to the clone (additions, deletions, updates) are stored in the clone itself without affecting the data in the live file system. When testing is completed, the snapshot and the clone can be discarded, returning the storage they consumed during their use.

## Architecture

#### Anatomy of a Storage Node

A storage node is a pool of POSIX-compliant storage with Gitaly running on top of it to perform Git operations against the repositories hosted on that node. At the disk level, repositories are grouped into shards. Currently there is one shard per host, but a Gitaly server can support multiple shards.

A single ZFS storage pool in each storage node would contain one ZFS file system per Gitaly shard. If we ever decided to run multiple shards, each shard would be assigned its own ZFS file system. These file systems share the storage in the underlying pool. A ZFS file system is the basic ZFS operating unit against which snapshots and clones can be created and remote replication performed. These operations are accomplished through the use of the `zpool` and `zfs` commands.

## Implementation Considerations

#### Testing

We should perform proper benchmarking of ZFS on GCP through the use of tools like FileBench, which allow us to model workloads. Additionally, we should:

* Simulate underlying device failures to gather data on resilvering speeds and throttling
* Model unwanted deletions to test our ability to restore selectively or at the file system level
* Set up remote replication to a desired target to measure latencies, lags, and speeds
* Set up snapshots and clones to model the creation and teardown of testing environments
* Set up a parallel staging environment where we can start measuring performance for the database and the staging file servers
* Set up a dedicated shard for a subset of internal repositories before we roll this out for new customers
* Compare latency histograms for real-world workloads from ext4 control and ZFS experimental volumes (proposal: eBPF + the Prometheus eBPF exporter)

#### GitLab.com and Self-Managed

The adoption of ZFS allows us to maintain independence from any cloud provider, so it is entirely feasible to use ZFS in both GitLab.com and self-managed installations. One important aspect to consider is how to migrate existing installations from their current file systems to ZFS.

## Operational Considerations

#### Automation

Chef cookbooks exist for ZFS on Linux (e.g., [chef-zfs_linux](https://github.com/biola/chef-zfs_linux)), and we will likely have to invest in automation to manage ZFS file systems in relation to Gitaly shards. Tooling will be necessary to manage snapshots and clones, especially in relation to testing environments. Project recovery tooling (from snapshots) is also necessary to ease the process.

#### Monitoring

Monitoring ZFS is not unlike monitoring other file systems (latency and iowait being some of the most important metrics to keep track of). Additionally, ZFS will produce events when devices fail. Metrics on scrub and resilver runs should be collected.
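As a minimal sketch of where these metrics could come from (the pool name `tank` is a placeholder, and the exporter integration is an assumption, not a chosen design), the relevant state is exposed by the standard ZFS commands:

```shell
# Pool health, per-device error counters, and the progress/result of the
# most recent scrub or resilver (pool name "tank" is a placeholder).
zpool status -v tank

# Scripted (tab-separated) pool metrics, suitable for feeding a textfile
# collector or exporter.
zpool list -H -o name,health,size,allocated,free tank

# Follow ZFS events (device failures, scrub/resilver start and finish, etc.)
# as they occur.
zpool events -f
```

An existing exporter (for example, the node_exporter ZFS collector) could likely cover much of this rather than building collection from scratch.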
If we use remote replication, lag times are critical to meeting SLAs, which need to be established.

## Additional Considerations

In the past, questions have been raised about the compatibility of ZFS's license, the CDDL, with products licensed under the GPLv2 (such as the Linux kernel). Canonical took the first step towards addressing this issue in 2016, shipping ZFS as a Linux kernel module. Some references regarding this issue include:

* [ZFS Licensing and Linux](http://blog.dustinkirkland.com/2016/02/zfs-licensing-and-linux.html)
* [Are GPLv2 and CDDL incompatible?](https://blog.hansenpartnership.com/are-gplv2-and-cddl-incompatible/)

The Software Freedom Conservancy has an opposing viewpoint on this matter:

* [GPL Violations Related to Combining ZFS and Linux](https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/)

# Alternatives

Alternatives considered include:

#### LVM Snapshots (ext4, XFS)

LVM supports the creation of logical volume snapshots. These take place below the file system level, which requires the file system to be temporarily quiesced so that the snapshot captures it in a consistent state; otherwise, the block-level snapshot may contain in-flight file system updates, yielding a potentially unstable file system. LVM does this automatically, and a number of file systems can also be frozen explicitly with the `fsfreeze` utility. This methodology temporarily suspends I/O to the file system, creating a potential write backlog that the storage subsystem must resolve once the snapshot has been taken and the file system thawed. Snapshot creation is quick, so this should not generally be a significant issue.

As ZFS provides fully integrated volume manager and file system functionality, this process is far simpler and less disruptive. In some respects, ZFS adheres to the principle of building the simplest and most boring solution for the problem at hand, with the additional benefit of providing added functionality (such as end-to-end checksums, clones, and remote replication) that is highly desirable and fully integrated.

#### Ceph

> The Ceph Filesystem (CephFS) is a POSIX-compliant filesystem that uses a Ceph Storage Cluster to store its data. The Ceph filesystem uses the same Ceph Storage Cluster system as Ceph Block Devices, Ceph Object Storage with its S3 and Swift APIs, or native bindings (librados).

Ceph is an entirely different beast from single-host file systems such as XFS or ZFS. In some respects, it creates a monolith at the storage layer through a distributed file system (and Ceph does this in elegant ways). It is far from a simple solution, and some of its core functionality (such as object storage) is simply of no use to GitLab, given that Git requires POSIX semantics. It is a non-trivial solution for something that the Gitaly architecture already solves (project sharding).