---
layout: handbook-page-toc
title: "ZFS Filesystem"
---

## On this page
{:.no_toc .hidden-md .hidden-lg}

- TOC
{:toc .hidden-md .hidden-lg}

## Idea/Problem Statement

### Storage Nodes

We currently have a fleet of 24 storage nodes where Git data is stored, which requires access to a POSIX-compliant file system. These nodes have about 16TB of storage each, and they run on EXT4 file systems.

While we use device snapshots to address potential faults on the storage subsystem of these nodes (e.g., for underlying device failures), we do not have a strategy to protect against other failure modes (e.g., file system corruption, involuntary deletions) or disaster recovery (e.g., complete loss of data, though Geo does cover this). We also lack the ability to use the data in these systems for testing in a safe manner, which is desirable to be able to validate GitLab against its large data set.

### Database Nodes

We also have a fleet of PostgreSQL database nodes that contain the heart of the GitLab application. A better file system is needed to accommodate application-aware, real-time snapshots of the data; quicker transfer of database data to staging and testing systems via a block-transfer method; and increased performance by leveraging tiered caching structures for frequently used data items.

## Design

GitLab's storage architecture for Git data implements a simple and boring approach by using standalone storage nodes running the Gitaly service. At a very basic level, the application uses a project lookup table to determine which storage node contains a given project. This approach is highly performant (especially once NFS was removed), avoids some of the inherent complexities of running a distributed file system, and keeps the design and its associated components simple and manageable.

The proposal entails switching the storage nodes (Git and DB) to use the ZFS file system, which is a mature file system and logical volume manager with snapshot, cache, clone, and asynchronous replication capabilities. ZFS is designed with a focus on data integrity and management simplicity. The adoption of ZFS also affords us independence from specific cloud provider features, which is a significant factor for self-managed installations.

## Addressing Requirements

### File System Corruption

ZFS implements data-protection features such as end-to-end checksums, data replication, and atomic transactional updates, which ensure data integrity. While the file system on disk is always consistent, ZFS regularly scrubs disks in search of inconsistencies, most of which are silently fixed through the use of ZFS checksums. ZFS negates the need to run FSCK against a file system, which in large file systems can take considerable amounts of time (on the order of hours).

### Unwanted Deletions

Regularly and frequently scheduled snapshots can be used to protect against unwanted deletions. Depending on the circumstances of the deletion, data can be recovered selectively by manually copying from a snapshot to the live file system, or the live file system can be rolled back to a specific snapshot. Snapshots are instantaneous, and prior experience has shown ZFS's ability to hold hundreds or thousands of snapshots in a single file system (available storage permitting). While cloud providers offer snapshot capabilities, these work below the file system, and thus cannot guarantee the consistency of the file system when snapshots are taken (which could be worked around with tooling).

ZFS' snapshot capability is based upon Copy on Write (CoW) methods to reduce duplication and snapshot size.
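As an illustration of both recovery paths described above, the sketch below uses standard `zfs` commands; the pool and dataset names (`tank`, `tank/git-data`), the snapshot naming scheme, and the repository path are hypothetical.

```shell
# Hypothetical pool/dataset names and snapshot schedule.

# Take a scheduled snapshot of the Git data file system (instantaneous, CoW-based).
zfs snapshot tank/git-data@hourly-2019-06-01-1200

# List the snapshots available for selective or full recovery.
zfs list -t snapshot -r tank/git-data

# Selective recovery: snapshots are exposed read-only under .zfs/snapshot,
# so a single deleted repository can be copied back into the live file system.
cp -a "/tank/git-data/.zfs/snapshot/hourly-2019-06-01-1200/<repository>.git" \
      "/tank/git-data/"

# Full recovery: roll the live file system back to that snapshot
# (-r also discards any snapshots taken after it).
zfs rollback -r tank/git-data@hourly-2019-06-01-1200
```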
### Disaster Recovery

ZFS has asynchronous remote-replication capabilities built in: through the use of rolling snapshots, it efficiently sends the deltas between snapshots. These capabilities can be used to replicate the contents of the storage nodes elsewhere. The initial copy is, of course, time consuming, as the file system has to be transferred in its entirety; additional iterations are significantly faster. Geo provides this functionality, but the adoption of ZFS implies we can benefit from the experience, and its availability on the storage nodes themselves can be useful in extreme cases.

Snapshots and asynchronous transfers all happen with next to zero overhead on the file node itself, resulting in a more resilient replication strategy without sacrificing performance.

### Testing

There is a strong desire to be able to do testing with production data, be that Git or database driven. ZFS clones can be used to provide this data for testing in a safe fashion as follows: a snapshot of a file system is created, and from this snapshot, a clone can be instantiated. The live file system and the clone share the data contained in the snapshot. The clone stores changes made to the clone itself (additions, deletions, updates) without affecting the data in the live file system. When testing is completed, the snapshot and the clone can be discarded, returning the storage they consumed during their use.

## Architecture

### Base ZFS Node Design

The layout of any ZFS node would consist of an OS pool and a data pool. The OS pool would generally be the single local disk provided for OS installation; the data pool would be built from a local SSD drive and attached Premium Networked SSD storage. The data pool would be configured to use the directly attached SSD as an L2ARC cache tier, allowing ZFS to automatically migrate frequently used blocks to faster storage and increasing performance.

### Anatomy of a ZFS Git Storage Node

A storage node is a pool of POSIX-compliant storage with Gitaly running on top of it to perform Git operations against the repositories hosted on said node. Repositories at the disk level are grouped in shards. Currently, there is one shard per host, but a Gitaly server can support multiple shards.

A single ZFS storage pool in each storage node would contain one ZFS file system per Gitaly shard. If we ever decided to run multiple shards, each shard would be assigned its own ZFS file system. These file systems share the storage in the underlying pool. A ZFS file system is the basic ZFS operating unit against which snapshots and clones can be created, and remote replication accomplished. These operations are accomplished simply through the use of the zpool and zfs commands.

### Anatomy of a ZFS DB Storage Node

A DB node is a pool of ZFS storage with PostgreSQL running on top of it to service the persistent data storage needs of the application. Database nodes would have a volume structure for all of the GitLab data (currently `/var/opt/gitlab`) but would have a defined volume _just_ for the database content. This nested volume structure allows for growth and snapshotting of settings, logs, and data while at the same time providing for quick snapshots of **just** the core PostgreSQL data for DB replication to alternate environments/locations via ZFS asynchronous replication.
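A minimal sketch of this nested layout, and of shipping just the database dataset to another environment, is shown below; the pool and dataset names (`tank/gitlab`, `tank/gitlab/postgresql`) and the target host `db-replica` are hypothetical, and the commands are standard `zfs` operations.

```shell
# Hypothetical names; a sketch of the nested volume structure, not a final layout.

# One dataset for everything under /var/opt/gitlab, with a dedicated child
# dataset for just the PostgreSQL data.
zfs create -o mountpoint=/var/opt/gitlab tank/gitlab
zfs create tank/gitlab/postgresql

# Snapshot only the core database data.
zfs snapshot tank/gitlab/postgresql@base

# Seed a copy elsewhere with a full send, then ship only the deltas afterwards.
zfs send tank/gitlab/postgresql@base | ssh db-replica zfs recv tank/gitlab/postgresql
zfs snapshot tank/gitlab/postgresql@delta1
zfs send -i @base tank/gitlab/postgresql@delta1 | ssh db-replica zfs recv tank/gitlab/postgresql
```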
## Implementation Considerations

### Testing

We should perform proper benchmarking of ZFS on GCP through the use of tools like FileBench, which allow us to model workloads. Additionally, we should:

* Simulate underlying device failures to gather data on resilvering speeds and throttling
* Model unwanted deletions to test our ability to restore selectively or at the file system level
* Set up remote replication to a desired target to measure latencies, lags, and speeds
* Set up snapshots and clones to model the creation and teardown of testing environments
* Set up a parallel staging environment where we can start measuring the performance of the database and the staging file servers
* Set up a dedicated shard for a subset of internal repositories before we roll this out to new customers
* Compare latency histograms for real-world workloads from ext4 control and ZFS experimental volumes (proposal: eBPF + Prometheus eBPF exporter)

### GitLab.com and Self-managed

The adoption of ZFS allows us to maintain independence from any cloud provider, so it is entirely feasible to use ZFS in both GitLab.com and self-managed installations. One important aspect to consider is how to migrate existing installations from their current file systems to ZFS.

## Operational Considerations

### Automation

Chef cookbooks exist for ZFS on Linux (for example, https://github.com/biola/chef-zfs_linux), and we will likely have to invest in automation to manage ZFS file systems in relation to Gitaly shards. Tooling will be necessary to manage snapshots and clones, especially in relation to testing environments. Project recovery tooling (restoring from snapshots) is also necessary to ease that process.

#### Disk Image Creation

Since we are going to be retooling the underlying file system of the node architecture, it only makes sense to look at the way we generate nodes. Today we start from a base Linux image, bootstrap it into Chef, and then build all of our customizations and dependencies per node. This is repetitive work and no longer efficient given our desire for auto-scaled services and an eye towards Kubernetes.

With this project we would like to create a CI/CD pipeline that leverages Packer to produce GCP disk images that are 'feature complete' and ready to be bootstrapped for their intended purposes. This means we would produce a GCP disk image that already has the latest kernel, patches, required software, and configuration, and is ready to be bootstrapped into an individual node role (e.g., sidekiq, web, git).

#### Build Process Improvement

We would also leverage our current Consul installation to provide three distinct variable settings, `bootimage/current`, `bootimage/previous`, and `bootimage/next`, which can be consumed by the Terraform Consul module so that when we update or refresh nodes they can get whatever image they need, up to and including what we're planning on putting in future images for testing.

### Monitoring

Monitoring ZFS is not unlike monitoring other file systems (latency and iowait being some of the most important metrics to keep track of). Additionally, ZFS will produce events when devices fail. Metrics on scrub and resilver runs should be collected. If we were to use remote replication, lag times are critical in meeting SLAs, which need to be established.
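As a starting point for what to collect, the read-only checks below show where ZFS already exposes health, scrub/resilver, and event data; the pool name `tank` is hypothetical, and how this data would be exported to Prometheus is left open.

```shell
# Hypothetical pool name. These are the raw sources a monitoring stack would
# scrape or wrap.

# Pool health, device state, and progress/results of the last scrub or resilver.
zpool status -v tank

# Quick health summary across all pools.
zpool status -x

# Event log, including device failures and checksum errors.
zpool events -v

# Per-dataset space usage, useful for tracking snapshot growth.
zfs list -o name,used,usedbysnapshots,available -r tank
```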
## Additional Considerations

### LVM

LVM supports the creation of snapshots of logical volumes backing file systems such as ext4 and XFS. These snapshots take place below the file system level, which requires that the file system be temporarily quiesced so that the snapshot captures it in a consistent state; otherwise, the block-level snapshot may contain in-flight file system updates, yielding a potentially unstable file system. LVM does this automatically, and a number of file systems can also be frozen with the fsfreeze utility. This methodology temporarily suspends I/O to the file system, creating a potential write backlog that the storage subsystem must resolve once the snapshot has been taken and the file system thawed. Snapshot creation is quick, so this should not generally be a significant issue.

As ZFS provides fully integrated volume manager and file system functionality, this process is far simpler and less disruptive. In some respects, ZFS adheres to the principle of building the simplest and most boring solution for the problem at hand, with the additional benefit of providing added functionality (such as end-to-end checksums, clones, and remote replication) that is highly desirable and fully integrated.

### Ceph / CephFS

> The Ceph Filesystem (CephFS) is a POSIX-compliant filesystem that uses a Ceph Storage Cluster to store its data. The Ceph filesystem uses the same Ceph Storage Cluster system as Ceph Block Devices, Ceph Object Storage with its S3 and Swift APIs, or native bindings (librados).

Ceph is an entirely different beast from single-host file systems such as XFS or ZFS. In some respects, it creates a monolith at the storage layer through a distributed file system (and Ceph does this in elegant ways). It is far from a simple solution, and some of its core functionality (such as object storage) is simply of no use to GitLab, given that Git requires POSIX semantics. It is a non-trivial solution to a problem that the Gitaly architecture already solves (project sharding).