---
layout: handbook-page-toc
title: "CICD Pipeline for GitLab.com"
---

## On this page
{:.no_toc .hidden-md .hidden-lg}

- TOC
{:toc .hidden-md .hidden-lg}

# Idea/Problem Statement

Release managers initiate deploys manually from the command line using takeoff. The current process is that we build a release candidate or official build and deploy it to three different stages in sequence:

- Staging
- Production Canary Stage
- Production Main Stage

For each stage, developers run manual tests and GitLab-QA. If there are no errors reported and manual tests pass, the release continues to the next stage.

Below is a timeline of the 10.3 release to production. From this, it is clear that we deploy at different intervals and that not all release candidates make it to production. In some cases, problems are not observed until we see a large amount of production traffic, which requires patching production or rolling back the release.

## Release timeline of 10.3

_Staging and canary deployments omitted_

- RC1: Sept 3rd
- RC2: Sept 5th
- RC3: Sept 6th
- RC4: Sept 7th
- RC6: Sept 11th
- RC8: Sept 17th
- RC9: Sept 18th
- RC10: Sept 19th
- RC11: Sept 20th
- 10.3: Sept 22nd

## Current shortcomings

There are a number of shortcomings in the current release process:

* Release management is time intensive because deploying to staging, canary, and production is initiated manually.
* Large sets of changes see production traffic at once, sometimes making it difficult to pinpoint which changes are causing issues.
* The staging environment is useful for GitLab-QA and manual testing, but it does not receive any continuous traffic, which can make it difficult to spot performance regressions before release candidates land on production.

# Design

The goal of this design proposal is to replace deploying RCs manually to each deployment stage at predefined intervals. A CICD pipeline is constructed that continuously deploys nightly builds to canary. Once the deployment of a nightly build to canary is complete, the canary fleet receives a small percentage of production traffic, and the build is then promoted to GitLab.com.

## Goals

The goals of this design are incremental and align with the [CICD blueprint](/handbook/engineering/infrastructure/blueprint/ci-cd/):

* Use GitLab CICD for deployments from [https://ops.gitlab.net](https://ops.gitlab.net) that can be driven with GitLab ChatOps, while also ensuring that there are no CICD or tooling dependencies on GitLab.com.
* Deploy nightly builds to the production canary stage in a CICD pipeline.
* Create CICD stages that validate each deploy step. These checks include:
  * running GitLab-QA tests
  * checking for alerts on the stage
* With a set of runners, run traffic on the staging and production canary stages.
* Report pipeline metrics to Prometheus with a push gateway (see the sketch after this list).
* Initiate database migrations on production for every deployment to the production canary stage.
* Promote nightly builds from canary to production, or push the official build through the pipeline for the self-managed omnibus release on the 22nd.
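As a rough illustration of the metrics-reporting goal, the sketch below pushes deploy metrics from a pipeline job to a Prometheus Pushgateway using the Python `prometheus_client` library. The gateway address, job name, metric names, and `environment` label are placeholders for illustration, not details specified by this design.

```python
# Sketch: push pipeline metrics to a Prometheus Pushgateway.
# The gateway address, job name, metric names, and label values are
# illustrative placeholders; the real pipeline would supply its own.
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()

deploy_duration = Gauge(
    'deploy_duration_seconds',
    'Wall-clock time taken by the deploy job',
    ['environment'],
    registry=registry,
)
last_success = Gauge(
    'deploy_last_success_timestamp_seconds',
    'Unix timestamp of the last successful deploy',
    ['environment'],
    registry=registry,
)

start = time.time()
# ... run the actual deploy step here ...
deploy_duration.labels(environment='staging').set(time.time() - start)
last_success.labels(environment='staging').set_to_current_time()

# Grouped under a single job name so subsequent pushes replace earlier values.
push_to_gateway('pushgateway.example.com:9091', job='omnibus-deploy', registry=registry)
```

Pushing from the job (rather than being scraped) fits short-lived CI jobs; Prometheus then scrapes the Pushgateway as usual, so the existing alerting infrastructure can consume these metrics.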
### Tasks

Below is a draft set of issues that would be in the epic for implementing this design.

#### Issues that are defined or in progress

* [Automation for deployments driven from GitLab ChatOps](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4739)
* [Drain and remove instances in batches from HAProxy before operating on them](https://gitlab.com/gitlab-org/takeoff/issues/89)
* [canary.gitlab.com should use the corresponding backends for api/git traffic](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5232)
* [Create an internal API endpoint for canary](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5136)

#### Issues that need scoping

* A process for rolling back a single stage update or an entire release from production.
* [Create a canary sidekiq cluster](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5137)
* Create deployments for internal consumption: this is necessary to quickly release undisclosed security updates to GitLab.com.
* Continuous traffic against canary and staging.
* Fast patches/releases to production to address high severity issues like security vulnerabilities or site degradation.
* Initiate the promotion of canary to production from GitLab, possibly with ChatOps.
* Add alert checking to the CICD pipeline.
* Add GitLab-QA to pipeline stages.
* Report metrics from the CICD pipeline to the Prometheus Pushgateway.
* GitLab ChatOps command to control weights on the canary stage, which controls how much traffic is directed to it.

### Criticism: Design anti-goals, what this doesn't cover

GitLab.com should be moving in a direction that utilizes a container deployment strategy and dogfoods the cloud native product we are creating for customers. This design is meant to be compatible with the omnibus methods of installation and does not include a container migration strategy, although it should be considered as a next step in that direction. Overall, this design proposal focuses on work that is an incremental change to our current infrastructure and process. Any work done in line with this proposal will be weighed against the goal of container-based deployments, and such work will be prioritized.

Specifically, this design does not require any of the following:

* Removing the omnibus package as a deploy dependency
* Migrating services to Kubernetes or using Kubernetes for deployment orchestration
* Using pre-built images and auto-scaling
* Blue/green deployments beyond what is currently possible with [canary](/handbook/engineering/infrastructure/design/canary/)

It does not preclude these items; rather, it allows for a later transition away from non-container deploys. This design does, however, make some improvements that will be helpful with the longer term goal of creating pipeline(s) for continuous container deployments:

* Instrumenting CICD for checks against active alerts
* Instrumenting CICD that incorporates GitLab-QA
* Adding generated traffic to the non-production stages and the production canary stage
* Starting to integrate with ChatOps for deployments
* Smaller changes that are automatically deployed up to canary, for internal use
* Automatic migrations daily, resulting in more frequent and smaller database updates

## Rollbacks and Patching

In order to safely deploy continuously to canary, there also needs to be a way to safely roll back and deliver fast patch updates.
This design proposes three different approaches:

* Rollbacks to environments that deploy in reverse order on a deployment stage: rollbacks that are repeatable, safe, and have the same impact as upgrades.
* Fast updates to environments: in some circumstances expedience trumps availability. Updates may need to be applied quickly when there are critical security vulnerabilities or serious performance degradation. The update should be applied fast and with minimal impact, but may result in some errors or dropped connections.
* Fast rollbacks to production: in the case of a serious release regression, an environment may also need to be rolled back quickly. The rollback should be applied fast and with minimal impact, but may result in some errors or dropped connections.

## Testing

Testing will be an integral part of the deploy pipeline. For this reason, testing at every deploy step is included in the scope of this design. This testing will use a combination of GitLab-QA acting as a gate for pipeline stages, and continuous traffic on the non-production stages and the production canary stage.

This allows us to use our existing alerting infrastructure on the staging and canary stages so that regressions can be spotted early, before the changes reach production. Each CICD step will have the ability to check for outstanding alerts before continuing to the next stage.

### Generating artificial load on the non-production stages

In order to ensure that we can detect performance regressions, it will be useful to generate artificial load. This design does not go into the details of how this is implemented; some proposals so far have been:

* Using siege to scrape a predefined set of endpoints
* Using a subset of GitLab-QA tests in a fleet of runners, running continuously
* [Large staging collider](https://gitlab.com/gitlab-com/large-staging-collider) for generating load

## Architecture

### Current deployments

A deployment orchestration tool is necessary that can drain servers from HAProxy, run apt installs of the omnibus package, and restart/HUP services after install. Currently this is done with [takeoff](https://gitlab.com/gitlab-org/takeoff).

The current sequence of deployment to an environment is:

- Stop Chef.
- Update the version role in Chef for the environment we are deploying to.
- Deploy the omnibus package to the deploy node.
- Run migrations on the deploy node.
- Deploy to Gitaly (`apt-get install gitlab-ee` and restart Gitaly).
- Deploy to the rest of the fleet, in parallel by role as is done currently (`apt-get install gitlab-ee` and restart the corresponding service).
- Start Chef.

### Post-deployment patches

In addition to the normal release process of omnibus builds, the production team employs post-deployment patches, a way to quickly patch production for high severity bugs or security fixes. Post-deployment patches bypass validation and exist outside of the normal release process. The reason for this is to quickly deploy a change for a critical security fix, a high severity bug, or to mitigate a performance issue. The assumption is that once a post-deployment patch is deployed, changes deployed to canary will be halted until the patch(es) are incorporated into an omnibus build.

### CICD Design

The CICD approach for omnibus is divided into two pipelines: one that continuously deploys to canary using a nightly build, and another that deploys from canary to the rest of the fleet. At any time during the deployment, if GitLab-QA fails or if there are any alerts, the pipeline is halted. The second pipeline may be initiated automatically, or on demand, when there is confidence in the nightly build on canary.

Production traffic to canary is controlled with GitLab ChatOps by setting the server weights. This allows us to increase the amount of production traffic on canary at any time, so that we have more confidence in application changes before they reach the wider community. A sketch of the underlying weight-setting primitive is shown below.
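The following is a minimal sketch of that primitive (not the ChatOps command itself): it adjusts a canary server's weight through the HAProxy runtime API over the admin socket. The socket path, backend name, and server name are assumptions made for illustration.

```python
# Sketch: change how much traffic HAProxy sends to a canary server by
# setting its weight via the runtime API (admin socket). The socket path,
# backend name ("canary_web"), and server name ("canary-01") are placeholders.
import socket


def haproxy_command(sock_path: str, command: str) -> str:
    """Send one command to the HAProxy admin socket and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        sock.sendall(command.encode() + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks).decode()


def set_canary_weight(weight: int, sock_path: str = "/run/haproxy/admin.sock") -> str:
    # "set server <backend>/<server> weight <value>" is the HAProxy runtime
    # API command for changing a server's load-balancing weight on the fly.
    return haproxy_command(sock_path, f"set server canary_web/canary-01 weight {weight}")


if __name__ == "__main__":
    # Divert a small share of traffic to canary; a weight of 0 removes it again.
    print(set_canary_weight(5))
```

In practice a ChatOps command would presumably apply the same change across all canary backends (web, api, git) and their servers rather than a single one, as covered by the scoping issues above.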
### Pipeline Diagram

![CICD omnibus pipeline](img/cicd-omnibus.png)

#### Deployment stages to Canary

* Stage 1: Migrations on staging
  * Check for outstanding alerts on staging; do not start if there are any critical alerts active (a sketch of this alert gate follows the stage lists).
  * Run migrations from a deploy host.
  * Check for outstanding alerts.
  * If there are no alerts after an interval of time, continue to the next stage.
* Stage 2: Deploy to staging Gitaly
  * Deploy to the Gitaly fleet.
  * Run GitLab-QA against staging.gitlab.com.
  * Check for outstanding alerts.
  * If there are no alerts after an interval of time, continue to the next stage.
* Stage 3: Deploy to the staging fleet
  * Deploy to the remaining fleet; nodes are drained and removed from the load balancer as they are deployed.
  * Run GitLab-QA against staging.gitlab.com.
  * Check for outstanding alerts.
  * If there are no alerts after an interval of time, continue to the next stage.
* Stage 4: Run post-deployment migrations on staging
  * Run post-deployment migrations.
  * Check for outstanding alerts.
  * If there are no alerts after an interval of time, continue to the next stage.
* Stage 5: Migrations on production
  * Check for outstanding alerts on production; do not start if there are any critical alerts active.
  * Run migrations from a deploy host.
  * Run GitLab-QA against GitLab.com.
  * Check for outstanding alerts.
  * If there are no alerts after an interval of time, continue to the next stage.
* Stage 6: Deploy to the production canary fleet
  * Ensure that no production traffic is diverted to the canary fleet by setting the canary weights to zero.
  * Deploy to the canaries. While we operate on each node, it is drained and removed from the load balancer.
  * Run GitLab-QA against canary.gitlab.com.
  * Check for outstanding alerts.
  * If there are no alerts after an interval of time, pass the pipeline.

------

#### Deployment stages from Canary to Production

* Stage 7: Deploy to the production Gitaly fleet
  * Check for outstanding alerts on production; do not start if there are any critical alerts active.
  * Using backend server weights, divert some production traffic to canary.
  * Check for outstanding alerts on production; do not continue if there are any critical alerts active.
  * Deploy the version on canary to the production Gitaly servers.
  * Run GitLab-QA against GitLab.com.
  * Check for outstanding alerts.
  * If there are no alerts after an interval of time, continue to the next stage.
* Stage 8: Deploy to the remaining production fleet
  * Check for outstanding alerts.
  * Deploy to the remaining production fleet; nodes are drained and removed from the load balancer as they are deployed.
  * Run GitLab-QA against GitLab.com.
  * Check for outstanding alerts.
  * If there are no alerts after an interval of time, continue to the next stage.
* Stage 9: Run post-deployment migrations
  * Check for outstanding alerts.
  * Run post-deployment migrations.
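As a rough sketch of the recurring "check for outstanding alerts" gate referenced in the stages above, the snippet below queries a Prometheus Alertmanager for firing alerts and fails the job when any critical alert is active. The Alertmanager URL and the `severity: critical` label convention are assumptions, not details specified by this design.

```python
# Sketch: gate a pipeline stage on outstanding alerts by querying the
# Prometheus Alertmanager v2 API. The Alertmanager URL and the severity
# label convention are illustrative assumptions.
import sys

import requests

ALERTMANAGER_URL = "https://alertmanager.example.com"  # placeholder


def critical_alerts():
    """Return the currently firing alerts labelled severity=critical."""
    resp = requests.get(f"{ALERTMANAGER_URL}/api/v2/alerts", timeout=10)
    resp.raise_for_status()
    return [
        alert
        for alert in resp.json()
        if alert.get("status", {}).get("state") == "active"
        and alert.get("labels", {}).get("severity") == "critical"
    ]


if __name__ == "__main__":
    firing = critical_alerts()
    for alert in firing:
        print(f"critical alert firing: {alert['labels'].get('alertname', 'unknown')}")
    # A non-zero exit fails the CI job, halting the pipeline before the next stage.
    sys.exit(1 if firing else 0)
```

In the pipeline this check would presumably run as its own job at each "check for outstanding alerts" step, retried over an interval to implement the "no alerts after an interval of time" condition.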