---
layout: handbook-page-toc
title: "Production"
---

<%= partial("handbook/engineering/infrastructure/_common_links.html") %>

## On this page
{:.no_toc .hidden-md .hidden-lg}

- TOC
{:toc .hidden-md .hidden-lg}

# The Production Environment

The GitLab.com production environment comprises _all_ the systems, code, and configuration that operate, or support the operation of, GitLab.com. This includes, but is not limited to:

- customers.gitlab.com
- version.gitlab.com
- ops.gitlab.net
- dashboards.gitlab.net
- log.gprd.gitlab.net
- chef.gitlab.com
- dev.gitlab.org

Production is *not*:

- about.gitlab.com
- staging.gitlab.com
- preprod.gitlab.com
- design.gitlab.com
- forum.gitlab.com

## Site Reliability Teams

The Site Reliability teams are responsible for all of GitLab's user-facing services, most notably GitLab.com. Site Reliability Engineers ensure that these services are available, reliable, scalable, performant and, with the help of GitLab's Security Department, secure. This infrastructure includes a multitude of environments, including staging, GitLab.com (production) and dev.gitlab.org, among others (see the [list of environments](/handbook/engineering/infrastructure/environments/)).

SREs are primarily focused on GitLab.com's availability, with a strong emphasis on building the right toolsets and automation to enable development to ship features as quickly and bug-free as possible, leveraging the tools provided by GitLab (we must dogfood). Another part of the job is building monitoring tools that allow quick troubleshooting as a first step, then turning this into alerts that notify based on symptoms, and then fixing the problem or automating the remediation. We can only scale GitLab.com by being smart and using resources effectively, starting with our own time as the main scarce resource.

## The Production Board

The [**Production Board**](https://gitlab.com/gitlab-com/gl-infra/production/-/boards/1204483) keeps track of the state of production, showing at a glance the incidents, hotspots, changes, and deltas related to production; it also includes on-call reports. For a detailed description of the board, see [production/board](./board).

## GitLab.com

We want to make GitLab.com ready for mission-critical workloads. That readiness means:

1. Speedy ([speed index](/handbook/engineering/performance/#performance-target) below 2 seconds)
1. Available (uptime above 99.95%; see the sketch after this list for the downtime that allows)
1. Durable (automated backups and restores, with regular and frequent verification and validation testing)
1. Secure (prioritize requests of our security team)
1. Deployable (quickly deploy and provide metrics for new versions in all environments)

- View our [Availability and Response Time](http://stats.pingdom.com/81vpf8jyr1h9/4932705/2019/10) metrics month over month
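To make the availability target concrete, here is a rough sketch of the downtime a 99.95% uptime target allows; the 30-day measurement window is an assumption for illustration, not a stated policy.

```python
# Rough error-budget arithmetic for a 99.95% availability target.
# Assumption: availability is measured over a 30-day month.
minutes_per_month = 30 * 24 * 60                     # 43,200 minutes
allowed_downtime = minutes_per_month * (1 - 0.9995)  # the 0.05% we can miss
print(f"~{allowed_downtime:.1f} minutes of downtime per month")  # ~21.6 minutes
```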
### Tenets

1. Security: reduce risk to its minimum, and make the minimum explicit.
1. Transparency, clarity and directness: public and explicit by default, we work in the open, striving for signal over noise.
1. Efficiency: smart resource usage; we should not fix scalability problems by throwing more resources at them, but by understanding where the waste is happening and then working to make it disappear. We should work hard to reduce toil to a minimum by automating all the boring work out of our way.

## How to Get Help

* To see the current SRE on call for urgent requests, issue the GitLab ChatOps command `/chatops run oncall prod`. If an issue already exists, please be ready to provide it to the SRE on call.
* For [high severity issues](/handbook/engineering/infrastructure/team/reliability/incident-management#severities) that require immediate attention, the best way to get help is to use `/pd ` in the `#production` channel on Slack.
* For non-urgent requests, open an issue in the [infrastructure tracker](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues) and mention `@gitlab-com/gl-infra/managers` for scheduling.

## Why `infrastructure` and `production` queues?

### Premise

Long term, additional teams will perform work on the production environment:

* Release Engineering performs deployments on production
* Security performs scans against production
* Google may perform work on the underlying production infrastructure

We cannot keep track of **events** in production across a growing number of *functional* queues. Furthermore, these teams will start to have on-call rotations for both their function (e.g., security) and their services. For people on call, having a centralized tracking point for these events is more effective than perusing various queues. Timely information about the production environment (in terms of when an event is happening and how long it takes for an on-call person to understand what is happening) is critical. The `production` queue centralizes production event information.

### Implementation

Functional queues track team workloads (`infrastructure`, `security`, etc.) and are the source of the work that has to get done. Some of this work clearly impacts production (build and deploy new storage nodes); some of it will not (develop a tool to do x, y, z) until it is deployed to production.

The `production` queue tracks events in production, namely:

* [changes](/handbook/engineering/infrastructure/change-management/)
* [incidents](/handbook/engineering/infrastructure/team/reliability/incident-management/)
* deltas (exceptions) -- still need to do handbook write up

Over time, we will implement hooks into our automation to *automagically* inject change audit data into the `production` queue. This also leads to a single source of data. Today, for instance, incident reports for the week get transcribed to both the On-call Handoff and Infra Call documents (we also show exceptions in the latter). These meetings serve different purposes but have overlapping data. The input for this data should be queries against the `production` queue rather than manually built documents. Additionally, we need to keep track of error budgets, which should also be derived from the `production` queue.
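As a rough illustration of such a hook, the sketch below opens a `~Change` issue in the `production` queue through the GitLab API. This is not the actual implementation: the token environment variable, the function name, and the example issue contents are placeholders.

```python
# Hypothetical sketch of a change-audit hook: automation opens a ~Change issue
# in the production queue when it performs a change. Placeholder values only.
import os
import requests

GITLAB_API = "https://gitlab.com/api/v4"
PRODUCTION_PROJECT = requests.utils.quote("gitlab-com/gl-infra/production", safe="")

def record_change(title: str, description: str) -> dict:
    """Open a ~Change issue describing an automated change."""
    response = requests.post(
        f"{GITLAB_API}/projects/{PRODUCTION_PROJECT}/issues",
        headers={"PRIVATE-TOKEN": os.environ["GITLAB_API_TOKEN"]},
        data={"title": title, "description": description, "labels": "Change"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

# Example: record_change("Change: reconfigure pgbouncer pools", "Automated change audit entry")
```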
We will also be collapsing the `database` queue into the `infrastructure` queue. The database is a special piece of the infrastructure for sure, but so are the storage nodes, for example.

For the on-call SRE, every event that pages (where an event may be a group of related pages) *should* have an issue created for it in the `production` queue. Per the [severity](#severity) definitions, if there is at least *visible* impact (functional inconvenience to users), then it is by definition an incident, and the Incident template should be used for the issue. This is likely to be the majority of pager events; exceptions are typically obvious, i.e. they impact only us and customers won't even be aware, or they are pre-incident alerts that we act on to avoid an incident.

### Security Related Changes

All direct or indirect changes to the authentication and authorization mechanisms that GitLab Inc. provides to customers or employees require additional review and approval by a member of at least one of the following teams:

* [production team](/handbook/engineering/infrastructure/production/) member
* [security team](/security/) member
* developer from a different team that is staff level or higher

This process is enforced for the following repositories, where the approval is made mandatory using [MR approvals](https://docs.gitlab.com/ee/user/project/merge_requests/merge_request_approvals.html):

* [gitlab-oauth2-proxy](https://gitlab.com/gitlab-cookbooks/gitlab-oauth2-proxy)
* [gitlab_users](https://gitlab.com/gitlab-cookbooks/gitlab_users)

Additional repositories may also require this approval and can be evaluated on a case-by-case basis.

When should we loop the security team in on changes? If we are making major changes to any of the following areas:

1. Processing credentials/tokens
1. Storing credentials/tokens
1. Logic for privilege escalation
1. Authorization logic
1. User/account access controls
1. Authentication mechanisms
1. Abuse-related activities

#### Type Labels

Type labels are very important. They define what kind of issue this is. Every issue should have one or more.

| Label       | Description |
|-------------|-------------|
| `~Change`   | Represents a change to the infrastructure; see [Change Management](/handbook/engineering/infrastructure/change-management/) for details |
| `~Incident` | Represents an incident on the infrastructure; see [Incident Management](/handbook/engineering/infrastructure/team/reliability/incident-management/) for details |
| `~Database` | Label for problems related to the database |
| `~Security` | Label for problems related to security |

#### Services

The list of services is maintained in the service catalog: https://gitlab.com/gitlab-com/runbooks/blob/master/services/service-catalog.yml
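For scripting against the catalog, a minimal sketch of reading it is shown below; the field names used (`services`, `name`, `label`) are assumptions made for illustration, so check `service-catalog.yml` itself for the actual schema.

```python
# Minimal sketch of listing services from the service catalog.
# Assumption: the catalog has a top-level `services` key whose entries carry
# `name` and `label` fields; verify against service-catalog.yml itself.
import yaml

with open("services/service-catalog.yml") as catalog_file:
    catalog = yaml.safe_load(catalog_file)

for service in catalog.get("services", []):
    print(service.get("name"), service.get("label"))
```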
#### Services Criticality Labels

Service criticality labels define how critical a service is, and how critical a change to it would be, in terms of the impact on the user experience in case of a failure (e.g., `~C1` for the PostgreSQL or Redis primaries). Since most services can reach different levels of criticality, we use the highest one, and the change template prescribes actions depending on the criticality:

| Label | Description |
|-------|-------------|
| `~C1` | Vital service and a single point of failure; if it is down, the application is down |
| `~C2` | Important service; if it is down, some functionality will not be available in the application |
| `~C3` | Service where the loss of an instance, or of the whole service, can cause performance degradation |
| `~C4` | Service that can be put in maintenance mode, or whose loss would not affect the performance of the application |

#### Service Redundancy Level

The service redundancy level helps us identify which services have failover available or some other mechanism of redundancy.

| Label | Description | E.g. |
|-------|-------------|------|
| `~R1` | The loss of a single instance will affect all users | A PostgreSQL or Redis instance |
| `~R2` | The loss of a single instance will affect a subset of users | A Gitaly instance |
| `~R3` | The loss of a single instance would not affect any user | A Grafana instance |

#### Other Labels

We use some other labels to indicate specific conditions and then measure the impact of these conditions on production or on the production engineering team. This is especially important for understanding where the production engineering team invests its time, for reducing toil, and for reducing the chance of a failure caused by accessing production more often than necessary.

Labels that are particularly important for gathering data are:

- `~toil`: repetitive, boring work that should be automated away.
- `~unscheduled`: an issue that became an interruption to the team and had to be handled within a milestone; it is unplanned work.
- `~unblocks others`: an issue that allows some other part of the company to deliver something.

### Always Help Others

We should never stop helping and unblocking team members. To this end, data should always be gathered to assist in highlighting areas for automation and the creation of self-service processes. Creating an issue from the request with the proper labels is the first step. The default should be that the person requesting help creates the issue, but we can help with that step too if needed. If the issue is urgent for whatever reason, we should label it following the instructions above and add it to the ongoing milestone.

## On-Call Support

For details about managing schedules, workflows, and documentation, see the [on-call documentation](/handbook/on-call/) and the [on-call runbook checklist](https://gitlab.com/gitlab-com/runbooks/blob/master/on-call/checklists/eoc.md).

### SLA for paging On-Call team members

When an on-call person is paged, either via the `/pd` command in Slack or by the automated monitoring systems, the SRE has a 15 minute SLA to acknowledge or escalate the alert. This is also noted in the [On-Call section of the handbook](/handbook/on-call/).

Because GitLab is an asynchronous-workflow company, @mentions of on-call individuals in Slack are treated like normal messages, and no response SLA is attached to them. This is also because phone notifications from Slack have no escalation policies, whereas PagerDuty has policies that team members and rotations can configure to make sure an alert is escalated when nobody has acknowledged it. If you need to page a team member from Slack, you can use `/pd "your message to the on-call here"` to send an alert to the team members currently on call.

### On-Call escalation

Given the number of systems and services that we use, it is very hard, if not impossible, to reach an expert level in all of them. What makes it even harder is the rate of change in our infrastructure. For this reason, the person on call is not expected to know everything about all of our systems. In addition, incidents are often complex and vague in nature, requiring different perspectives and ideas for solutions. Reaching out for help is considered good practice and should not be mistaken for incompetence. Asking for help while following the escalation guidelines and checklists can surface information and result in faster resolution of problems.
It also improves the knowledge of the team as a whole when, for example, an undocumented problem is covered in runbooks after an incident, or when questions are asked in Slack channels where others can read them. This is true for on-call emergencies as well as project work. You will not be judged on the questions you ask, regardless of how elementary they might be.

The SRE team's primary responsibility is the availability of GitLab.com. For this reason, helping the person on call should take priority over project work. This doesn't mean that for every single incident the entire SRE team should drop everything and get involved. However, it does mean that team members with knowledge and experience in a field relevant to a problem should feel entitled to prioritize it over project work. Previous experience has shown that, as an incident's severity increased or potential causes were ruled out, more and more people from across the company got involved.

## Production Events Logging

There are two kinds of production events that we track:

- Changes to the production fleet: for this we record things [in the Chef Repo](https://dev.gitlab.org/cookbooks/chef-repo).
  - Deploys will be recorded automagically because of the way we do deploys.
  - General operations can be recorded by creating an empty commit in the repo and pushing it to origin (see the sketch after this list).
- Outages and general production incidents:
  - If we are required to act manually in production to perform any operation, we should create an issue and consider labeling it as _toil_ to track the cost of such manual work.
  - If we had a disruption in the service, we must create a blameless root cause analysis. Refer to the [Blameless Root Cause Analyses page](/handbook/customer-success/professional-services-engineering/workflows/internal/root-cause-analysis.html).
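For the "general operations" case, this amounts to an empty commit followed by a push; here is a minimal sketch, in which the chef-repo checkout path and the example message are assumptions.

```python
# Minimal sketch of recording a general operation as an empty commit in the
# chef-repo and pushing it to origin. The checkout path is an assumption.
import os
import subprocess

def record_operation(message: str, repo_path: str = "~/src/chef-repo") -> None:
    """Create an empty commit describing the operation and push it to origin."""
    repo = os.path.expanduser(repo_path)
    subprocess.run(["git", "commit", "--allow-empty", "-m", message], cwd=repo, check=True)
    subprocess.run(["git", "push", "origin", "HEAD"], cwd=repo, check=True)

# Example: record_operation("Restarted pgbouncer on a patroni node to clear stuck connections")
```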
### Incident Subtype - Abuse

For some incidents, we may determine that the usage patterns that led to the issues were abuse. There is a process for how we define and handle abuse.

1. The definition of abuse can be found in the [security abuse operations section of the handbook](../../security/#abuse-operations).
1. In the event of an incident affecting GitLab.com availability, the SRE team may take action immediately to keep the system available. However, the team must also immediately involve our security abuse team. A new [security on-call rotation](/handbook/engineering/security/#engaging-the-security-on-call) has been established in PagerDuty: there is a Security Responder rotation that can be alerted, along with a Security Manager rotation.

## Backups

### Summary of Backup Strategy

- Backups of our databases are taken every 24 hours, with continuous incremental data (at 60-second intervals) streaming into a separate cloud service from the production fleet. These backups are encrypted.
- Backups of our filesystems are taken via GCP snapshots every 24 hours.
- Both database and filesystem backups are kept for 2 weeks on a rolling basis.

For details see the runbooks, in particular regarding [GCP snapshots](https://gitlab.com/gitlab-com/runbooks/blob/master/howto/gcp-snapshots.md) and [database backups using WAL-E (encrypted)](https://gitlab.com/gitlab-com/runbooks/blob/master/howto/using-wale-gpg.md).

## Patching

### Policy

All servers in the production environment managed and maintained by the GitLab infrastructure team will be proactively maintained and patched with the latest security patches.

### Summary of Patching Strategy

All production servers managed by Chef have a base role that configures each server to install and use [`unattended-upgrades`](https://ops.gitlab.net/gitlab-cookbooks/chef-repo/blob/8c522363bde0248f6d66adae0d1b6c233d31d261/roles/gprd-base.json#L31-42) for automatically installing important security packages from the configured apt sources.

`unattended-upgrades` checks for updates every day between 6 and 7 am UTC. The time is randomized to avoid hitting the mirrors at the same time. All output is logged to `/var/log/unattended-upgrades/*.log`.

Unattended upgrades are configured to automatically apply all security upgrades for packages, with the exception of the GitLab omnibus package. Critical security releases for GitLab are deployed to GitLab.com by a separate process. You can read more about that process in the [release docs](https://gitlab.com/gitlab-org/release/docs/blob/master/general/security/process.md#critical-security-releases).

### Patching Validation

Currently, validation can be done manually by cross-examining the logs of the host with the scans done by Tenable.

## Penetration Testing

Infrastructure will provide support to the [security team](../../security) for issues found during penetration testing. Requests to coordinate a penetration test, or any procedures to address and remediate vulnerabilities, should be communicated to the infrastructure team through an issue in the [infrastructure issue tracker](https://gitlab.com/gitlab-com/infrastructure/issues/). In the issue, please provide the following:

* the scope of the testing
* a suggested time frame
* the depth of testing
* which services will be tested
* the procedures being done
* any teams that may be affected (such as support, security, etc.)

Please tag issues with the `~security` label and `/cc` the infrastructure managers.