---
layout: markdown_page
title: "Blueprint: Postgres query optimization bot to boost backend development process"
---

## On this page
{:.no_toc}

- TOC
{:toc}

## Overview

This blueprint addresses the following problems:

- how database-related changes created by GitLab developers are verified,
- how reviewing MRs from the database (Postgres) perspective is organized,
- SQL query micro-optimization in general, including work on existing and new SQL code and ActiveRecord (Ruby on Rails) code.

SQL query micro-optimization is the process of optimizing an SQL query (or the corresponding Ruby on Rails code), involving only one or a few SQL statements, usually executed one by one. The goal of such a process is to achieve better performance and scalability for particular SQL queries without significantly degrading these characteristics for all other SQL queries executed in the database. (In contrast, SQL query macro-optimization addresses the top-down performance and scalability analysis of a particular database and its workload, usually done using statistics-collecting tools and/or log analysis. This kind of optimization is beyond the scope of this article.) SQL micro-optimization involves changes in the index set, query rewriting, DB schema redesign, and, less often, local changes in database configuration parameters.

Further, we discuss the drawbacks of the current (as of June 2019) SQL micro-optimization workflow used in GitLab development, such as:

- a higher likelihood of performance and scalability issues remaining undetected before deployment to production,
- complications in verifying and troubleshooting the performance of UPDATE/DELETE/INSERT queries,
- complications in verifying index ideas,
- delays in MR approval.

We propose a new approach that is a successor to the existing one. In particular, we propose switching to a new Slack bot called "Joe bot", built on top of a special framework, "Database Lab". Working in tandem, Joe bot and Database Lab are capable of performing a series of `EXPLAIN (ANALYZE, BUFFERS)` requests on a full-sized fresh production DB clone provisioned within a few seconds, and they allow working with any type of query, including modifying ones. Because it operates on an independent cloned database, Joe also allows changing the database schema in any way, creating indexes, and resetting the database to its initial state within a few seconds, ready for a new micro-optimization session. Both Joe bot and Database Lab are open source projects developed and maintained by [Postgres.ai](https://postgres.ai), and adapted for use in GitLab.com infrastructure by [Nikolay Samokhvalov](https://gitlab.com/NikolayS/).

## Analysis

When code changes involve a multi-terabyte database, it is not uncommon to have the following problem. A developer has an SQL query (or the corresponding ActiveRecord code) that is known to be problematic, or that is new, and they want to check its performance and Postgres query optimizer behavior at production scale, but it is impossible because of one or more of the following reasons (and sometimes most or even all of them):

1. The query is time/resource-consuming and might negatively affect the production system's performance.
1. The query is a modifying one, changing the database state, in extreme cases changing it heavily (and resulting in a lot of I/O operations) or even not being transaction-safe (such as `TRUNCATE`).
1. Some DDL (example: `CREATE INDEX`) needs to be executed, and doing so on production just for development and testing purposes seems exceptionally unreasonable.
1. Most developers do not have access to the production database servers or have only read-only access.

### The Existing Solution

GitLab employees can use a [special tool](https://docs.gitlab.com/ee/development/understanding_explain_plans.html#chatops), available in Slack: the `/chatops run explain` command. It works great and partially solves the last problem in the list above (providing developers indirect read-only access to production databases), but does not address any of the others. It works with one of the actual production replicas, so heavy, prolonged, and/or modifying operations are not allowed at all. All this brings us to a position where, even for a simple check, developers need to ask DBREs for help, which becomes a significant bottleneck in the development process.
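For illustration, checking a read-only query with the existing tool looks roughly like this (an illustrative sketch; the query is hypothetical, and the exact usage is described in the linked documentation):

```
/chatops run explain SELECT COUNT(*) FROM issues WHERE project_id = 1
```

The bot replies with the query plan obtained from a production replica. Because execution is read-only and constrained by timeouts, this workflow cannot be used to try out a new index, a data migration, or any other change to the database state.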
### The Staging Environment: Single, Differs from Production

A separate problem is that MRs involving database-related changes are often not verified on production-scale data volumes at all, because such checks are complex and imply an enormous amount of manual DBRE effort.

Currently (June 2019), the staging database size is only ~0.5 TiB ([postgres-checkup report for GitLab.com staging](https://gitlab.com/gitlab-com/gl-infra/production/snippets/1866829)), while the production DB exceeds ~3 TiB ([postgres-checkup report for GitLab.com production](https://gitlab.com/gitlab-com/gl-infra/production/snippets/1862519)). This means that table and index sizes on the staging DB nodes are significantly smaller than those on the production DB nodes, so SQL query behavior differs significantly (the planner might choose very different plans, timings are smaller, and so on).

Additionally, as of June 2019, there is only a single staging environment, so it is not trivial to test the same modifying change multiple times, and it is extremely difficult, if not impossible, to parallelize various kinds of checks (for example, verifying various MRs in parallel).

### Problem Statement

1. As of June 2019, most MRs are **never** verified on a full-size DB copy. Database code reviews are done based only on the code check, without analyzing SQL statements on production-scale data sets.
1. Most of the time, engineers develop changes using only tiny databases.
1. Database-related review processes are often slowed down. Among the main reasons:
    - too few active database reviewers, and they are too busy (June 2019),
    - lack of data, leading to multiple iterations of questions and answers.
1. Checking modifying SQL statements and migrations is difficult for a DBRE, and impossible for a developer:
    - usually, there is no full-size DB instance ready for that,
    - during development, the statement/migration in question needs to be analyzed multiple times, in iterations. Without special means (the ability to quickly start from scratch), this becomes extremely difficult, often leading to mistakes and slowing down the development process.

## The Proposed Solution

I propose using a new generation of Slack bot, providing any engineer with the ability to do SQL query micro-optimization on a full-size production clone, running any kind of query without short timeout limits, with the ability to quickly restore the database state after any changes and without interfering with other developers.

My team ([Postgres.ai](https://postgres.ai)) implemented these ideas in two open-source projects written in Go:

1. [Joe bot](https://gitlab.com/postgres-ai/joe),
1. [Database Lab](https://gitlab.com/postgres-ai/database-lab).

Joe is integrated with Slack and supports the following commands (preferably in a public channel), as sketched in the example below:

- `explain` – run a query on the DB and see a detailed plan with performance advice,
- `exec` – run a DDL statement (create tables, indexes, etc.),
- `reset` – restore the state of the DB instance (discard all changes),
- `status` – get information about the current session and query status,
- `\d`, `\d+`, `\dt`, `\dt+`, `\di`, `\di+`, `\l`, `\l+`, `\dv`, `\dv+`, `\dm`, `\dm+` — psql meta-information commands.
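Put together, a micro-optimization session with Joe might look roughly like the following sketch (the table, column, and index names are hypothetical; only the command names listed above are taken from the actual interface):

```
explain SELECT COUNT(*) FROM issues WHERE project_id = 1 AND state = 'opened'
-- Joe returns the EXPLAIN (ANALYZE, BUFFERS) plan from the full-size clone

exec CREATE INDEX idx_issues_project_id_state ON issues (project_id, state)

explain SELECT COUNT(*) FROM issues WHERE project_id = 1 AND state = 'opened'
-- compare the new plan, timing, and buffer numbers with the previous run

reset
-- discard the index and return the clone to its initial state
```

Because each session runs on an independent clone, such an iteration can be repeated as many times as needed without affecting production or other developers.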
![Joe Demo](demo.gif){: .shadow}

Currently (June 2019), the `reset` command is based on ZFS snapshots. Joe now supports AWS, but GCP support would be implemented similarly. Using GCP snapshots instead of ZFS may be an option too.

This approach can help to:

- reduce the delays analyzed in the Analysis section,
- provide both MR authors and database reviewers with useful information about database changes and performance, helping them optimize their queries and migrations,
- avoid cases when database-related changes are deployed to production without being verified against full-sized data sets.

Estimated time needed for the PoC implementation: 2 weeks. Resources needed:

- an `n1-highcpu-64` or `n1-highcpu-32` GCP instance with a copy of the GitLab.com production database (available using the [gitlab-restore project](https://ops.gitlab.net/gitlab-com/gl-infra/gitlab-restore/postgres-gprd)),
- participation in PoC reviewing by an engineer who has experience in the development of database-related MRs.

## Comparison of SQL micro-optimization methods

![Delays in MRs](comparison.png){: .shadow}

## Further reading

Continue reading: [the Design document](/handbook/engineering/infrastructure/library/database/postgres/query-optimization-bot/design/), describing the implementation and results of using the proposed solution in September 2019 – January 2020.

------

Author: [Nikolay Samokhvalov](https://gitlab.com/NikolayS)
{: .note}