--- layout: markdown_page title: "Joe, a Postgres query optimization bot" --- ## On this page {:.no_toc} - TOC {:toc} ## Resources Epic: [gl-infra/76](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/76) Blueprint: [Postgres query optimization bot to boost backend development process](/handbook/engineering/infrastructure/library/database/postgres/query-optimization-bot/blueprint/) ## Problem Statement and Idea Any database schema and database-related application code changes need verification. The verification process can be considered as entirely relevant and efficient (development process-wise) only when backend engineers can perform such verification on production-scale data sets and without a DBRE's help. Until Autumn 2019, at GitLab, performing verification of such changes on production-scale data sets implied the following problems: - there is only a single staging environment in GitLab, and its database size is ~10 times less than production; - not all backend engineers have access to production (which is typical for any fast-growing or grown organization) and those who have it, have only limited, read-only access; - for engineers with production access, it is not reasonable to verify heavy queries right on production nodes, due to high risks to affect production systems performance (and sometimes such verification is impossible at all – for example, consider any DB schema changes); - troubleshooting of `INSERT`/`UPDATE`/`DELETE`, `TRUNCATE` and DDL (`ALTER`, `CREATE`, `DROP`) on production is, obviously, prohibited; - provisioning of production clone takes significant time and manual efforts and in case of multiple iterations of change verification are needed, it becomes incredibly inconvenient, since for each iteration, a new provisioning is required. As described in [the Blueprint]((/handbook/engineering/infrastructure/library/database/postgres/query-optimization-bot/blueprint/)), a new approach was proposed: the use of [Joe bot](https://gitlab.com/postgres-ai/joe) and [Database Lab](https://gitlab.com/postgres-ai/database-lab) created and maintained by [Postgres.ai](https://postgres.ai). The main ideas behind them are: - Joe is a Slack bot, working similarly to the current chatops tool, - Joe talks directly to an independent Postgres instance on ZFS filesystem (thin clone), containing a database cloned from production (restored from WAL-E/WAL-G archive), - the bot can provide detailed execution plans to users, without exposing the data directly, - the bot can apply any database schema changes such as new index creation, - thanks to Database Lab and ZFS snapshots and clones, it is possible to revert all the changes made by users quickly, including those that Postgres cannot revert itself (such as a committed transaction that dropped some DB objects), - moreover, Database Lab uses thin cloning technique to allow having independent Postgres databases operating on the same machine, so developers working on micro-optimization tasks do not interfere with each other, while the GCP budgets are fixed, - the bot provides additional explanations of the execution plan and suggestions on query optimization, - for security reasons, all actions are logged in a special audit log, including timestamp, action details, and requesting person ID and name. As a result, with the new tool, any backend developer from the GitLab team is capable of verifying any SQL query plans on a production-scale database clone, without having access to production and not waiting for help from SREs or DBREs. Engineers without in-depth query optimization knowledge also benefit from optimization bits of advice the bot provides. ## Design ![The architecture of Joe, a Postgres query optimization bot](joe_with_control_instance.png){: .shadow} ### User Interface #### Slack Users (backend developers and DBREs) interact with a production database clone using Slack. They can execute queries and get execution plans with performance metrics without direct access to the data. Slack app configuration is described in [Joe's README](https://gitlab.com/postgres-ai/joe#slack-app). ##### Chatbot commands - `help`: show available commands. - `explain`: get the execution plan for an SQL statement. First, it performs just `EXPLAIN` for the provided query, which is very fast because no actual query execution is performed. It immediately allows understanding how "bad" the plan is. Immediately after this `EXPLAIN (BUFFERS, ANALYZE)` for the query is being performed. Once it's done, the detailed execution plan is provided, with some optimization recommendations. Example: `explain select * from issues where id = 1;` - `exec`: execute any query. Usually, it is used to modify the database schema. For example, to create a new index: `exec create index i_new_1 on issues (last_edited_by_id);` - `reset`: reset the state of your database clone. Useful for iterating. It is not needed at the end of the session: all sessions are independent, and after 2 hours of inactivity, your thin clone will be deleted automatically, so that the next session starts in a "fresh" state again. - `\d`, `\d+`, `\dt`, `\dt+`, `\di`, `\di+`, `\l`, `\l+`, `\dv`, `\dv+`, `\dm`, `\dm+` — psql meta-information commands. Useful to explore the current DB schema, learn current sizes of any DB object. You can find each command explaination in the [official PostgreSQL documentation](https://www.postgresql.org/docs/current/app-psql.html). #### CLI Slack as an interface is very accessible. It provides some basic collaboration and search capabilities. At the same time, it is not always (and not for everyone) convenient during development. Having a CLI as an alternative interface might be a good option. [Database Lab](https://gitlab.com/postgres-ai/database-lab#usage) provides CLI for those who are granted full access to production data. With Database Lab CLI it becomes possible to request for a thin clone of production database, get it in a few seconds, and then work using any tool, including official CLI (psql) or any third-party GUI. #### GUI GUI with visual presentation of execution plans is another convenient feature that can be implemented in the future. Examples of visual representation of plans that can be integrated with Joe: - [explain.depesz.com](https://explain.depesz.com/) ([GitLab repo](https://gitlab.com/depesz/explain.depesz.com)), - [Postgres Explain Visualizer (pev)](http://tatiyants.com/postgres-query-plan-visualization/) ([GitHub repo](https://github.com/AlexTatiyants/pev)), - [explain.dalibo.com](https://explain.dalibo.com), - [explain.tensor.ru](https://explain.tensor.ru) (Russian only). As of January 2020, many developers at GitLab prefer using explain.depesz.com to visualize queries. At Postgres.ai, there is work in progress to provide better means to integrate Joe with the best Postgres plan visualization tools. ### Storage and Filesystem The data (a clone of production database) is located on an SSD Persistent Disk (PD) in Google Cloud. The data on the PD is continuously updated thanks to Database Lab's so-called "shadow instance" – a Postgres instance which continuously receives and replays WAL data from WAL archive (WAL-E/WAL-G). Periodically, the baseline snapshot used for thin cloning is being refreshed (as of January 2020, manually, 2-4 times per month; we plan to switch to automated refreshes in the nearest future). ### GCE Instances For better disk IO characteristics, GCE instances have to have more vCPUs. The best performance is achieved with 60 or more vCPUs (see [Google's official documentation](https://cloud.google.com/compute/docs/disks/performance#ssd-pd-performance) with PD performance specifications). As of July 2019, GitLab.com's production Postgres nodes have machine type `n1-highmem-64`, with 64 vCPUs and 416 GiB of memory (Update: a few months later, GitLab.com database switched to `n1-highmem-96`, with 96 vCPUs and 624 GiB). It means that sustained disk throughput is 1,200 MiB/s for reads and 400 MiB/s for writes. Joe is not designed to work with hundreds or thousands of users simultaneously, which would imply having many gigabytes of hot caches. So, higher volumes of memory are not required. Use of `n1-highcpu-32` with 32 vCPUs and 28.8 GiB of memory looks reasonable, meaning that a similar level of disk characteristics will be achieved for much fewer budgets. For additional budget savings, it is reasonable to consider using a preemptible instance (price for preemptible `n1-highcpu-32` is 78.8% lower compared to the regular `n1-highcpu-32`, see [the official documentation](https://cloud.google.com/compute/vm-instance-pricing)), re-provisioning such instance only when it is needed, keeping PD independent and attaching it, or re-creating it from a PD snapshot. However, this requires a fully automated and hassle-free approach to Database Lab instance provisioning. The planned implementation of Kubernetes support in Database Lab and Joe bot will meet these needs. ### HTTP Server To receive Slack events, we need to set up a publicly available hostname, and NGINX server with SSL enabled. We use 443 port for SSL connection. An SSL certificate has to be generated for the hostname. ### Implementation Considerations #### Testing For the period of initial testing, we provisioned an instance of type `n1-highcpu-32`, with 6 TiB standard (not "SSD") PD. Obtaining of Postgres database clone was done using the so-called "restore" project that is used for daily backup verification. On this PD, we created a ZFS pool using the following command: ```bash sudo zpool create \ -O compression=on \ -O atime=off \ -O recordsize=8k \ -O logbias=throughput \ -m /var/opt/gitlab \ zpool "/dev/disk/by-id/google-${INSTANCE_NAME}-data-disk" \ -f ``` No additional ZFS tuning was performed during this test (such as setting certain ARC cache values). The production database was restored from the WAL-E archive (in the same way that it is done daily for backup verification). Then, the database was promoted to become a primary (to allow changes). Right after that, a ZFS snapshot was created to support Joe's `reset` command. The restoring of the database was done on the same node. So for testing purposes, the setup included only one GCE instance: ![Single instance architecture](joe_single_instance.png){: .shadow} Joe bot was added to GitLab's Slack workspace. Slack channel [#database-lab](https://gitlab.slack.com/messages/CLJMDRD8C) was created. ![Example of work with Joe in Slack](joe_example.png){: .shadow} #### Security Aspects Joe works with a clones of production databases. Users see only query execution plans, they do not have direct access to the data. On database machines being by Joe, only 22 and 443 ports are open. The latter is used to organize communication with Slack. PostgreSQL is configured to work only locally. While direct access to the data is not allowed, users can get information about presense of concrete values in the database (example: analyzing the pland of `SELECT .. FROM t1 WHERE c1 = :some_value` we can understand how many records with `c1 = :some_value` are present in the table `t1`). This makes possible to use [blind SQL injection](https://www.owasp.org/index.php/Blind_SQL_Injection): each query allows to leak some bits of information. The following is implemented in Joe code to mitigate these risks: - query rates limits for each user, - audit log to track any user's activity. #### Multi-user Support Joe relies on Database Lab to provide thin provisioning capabilities. Database Lab uses thin clones (in this setup implemented using ZFS snapshots and clones) to provide all users fully independent database clones of 6 TiB production database in a few seconds. #### Cache Control It is not uncommon that during SQL query optimization, developers overlook the fact that query performance highly depends on the state of OS and Postgres caches (the file cache and the buffer pool). Joe helps to understand, was the most of the data obtained from the Postgres buffer pool or OS cache, or it was mostly read from the disk. There are plans to implement simple commands to allow users to reset both caches at any time. This work is in progress. ### Operational Considerations #### Automated Provisioning Currently, Joe is installed and is being maintained using semi-automated approach. We are working on automated provisioning on Kubernetes and GCP infrastructure. #### Monitoring Joe's process writes to the local logfile, `joe.log`. Audit records are marked with `AUDIT` prefix. The future work includes setting up the log shipping for centralized analysis, as well as technical monitoring and alerting describing the state of Database Lab and Joe bot (free disk space, CPU usage, etc.) ## Testing Period. Summary First appeared in Summer 2019 in Slack channel [#database-lab](https://gitlab.slack.com/archives/CLJMDRD8C) (internal link), Joe quickly received attention from GitLab engineers. As of January 2020, more than 70 people have joined the `#database-lab` channel, and 49 of them have used Joe to optimize SQL at least once during the last 3 months. The old chatops tools (allowing to EXPLAIN queries on a production clone, limited to read-only queries lasting not more than 15 seconds) is still being used [The analysis](https://docs.google.com/spreadsheets/d/1c46FKjBXcW89goY-tLVrBXxW7Y_cbIJrV7UGBZiRKbo/edit#gid=1110872840) of both tools for the period of 4 weeks, since December 31, 2019, till January 27, 2020, revealed the following: - the number of commands sent to Joe is more than 4 higher than those sent to chatops (698 vs. 156); - on average, Joe receives 28 commands per day (excluding "technical" commands such as help, reset), while chatops gets ~6 commands daily; - the average amount of commands in daytime communication between a developer and Joe is 8.8, while for chatops it is 2.9; - meanwhile, still, many developers use chatops: during 4 weeks analyzed, 24 people communicated with the chatops tool and 27 talked to Joe. ![How many users used chatops vs. Joe in January 2020](joe_and_chatops_202001_users.png){: .shadow} ![How many SQL optimization calls were made to chatops vs. Joe in January 2020](joe_and_chatops_202001_calls.png){: .shadow} See also: [Comparison of SQL micro-optimization methods](/handbook/engineering/infrastructure/library/database/postgres/query-optimization-bot/blueprint/#comparison-of-sql-micro-optimization-methods) ## Future Work - Extend documentation for GitLab developers explaining how to use Joe - Upgrade Joe to new version to speed up thin cloning (from 30-40 seconds to 1-2 seconds), and add various usability improvements - Support 'explain-rollback' ([MR](https://gitlab.com/postgres-ai/joe/-/merge_requests/64)) - Means to verify Rails DB migrations ([MR](https://gitlab.com/postgres-ai/joe/-/merge_requests/55)) - Provide cache control - Provide ability to share a session or attach to someone's session - Improve optimization recommendations, suggest indexes, partitions, etc. - Switch to Kubernetes to get rid of manual control - Integrate visualization tool - Provide history and plan sharing capabilities for better collaboration ------ Author: [Nikolay Samokhvalov](https://gitlab.com/NikolayS) {: .note}