---
layout: markdown_page
title: "PostgreSQL High Availability Solution"
---

## On this page
{:.no_toc}

- TOC
{:toc}

## Idea/Problem Statement

GitLab.com uses [repmgr](https://repmgr.org) to manage replication and failover for its cluster of PostgreSQL servers; it is also shipped to on-prem customers in Omnibus. However, repmgr has some drawbacks.

### From an architectural and algorithmic perspective

Repmgr does not state which algorithm it uses for cluster operations, such as electing a new master or fencing the old one, nor how it ensures cluster consistency in all cases. Only a deep inspection of the source code may reveal it; there is no mention on the project website or in its documentation. It is obviously not based on a distributed consensus algorithm like Paxos or Raft. This means that, on theoretical grounds, repmgr operations are likely prone to failure, or at best provide significantly weaker correctness guarantees.

From an architectural perspective, repmgr writes its state to the primary node. Unless purely synchronous replication is used, which is undesirable for both performance and availability reasons, the cluster state and monitoring information is inconsistent across the repmgrd daemons observing each node of the cluster (since state needs to be written to the master and then replicated asynchronously). It therefore becomes quite hard to assess how an implementation of leader election or other operations can be correct when it operates over (potentially) different views of the cluster state on each node.

### From the implementation perspective

Repmgr does not support switchover and automatic failover with the same configuration: switchover is incompatible with automatic failover, under risk of split-brain. Quoting the repmgr [docs](https://repmgr.org/docs/4.0/switchover-caveats.html):

> repmgrd should not be running with setting failover=automatic in repmgr.conf
> when a switchover is carried out, otherwise the repmgrd daemon may try and
> promote a standby by itself.

#### "Degraded" mode

According to the repmgr [docs](https://repmgr.org/docs/4.0/repmgrd-degraded-monitoring.html), there are certain cases (and it's hard to consider them improbable) in which, after a failover is triggered, repmgr may end up in a state where no subsequent action can be taken, failing to promote a given node (i.e. leaving the cluster with no master). The only possible outcomes are that the cluster returns to the state expected for the operation to finally succeed or, if a timeout is configured, that repmgrd exits. It also seems that repmgr is not capable of trying other possible failover targets if the first one doesn't succeed. Quoting the repmgr docs:

> In certain circumstances, repmgrd is not able to fulfill its primary mission
> of monitoring the nodes' upstream server. In these cases it enters "degraded
> monitoring" mode, where repmgrd remains active but is waiting for the
> situation to be resolved.
>
> Situations where this happens are:
>
> * A failover situation has occurred, no nodes in the primary node's location are visible
> * A failover situation has occurred, but no promotion candidate is available
> * A failover situation has occurred, but the promotion candidate could not be promoted
> * A failover situation has occurred, but the node was unable to follow the new primary
> * A failover situation has occurred, but no primary has become available
> * A failover situation has occurred, but automatic failover is not enabled for the node
> * repmgrd is monitoring the primary node, but it is not available (and no other node has been promoted as primary)
>
> By default, repmgrd will continue in degraded monitoring mode indefinitely.
> However a timeout (in seconds) can be set with `degraded_monitoring_timeout`,
> after which repmgrd will terminate.

repmgr monitors PostgreSQL and the other nodes via a standard PostgreSQL wire protocol connection. While an unprivileged user would normally suffice, many installations use a privileged user, as it is the same one used for replication purposes, which is fine in itself. However, if a database is saturated with as many as `max_connections` connections, new connections will be dropped, and this may also apply to repmgr's connections. This confuses repmgr and can end up triggering a failover. Exactly [this problem](https://gitlab.com/gitlab-com/migration/issues/770) was already experienced in GitLab's production environment, shortly after migrating from Azure to GCP.

During a network split, it may happen that the group of nodes containing the old primary has an even number of members. To avoid possible split-brain scenarios, a witness server is recommended, but this approach is [non-scalable](https://repmgr.org/docs/4.0/repmgrd-network-split.html):

> repmgr enables provision of "witness server" to artificially create a quorum
> of servers in a particular location, ensuring that nodes in another location
> will not elect a new primary if they are unable to see the majority of nodes.
> However this approach does not scale well, particularly with more complex
> replication setups.

Nor is it a complete solution to the problem: there could be more complex scenarios (combinations of more nodes, some of them disconnected from others due to network partitions) where the witness server may not suffice to form a majority. We believe that in these cases repmgr will leave the cluster in a state with no master.

Because state is written to the same database that is being monitored and replicated asynchronously, this state cannot be updated if the primary fails. This includes repmgr's event notification system, which then becomes unavailable, making it [very fragile](https://repmgr.org/docs/4.0/event-notifications.html):

> Each time repmgr or repmgrd perform a significant event, a record of that
> event is written into the repmgr.events table together with a timestamp, an
> indication of failure or success, and further details if appropriate. This is
> useful for gaining an overview of events affecting the replication cluster.
> However note that this table has advisory character and should be used in
> combination with the repmgr and PostgreSQL logs to obtain details of any
> events. […] Note that under some circumstances (e.g. when no replication
> cluster primary could be located), it will not be possible to write an entry
> into the repmgr.events table, in which case executing a script via
> event_notification_command can serve as a fallback by generating some form
> of notification.
#### Lack of tests

The repmgr source code repository contains no failover tests, neither unit nor integration. Only [one test](https://github.com/2ndQuadrant/repmgr/blob/8c422d60849b5b474bb356041a09a5ad50f9420a/sql/repmgr_extension.sql) is present in the repository, and it only checks that the functions installed with the extension are indeed present.

## Design

To address repmgr's shortcomings, we consider using [Patroni](https://github.com/zalando/patroni). It defines itself as "A template for PostgreSQL High Availability with ZooKeeper, etcd, or Consul".

Patroni does not directly implement a consensus algorithm; instead it relies on a DCS (Distributed Consensus Store, which already implements a consensus algorithm) to support its operations. This means that:

* An external cluster of either etcd, Consul or ZooKeeper is required (Consul is recommended). This means more operational complexity.
* Any of the three provides similar guarantees: the DCS cluster always presents a consistent view to its users (Patroni in this case), regardless of the cluster state (as long as a majority of its nodes is alive; otherwise the cluster becomes unavailable and will not service queries) and regardless of which node you connect or talk to.

Being based on a DCS, Patroni's job is greatly simplified. A brief description of how it operates follows (a minimal sketch of the underlying leader-lock pattern is included at the end of this section):

* Every node periodically writes its position in the replication stream (lag) to the DCS. All nodes see the same information, all the time.
* The master node holds a lock that it needs to renew every few seconds. If it doesn't, then, since all nodes see the same data, an election is triggered.

In other words, the HA software is greatly simplified by offloading the distributed consensus to the external DCS, at the cost of higher operational complexity. However, the DCS provides very strong correctness guarantees, so confidence in the solution is significantly higher.

If we factor in other considerations required for HA, which also need to be taken into account with Patroni, the HA architecture may look similar to the following (simplified) diagram:

![Patroni Cluster Architecture](https://docs.google.com/drawings/d/e/2PACX-1vQc_ID9tsco9Rv8K692gBz640-L8-ICMa1nH0YsUfNUl9FAd1clE2Ls-wwmpuyBs3c89LkmEOopNhaA/pub?w=985&h=1356)

Apart from the extra operational complexity due to the use of an external DCS, Patroni has another significant drawback: as its own definition states, it is a "template", not a full-blown HA solution. This means that extra work (possibly significant) is required to implement the "missing pieces", mostly the entrypoint and fencing, which are heavily environment-dependent.

Patroni is not a standalone component of the architecture; rather, it is deeply integrated with PostgreSQL. In particular, Patroni:

* Wraps PostgreSQL
* Manages PostgreSQL configuration
* Controls the PostgreSQL lifecycle
* Performs the initial initdb

The last point is especially relevant, as by default Patroni expects the database cluster to be empty (no data). This obviously can't be the case here, so for the transient state of the migration from the current cluster to Patroni's cluster, Patroni would be set up in a special "standby" mode, where it is "watching" but not acting, while data is replicated via normal streaming replication from the original cluster. At switchover time, Patroni would be activated without performing an initdb.
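To make the leader-lock pattern described above more concrete, here is a minimal, hypothetical sketch (not Patroni's actual code), assuming a local Consul agent on `localhost:8500`, the Python `requests` library, and made-up key and node names. It creates a Consul session with a TTL and repeatedly tries to acquire a key with it, which is essentially the DCS primitive a Patroni-style leader election rests on:

```python
import time

import requests

CONSUL = "http://localhost:8500"           # assumed local Consul agent
LEADER_KEY = "service/pg-cluster/leader"   # hypothetical lock key
NODE_NAME = "pg-node-1"                    # hypothetical name of this node
TTL_SECONDS = 15

# A Consul session with a TTL acts as this node's lease: if the node dies and
# stops renewing it, the session expires and any lock it held is released.
session_id = requests.put(
    f"{CONSUL}/v1/session/create",
    json={"Name": NODE_NAME, "TTL": f"{TTL_SECONDS}s", "LockDelay": "1s"},
).json()["ID"]


def try_acquire_leader() -> bool:
    """Try to take (or keep) the leader key with our session.

    Consul guarantees that at most one session can hold the key at a time,
    which is what turns this key into a cluster-wide leader lock.
    """
    resp = requests.put(
        f"{CONSUL}/v1/kv/{LEADER_KEY}",
        params={"acquire": session_id},
        data=NODE_NAME,
    )
    return resp.json() is True


while True:
    # Renew the lease on every loop so it never expires while we are alive.
    requests.put(f"{CONSUL}/v1/session/renew/{session_id}")

    if try_acquire_leader():
        print(f"{NODE_NAME} is the leader")   # a real HA daemon would run PostgreSQL as primary here
    else:
        print(f"{NODE_NAME} is a follower")   # ...or make PostgreSQL replicate from the current leader

    time.sleep(TTL_SECONDS / 3)
```

Patroni's real implementation of course does much more (demotions, watchdog integration, an abstraction layer over etcd, Consul and ZooKeeper), but the correctness of its elections ultimately rests on DCS primitives of this kind.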
There is one Patroni per PostgreSQL node, as they are bundled together within the same instance. Indeed, Patroni is the process launched by the init program, and it launches PostgreSQL in turn.

Consul is the recommended DCS, as it is already used by GitLab and shipped with Omnibus. It is in turn recommended that Patroni communicates with Consul via a Consul Agent, which acts as a proxy; thus every Patroni has a connection to the Consul Agent.

Patroni also exposes an HTTP API. This API could be used for dynamic reconfiguration of Patroni or even PostgreSQL, but in our case it is mostly interesting for serving the health check of the load balancer that will act as the entry point for the application. Patroni exposes the two following URLs:

* http://patroni.i.p:8008/master returns HTTP 200 if this node is the master node of the cluster, or HTTP 503 (Service Unavailable) otherwise.
* http://patroni.i.p:8008/replica returns HTTP 200 if this node is a replica node of the cluster, or HTTP 503 if it is the master.

With this information it is easy to construct URLs for the health check of the entrypoint (a minimal probe sketch is included further below). For a read-write channel, the /master URL can be used: at most one node will respond with 200 to this URL, and the load balancer of the entry point will always route connections to the master.

### The Entrypoint

The entrypoint is the mechanism by which PostgreSQL clients connect to the PostgreSQL instances. While in a "normal" scenario this is simple (for instance, you may hardcode the IP and port), it doesn't work in a dynamic scenario where the master node may change at any time.

GCP internal load balancers (ILB) may be used to serve as an entry point. Consider a read-write connection to PostgreSQL: one proxied service may be created on the ILB for the PostgreSQL cluster. The ILB exposes a virtual IP (provided by its internal high-availability mechanism) to PostgreSQL clients. At the same time, the ILB uses the health check URLs provided by Patroni (as previously described), which makes the ILB always point to the correct master, even after a failover.

Of course, applications should be (re)tested to ensure they reconnect on connection failure (as that is what the application will perceive in case of a failover). This is not exclusive to automatic failover and should always be tested, in any scenario.

### Implementation Considerations

#### Testing

A new Patroni cluster is to be provisioned in staging in standby mode, replicating from the existing PostgreSQL cluster. The new cluster should have the same configuration and specs as the existing one.

To minimize the maintenance window required to migrate the production PostgreSQL cluster to Patroni, multiple migration rehearsals should be performed in staging until we have a smooth migration plan. After each migration, it is recommended that a full QA suite be run against staging as a sanity check that nothing has changed from the client's perspective.

#### GitLab.com and Self-managed

Patroni should work for self-managed installations the same way it does for GitLab.com, with one difference: the internal load balancer (ILB) as an entrypoint. GitLab.com uses an ILB because it is cloud-hosted and the ILB is guaranteed to be highly available, but self-managed installations may not have access to such a solution.
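Whatever the entrypoint ends up being (an ILB for GitLab.com, or pgbouncer for self-managed installations, as discussed next), it relies on the Patroni health-check endpoints described earlier. As a minimal illustration, the following sketch, assuming hypothetical node addresses and the default REST API port 8008, probes /master and /replica to determine each node's role:

```python
import requests

# Hypothetical Patroni nodes; the Patroni REST API listens on port 8008 by default.
PATRONI_NODES = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]


def node_role(host: str, timeout: float = 2.0) -> str:
    """Return 'master', 'replica' or 'unknown' based on Patroni's health endpoints.

    Patroni answers HTTP 200 on /master only on the current leader and
    HTTP 200 on /replica only on nodes running as replicas; anything else
    (503, timeouts, connection errors) means the node must not receive traffic.
    """
    for role in ("master", "replica"):
        try:
            resp = requests.get(f"http://{host}:8008/{role}", timeout=timeout)
        except requests.RequestException:
            return "unknown"
        if resp.status_code == 200:
            return role
    return "unknown"


if __name__ == "__main__":
    for host in PATRONI_NODES:
        print(host, node_role(host))
```

An ILB health check or a Consul service check performs essentially the same HTTP probe, only continuously and per node.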
For such installations, the Omnibus [docs](https://gitlab.com/gitlab-org/gitlab-ee/blob/f164be45167f0e239262feed0fa6fc761ee8350a/doc/administration/high_availability/database.md#architecture) already suggest pgbouncer as an entrypoint, and it can be used in lieu of an ILB. There are two approaches that can be used:

1. Continue using the Consul watchers we ship with Omnibus (after adapting them to Patroni) to update the pgbouncer configuration to point to the current PostgreSQL master, or
1. Configure pgbouncer with a DNS name provided by a Consul [service](https://www.consul.io/docs/agent/services.html) that resolves to the current PostgreSQL master (a minimal sketch of this approach is included at the end of this page).

Both approaches can leverage Patroni's API.

For existing self-managed installations that want to migrate from repmgr to Patroni, [Patroni can take over a running cluster](https://github.com/zalando/patroni/blob/93d157dea39e40bb6fdfa3c5172e7b2622795e10/docs/existing_data.rst) when configured properly, but it would probably require careful coordination by Omnibus.

### Operational Considerations

#### Automation

Once a Patroni cluster is up and running, adding or removing cluster members happens automatically when a node is newly created or decommissioned.

#### Monitoring

Existing alerts for PostgreSQL and pgbouncer should still function as expected. New metrics for Patroni need to be created, as Patroni itself doesn't expose metrics. This can be achieved by scraping Patroni logs to alert on high error rates or when the Patroni process itself is down.

### Additional Considerations

N/A

### Alternatives

#### Stolon

[Stolon](https://github.com/sorintlab/stolon) is a consensus-based high availability solution. Instead of implementing a consensus algorithm itself, it relies on an external service for this purpose, similar to Patroni. This external service can be etcd, Consul, or the Kubernetes API.

Stolon has a unique approach to the entrypoint problem: instead of relying on external components to tell applications how to route connections to the (current) master, Stolon brings its own PostgreSQL-specialized network proxy to serve as the entry point. This is a very good architectural decision, as it provides transparency for the client, self-containment of the HA solution, and fencing capabilities (no client can connect to a master that became disconnected from the rest of the nodes, since the proxy won't allow it). The proxy is fed from the external consensus service, so it always has a consistent view.
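As a complement to the second self-managed entrypoint approach listed in the "GitLab.com and Self-managed" section above (a Consul service resolving to the current master), here is a minimal, hypothetical sketch, assuming a local Consul agent, the Python `requests` library, and made-up addresses and service names. It registers a `postgresql` service whose health check is Patroni's /master endpoint, so that only the current leader is returned by Consul DNS:

```python
import requests

CONSUL_AGENT = "http://localhost:8500"   # assumed local Consul agent
PATRONI_HOST = "10.0.0.11"               # hypothetical address of this node
PATRONI_PORT = 8008                      # default Patroni REST API port

# Register a "postgresql" service on this node, tagged "master", whose health
# check is Patroni's /master endpoint. Consul DNS only returns instances whose
# checks are passing, so at most one node (the leader) resolves at any time.
service = {
    "Name": "postgresql",
    "Tags": ["master"],
    "Port": 5432,
    "Check": {
        "HTTP": f"http://{PATRONI_HOST}:{PATRONI_PORT}/master",
        "Interval": "5s",
        "Timeout": "2s",
    },
}

resp = requests.put(f"{CONSUL_AGENT}/v1/agent/service/register", json=service)
resp.raise_for_status()
print("service registered; only the node passing the /master check will resolve")
```

pgbouncer could then be configured to connect to a name such as `master.postgresql.service.consul` (the exact name depends on the Consul domain configuration), and a failover is reflected automatically once the health checks flip to the new leader.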