Add a configuration option (`set_acls`) for Zookeeper DCS so that Kazoo will apply a default ACL for each znode that it creates. The intention is to improve security of the znodes when a single Zookeeper cluster is used as the DCS for multiple Patroni clusters.
Zookeeper [does not apply an ACL to child znodes](https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#sc_ZooKeeperAccessControl), so permissions can't be set at the `scope` level and then be inherited by other znodes that Patroni creates.
Kazoo instead [provides an option for configuring a default_acl](https://kazoo.readthedocs.io/en/latest/api/client.html#kazoo.client.KazooClient.__init__) that will be applied on node creation.
Example configuration in Patroni might then be:
```
zookeeper:
  set_acls:
    CN=principal1: [ALL]
    CN=principal2:
      - READ
```
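As a rough illustration of how such a mapping could be turned into ACL entries, the sketch below expands each principal's permission list into a ZooKeeper permission bitmask. The `x509` scheme default and the helper name `build_default_acl` are assumptions made for this example and are not taken from the patch; the bit values are ZooKeeper's standard permission constants.

```python
# ZooKeeper permission bits (standard values from the ZooKeeper ACL model).
PERMS = {"READ": 1, "WRITE": 2, "CREATE": 4, "DELETE": 8, "ADMIN": 16}
PERMS["ALL"] = sum(PERMS.values())  # 31


def build_default_acl(set_acls, scheme="x509"):
    """Turn {principal: [permission names]} into (scheme, id, bitmask) tuples.

    The scheme default is an assumption of this sketch.
    """
    acl = []
    for principal, names in set_acls.items():
        mask = 0
        for name in names:
            mask |= PERMS[name.upper()]
        acl.append((scheme, principal, mask))
    return acl


# The example configuration from above, already parsed into a dict.
config = {"CN=principal1": ["ALL"], "CN=principal2": ["READ"]}
default_acl = build_default_acl(config)
```

With Kazoo, entries equivalent to these would be passed as `default_acl` when the client is created, so that every znode created by that client receives them.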
Add support for the ETCD SRV name suffix, as described in the ETCD docs:
> The -discovery-srv-name flag additionally configures a suffix to the SRV name that is queried during discovery. Use this flag to differentiate between multiple etcd clusters under the same domain. For example, if discovery-srv=example.com and -discovery-srv-name=foo are set, the following DNS SRV queries are made:
>
> _etcd-server-ssl-foo._tcp.example.com
> _etcd-server-foo._tcp.example.com
All tests pass, but this has not been tested on a live ETCD system yet. Please take a look and send feedback.
Resolves #2028
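The SRV names quoted above can be derived mechanically from the domain and the optional suffix. The following is a minimal sketch; the function name `srv_record_names` is hypothetical and only illustrates the naming rule, not the code in this PR.

```python
def srv_record_names(domain, srv_suffix=None):
    """Return the SRV names to query, TLS variant first, as in the quoted docs."""
    suffix = "-{0}".format(srv_suffix) if srv_suffix else ""
    return ["_etcd-server-ssl{0}._tcp.{1}".format(suffix, domain),
            "_etcd-server{0}._tcp.{1}".format(suffix, domain)]
```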
If configured, only IPs matching the rules will be allowed to call unsafe endpoints.
In addition to that, it is possible to automatically include the IPs of the cluster members in the list.
If neither of the above is configured the old behavior is retained.
Partially address https://github.com/zalando/patroni/issues/1734
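The check described above could look roughly like the sketch below. It assumes that rules are written as single IPs or CIDR networks and that member IPs are passed in only when their automatic inclusion is enabled; the function and parameter names are hypothetical.

```python
import ipaddress


def is_allowed(client_ip, rules, member_ips=()):
    """Return True if client_ip may call unsafe endpoints.

    rules: IPs or CIDR networks (format assumed for this sketch).
    member_ips: IPs of cluster members, passed only if they should be included.
    """
    rules = list(rules)
    member_ips = list(member_ips)
    if not rules and not member_ips:
        return True  # nothing configured: old behavior, no IP filtering
    addr = ipaddress.ip_address(client_ip)
    if any(addr == ipaddress.ip_address(ip) for ip in member_ips):
        return True
    return any(addr in ipaddress.ip_network(rule, strict=False) for rule in rules)
```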
Effectively, this PR consists of a few changes:
1. The easy part:
If permanent logical slots are defined in the global configuration, Patroni on the primary will not only create them, but also periodically update DCS with the current values of `confirmed_flush_lsn` for all these slots.
In order to reduce the number of interactions with DCS, a new `/status` key was introduced. It contains a JSON object with `optime` and `slots` keys. For backward compatibility, `/optime/leader` will still be updated if there are members running an old Patroni version in the cluster.
2. The tricky part:
On replicas that are eligible for a failover, Patroni creates the logical replication slot by copying the slot file from the primary and restarting the replica. In order to copy the slot file Patroni opens a connection to the primary with `rewind` or `superuser` credentials and calls `pg_read_binary_file()` function.
When the logical slot already exists on the replica Patroni periodically calls `pg_replication_slot_advance()` function, which allows moving the slot forward.
3. Additional requirements:
In order to ensure that the primary doesn't clean up tuples from pg_catalog that are required for logical decoding, Patroni enables `hot_standby_feedback` on replicas with logical slots and on cascading replicas if they are used for streaming by replicas with logical slots.
4. When logical slots are copied from the primary to the replica, there is a timeframe during which it might not be safe to use them after promotion. Right now there is no protection against promoting such a replica, but Patroni will show a warning with the names of the slots that might not be safe to use.
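To make the data flow in points 1 and 2 more concrete, here is a small sketch of the `/status` value and of how a replica could decide which slots need to be moved forward. It assumes that LSNs are stored as integers; the helper names are hypothetical, and the actual call to `pg_replication_slot_advance()` is left out.

```python
import json


def parse_lsn(lsn):
    """Convert a textual LSN such as '0/3000060' into an integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def build_status(optime, slots):
    """Serialize the value of the /status key (keys as named in the text)."""
    return json.dumps({"optime": optime, "slots": slots}, sort_keys=True)


def slots_to_advance(status_json, local_slots):
    """Return {slot: target} for local slots that are behind the value in DCS.

    local_slots maps slot name -> confirmed_flush_lsn (integer) on the replica.
    """
    published = json.loads(status_json)["slots"]
    return {name: target for name, target in published.items()
            if name in local_slots and local_slots[name] < target}
```

Each entry returned by `slots_to_advance` would then be passed as the slot name and target LSN to `pg_replication_slot_advance()` on the replica.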
Compatibility.
The `pg_replication_slot_advance()` function is only available starting from PostgreSQL 11. For older Postgres versions Patroni will refuse to create the logical slot on the primary.
The old "permanent slots" feature, which creates logical slots right after promotion and before allowing connections, was removed.
Close: https://github.com/zalando/patroni/issues/1749
This commit makes it possible to configure the maximum lag (`maximum_lag_on_syncnode`) after which Patroni will "demote" the node from synchronous and replace it with another node.
The previous implementation always tried to stick to the same synchronous nodes (even if they are not optimal ones).
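A simplified sketch of the selection rule, assuming lag is measured in bytes and that a non-positive limit means "no limit" (both are assumptions of this example, as is the function name):

```python
def pick_sync_node(current, lags, maximum_lag_on_syncnode):
    """Choose the synchronous node.

    lags maps member name -> replication lag in bytes.
    """
    if current in lags:
        within_limit = (maximum_lag_on_syncnode <= 0
                        or lags[current] <= maximum_lag_on_syncnode)
        if within_limit:
            return current  # keep the current node while it is within the limit
    if not lags:
        return None
    return min(lags, key=lags.get)  # otherwise prefer the least lagging member
```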
The Python SSL library allows for the inclusion of a password in its "load_cert_chain" function when setting up an SSLContext [1].
This allows an encrypted key file in PEM representation to be loaded into the certificate chain.
This commit adds the optional "keyfile_password" parameter to the REST API block of configuration so that Patroni can load encrypted private keys when establishing its TLS socket.
This also adds the corresponding "PATRONI_RESTAPI_KEYFILE_PASSWORD" environment variable, which has the same effect.
[1] https://docs.python.org/3/library/ssl.html#ssl.SSLContext.load_cert_chain
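For illustration, a minimal sketch of how the password could reach `load_cert_chain`. The `certfile` and `keyfile` keys, the helper names, and the precedence of the environment variable over the configuration file are assumptions of this example rather than details stated in the commit.

```python
import os
import ssl


def keyfile_password(restapi_config, environ=os.environ):
    """Resolve the key password; the environment variable wins (assumed)."""
    return environ.get("PATRONI_RESTAPI_KEYFILE_PASSWORD",
                       restapi_config.get("keyfile_password"))


def make_ssl_context(restapi_config, environ=os.environ):
    """Build a server-side SSLContext, decrypting the private key if needed."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile=restapi_config["certfile"],
                        keyfile=restapi_config.get("keyfile"),
                        password=keyfile_password(restapi_config, environ))
    return ctx
```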
There are sometimes good reasons to manage replication slots externally
to Patroni. For example, a consumer may wish to manage its own slots (so
that it can more easily track when a failover has occurred and whether
it is ahead of or behind the WAL position on the new primary).
Additionally tooling like pglogical actually replicates slots to all
replicas so that the current position can be maintained on failover
targets (this also aids consumers by supplying primitives so that they
can verify data hasn't been lost or a split brain occurred relative to
the physical cluster).
To support these use cases this new feature allows configuring Patroni
to entirely ignore sets of slots specified by any subset of name,
database, slot type, and plugin.
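
The matching rule can be sketched as follows: a slot is ignored when it
agrees with every key of at least one configured matcher. The key names
(`name`, `database`, `type`, `plugin`) and the function name are
assumptions of this example.

```python
def slot_is_ignored(slot, matchers):
    """Return True if the slot matches every key of at least one matcher.

    slot and each matcher are dicts; a matcher may use any subset of the
    keys name, database, type and plugin. Empty matchers never match.
    """
    return any(bool(matcher) and
               all(slot.get(key) == value for key, value in matcher.items())
               for matcher in matchers)
```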
* update release notes
* bump version
* change the default alignment in patronictl table output to `left`
* add missing tests
* add missing pieces to the documentation
Call a fencing script after acquiring the leader lock. If the script does not finish successfully, do not promote and remove the leader key instead.
Close https://github.com/zalando/patroni/issues/1567
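The control flow could be sketched like this. The `promote` and `remove_leader_key` callables are hypothetical stand-ins for the corresponding Patroni actions, and the command is given as an argument list.

```python
import subprocess
import sys


def promote_if_fenced(fencing_cmd, promote, remove_leader_key):
    """Run the fencing command; promote only when it exits with status 0."""
    try:
        exit_code = subprocess.call(fencing_cmd)
    except OSError:
        exit_code = -1  # the script could not be started at all
    if exit_code == 0:
        promote()
        return True
    remove_leader_key()  # give up the lock so that another member can take over
    return False


# Stand-ins for a succeeding and a failing fencing script.
ok_script = [sys.executable, "-c", "raise SystemExit(0)"]
failing_script = [sys.executable, "-c", "raise SystemExit(3)"]
```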
When running on K8s, Patroni communicates with the API via the `kubernetes` service, whose address is exposed via the
`KUBERNETES_SERVICE_HOST` environment variable. Like any other service, the `kubernetes` service is handled by `kube-proxy`, which, depending on configuration, relies on either a userspace program or `iptables` for traffic routing.
During K8s upgrade, when master nodes are replaced, it is possible that `kube-proxy` doesn't update the service configuration in time and as a result Patroni fails to update the leader lock and demotes postgres.
In order to improve the user experience and get more control over the problem, we make it possible to bypass the `kubernetes` service and connect directly to API nodes.
The strategy is very simple:
1. Resolve the list of API node IPs from the kubernetes endpoint on every iteration of the HA loop.
2. Stick to one of these IPs for API requests
3. Switch to a different IP if the currently used IP is not in the list
4. If the request fails, switch to another IP and retry
Such a strategy is already used for Etcd and proven to work quite well.
To enable the feature, set either `kubernetes.bypass_api_service` in the Patroni configuration file or the `PATRONI_KUBERNETES_BYPASS_API_SERVICE` environment variable to `true`.
If for some reason `GET /default/endpoints/kubernetes` isn't allowed Patroni will disable the feature.
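The four-step strategy described above can be sketched as a small class. This is only an illustration: the class name, the use of `IOError` as the failure signal, and the `do_request` callable are assumptions, and resolving the endpoint itself is not shown.

```python
import random


class ApiServerSelector(object):
    """Keeps one API server IP and replaces it only when necessary."""

    def __init__(self):
        self.known_ips = []
        self.current = None

    def refresh(self, resolved_ips):
        """Steps 1-3: store the fresh list; re-pick if the current IP left it."""
        self.known_ips = list(resolved_ips)
        if self.current not in self.known_ips:
            self.current = random.choice(self.known_ips) if self.known_ips else None
        return self.current

    def request(self, do_request):
        """Step 4: try the current IP first, then every other known IP."""
        candidates = [self.current] + [ip for ip in self.known_ips if ip != self.current]
        last_error = None
        for ip in candidates:
            if ip is None:
                continue
            try:
                result = do_request(ip)
            except IOError as exc:
                last_error = exc
                continue
            self.current = ip  # remember the IP that worked
            return result
        raise last_error or IOError("no API server IPs known")
```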
The only python-etcd3 client working directly via gRPC still supports only a single endpoint, which is not very nice for high-availability.
Since Patroni is already using a heavily hacked version of python-etcd with smart retries and auto-discovery out-of-the-box, I decided to enhance the existing code with limited support of v3 protocol via gRPC-gateway.
Unfortunately, watches via gRPC-gateway require us to open and keep a second connection to etcd.
Known limitations:
* The minimum supported version is 3.0.4. On earlier versions transactions don't work due to bugs in grpc-gateway. Without transactions we can't do atomic operations, i.e. leader locks.
* Watches work only starting from 3.1.0
* Authentication works only starting from 3.3.0
* gRPC-gateway does not support authentication using TLS Common Name. This is because gRPC-proxy terminates TLS from its client so all the clients share a cert of the proxy: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/authentication.md#using-tls-common-name
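
The version thresholds listed above could be checked with a helper like the one below. It is a sketch only: the function name is hypothetical, and it assumes a plain numeric version string such as `3.1.5`.

```python
def etcd3_features(cluster_version):
    """Map an etcd version string such as '3.1.5' to the usable features."""
    version = tuple(int(part) for part in cluster_version.split(".")[:3])
    if version < (3, 0, 4):
        raise ValueError("etcd {0} is too old: transactions do not work".format(cluster_version))
    return {"transactions": True,
            "watches": version >= (3, 1, 0),
            "authentication": version >= (3, 3, 0)}
```
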
* new node can join the cluster dynamically and become a part of consensus
* it is also possible to join only Patroni cluster (without adding the node to the raft), just comment or remove `raft.self_addr` for that
* when the node joins the cluster it is using values from `raft.partner_addrs` only for initial discovery.
* It is possible to run Patroni and Postgres on two nodes plus one node with `patroni_raft_controller` (without Patroni and Postgres). In such a setup one can temporarily lose one node without affecting the primary.
## Feature: Postgres stop timeout
A switchover/failover operation hangs on the signal_stop (or checkpoint) call when the postmaster doesn't respond or hangs for some reason (issue described in [1371](https://github.com/zalando/patroni/issues/1371)). This leads to service loss for an extended period of time, until the hung postmaster starts responding or is killed by some other actor.
### master_stop_timeout
The number of seconds Patroni is allowed to wait when stopping Postgres. It is effective only when synchronous_mode is enabled. When set to a value > 0 and synchronous_mode is enabled, Patroni sends SIGKILL to the postmaster if the stop operation runs for longer than master_stop_timeout. Set the value according to your durability/availability tradeoff. If the parameter is not set or is set to a value <= 0, master_stop_timeout does not apply.
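
The behavior can be sketched with a `subprocess.Popen` object standing in for the postmaster. This is not how Patroni signals Postgres internally; the function name and the use of `terminate()` as the stop request are assumptions of the example, which also assumes a POSIX system.

```python
import subprocess
import sys


def stop_with_timeout(proc, master_stop_timeout, synchronous_mode):
    """Ask proc to stop; return True if SIGKILL had to be used.

    proc is a subprocess.Popen standing in for the postmaster.
    """
    proc.terminate()  # polite stop request (SIGTERM on POSIX)
    if not (synchronous_mode and master_stop_timeout and master_stop_timeout > 0):
        proc.wait()  # the timeout does not apply: wait for as long as it takes
        return False
    try:
        proc.wait(timeout=master_stop_timeout)
        return False
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()
        return True


# Stand-in for a well-behaved postmaster: a child that just sleeps.
sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
```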
A common problem is that the primary recycles WAL segments while one of the replicas is down for a long time. So far there have been only two solutions to this problem, and neither of them is perfect:
1. Increase `wal_keep_segments`, but it is hard to guess the good value.
2. Use continuous archiving and PITR, but it is not always possible.
This PR introduces a way to solve the problem for static clusters, with a fixed number of nodes and names that never change. You just need to list the names of all nodes in the `slots` section so that the primary will not remove the slot when the node is down (not registered in DCS).
Of course, the primary will not create the permanent slot that matches its own name.
Usage example: let's assume you have a cluster with nodes named *abc1*, *abc2*, and *abc3*.
You have to run `patronictl edit-config` and put the following snippet into the configuration:
```yaml
slots:
  abc1:
    type: physical
  abc2:
    type: physical
  abc3:
    type: physical
```
If the node *abc2* is the primary, it will always create slots for *abc1* and *abc3* even if they are not running, but will not create slot *abc2*.
Other nodes will behave the same.
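The rule from the example above, expressed as a short sketch (the function name is hypothetical, and real slot names may additionally be normalized, which is not shown here):

```python
def physical_slots_to_keep(my_name, permanent_slots, registered_members):
    """Names of the physical slots the primary should have.

    permanent_slots mirrors the `slots` section of the configuration;
    registered_members are the members currently present in DCS.
    """
    wanted = set(registered_members)
    wanted.update(name for name, value in permanent_slots.items()
                  if value.get("type") == "physical")
    wanted.discard(my_name)  # never create a slot matching our own name
    return sorted(wanted)
```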
Close #280
* make it possible to use client certificates with REST API
* define a separate PatroniRequest class which handles all communication
* refactor patronictl to use the new class
* make Ha use the new class instead of calling requests.get. The old call didn't take certificates and basic-auth into account
Close #898
A few times we observed that the Patroni HA loop was blocked for a few minutes due to not being able to write logs to stderr. This is a very rare condition which we have hit so far only on k8s. This commit makes Patroni resilient to this kind of problem. All log messages are first written into an in-memory queue and later asynchronously flushed to stderr or a file from a separate thread.
The maximum queue size is configurable and the default value is 1000. This should be enough to keep more than one hour of log messages with default settings and when Patroni cluster operates normally (without big issues).
If the queue reaches its maximum size, further log messages will be discarded until the queue shrinks. The number of discarded messages will be reported in the log later.
In addition to that, the number of non-flushed and discarded messages (if there are any), will be reported via Patroni REST API as:
```json
"logger_queue_size": X,
"logger_records_lost": Y
```
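The mechanism described above can be sketched with the standard library as follows. This is a simplified illustration, not the code from the commit: the class name and its methods are hypothetical, and reporting the number of lost records into the log itself is omitted.

```python
import logging
import queue
import threading


class QueueingHandler(logging.Handler):
    """Buffers records in memory; a separate thread passes them to `target`."""

    def __init__(self, target, max_queue_size=1000):
        super().__init__()
        self.target = target
        self.records = queue.Queue(max_queue_size)
        self.records_lost = 0
        self._thread = threading.Thread(target=self._flush_loop, daemon=True)

    def start(self):
        self._thread.start()

    def emit(self, record):
        try:
            self.records.put_nowait(record)  # never block the caller
        except queue.Full:
            self.records_lost += 1  # discard, but remember how many were lost

    def _flush_loop(self):
        while True:
            record = self.records.get()
            if record is None:  # sentinel: stop the thread
                break
            self.target.handle(record)

    def stop(self):
        self.records.put(None)
        self._thread.join()

    def status(self):
        return {"logger_queue_size": self.records.qsize(),
                "logger_records_lost": self.records_lost}
```

Here `target` would be an ordinary handler such as `logging.StreamHandler()`, and `status()` returns the two values in the same shape as the REST API fragment shown above.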
* expose the current patroni version in DCS
* expose `checkpoint_after_promote` flag in DCS as an indicator that pg_rewind could be safely executed
* other nodes will wait until this flag is set instead of connecting as superuser and issuing the CHECKPOINT
* define `postgresql.authentication.rewind` with credentials for pg_rewind in patroni configuration files.
* create user for pg_rewind if postgres is 11+
* grant execute on functions required for pg_rewind to rewind user
This functionality works similarly to the `pg_hba`:
If the `postgresql.pg_ident` is defined in the config file or DCS, Patroni will write its value to pg_ident.conf, however, if `postgresql.parameters.ident_file` is defined, Patroni will assume that pg_ident is managed from outside and not update the file.
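The decision can be summarized in a few lines. This sketch assumes that `postgresql.pg_ident` is a list of lines, as is the case for `pg_hba`; the function name is hypothetical.

```python
def pg_ident_lines(postgresql_config):
    """Return the lines to write to pg_ident.conf, or None to leave it alone."""
    parameters = postgresql_config.get("parameters") or {}
    if parameters.get("ident_file"):
        return None  # the file is managed from outside
    lines = postgresql_config.get("pg_ident")
    return list(lines) if lines else None
```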
First of all, this patch changes the behavior of the `on_start`/`on_restart` callbacks: they will be called only when postgres is started or restarted without a role change. If the member is promoted or demoted, only the `on_role_change` callback will be executed.
Before that `on_role_change` was never called for standby leader, only `on_start`/`on_restart` and with a wrong role argument.
In addition to that, the REST API will return standby_leader role for the leader of the standby cluster.
Closes https://github.com/zalando/patroni/issues/988
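As a simplified sketch of the new rule (the function name is hypothetical, and the real code handles more cases than a plain comparison of roles):

```python
def callback_to_run(action, old_role, new_role):
    """Pick the callback name for a 'start' or 'restart' action."""
    if old_role != new_role:
        return "on_role_change"  # promotion or demotion
    return "on_" + action  # on_start / on_restart, role unchanged
```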
If `etcd.use_proxies` is set to true, Patroni will stick to the list of hosts specified in `etcd.hosts` and avoid doing topology discovery. This mode might be useful when you know that you connect to the etcd cluster via a set of proxies or when the etcd cluster has a static topology.
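A minimal sketch of that decision, where `discover` is a hypothetical callable that performs topology discovery starting from the configured hosts:

```python
def machines_to_use(etcd_config, discover):
    """Return the list of etcd endpoints the client should talk to."""
    hosts = etcd_config.get("hosts", [])
    if isinstance(hosts, str):
        hosts = [h.strip() for h in hosts.split(",")]
    if etcd_config.get("use_proxies"):
        return list(hosts)  # stick to the configured hosts, no discovery
    return discover(hosts)
```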