When starting as a replica it may take some time before Postgres starts accepting new connections, but meanwhile, it could happen that the leader transitioned to a different member and the `primary_conninfo` must be updated.
On pre v12 Patroni regularly checks `recovery.conf` in order to check that recovery parameters match the expectation. Starting from v12 recovery parameters were converted to GUC's and Patroni gets current values from the `pg_settings` view. The last one creates a problem when it takes more than a minute for Postgres to start accepting new connections.
Since Patroni attempts to execute at least `pg_is_in_recovery()` every HA loop, and it is raising at exception, the `check_recovery_conf()` effectively wasn't reachable until recovery is finished, but it changed when #2082 was introduced.
As a result of #2082 we got the following behavior:
1. Up to v12 (not including) everything was working as expected
2. v12 and v13 - Patroni restarting Postgres after 1m of recovery
3. v14+ - the `check_recovery_conf()` is not executed because the `replay_paused()` method raising an exception.
In order to properly handle changes of recovery parameters or leader transitioned to a different node on v12+, we will rely on the cached values of recovery parameters until Postgres becomes ready to execute queries.
Close https://github.com/zalando/patroni/issues/2289
This allows to have multiple hosts in a standby_cluster and ensures that the standby leader follows the main cluster's new leader after a switchover.
Partially addresses #2189
In some extreme cases Postgres could be so slow that the normal monitoring query doesn't finish in a few seconds. It results in
the exception being raised from the `Postgresql._cluster_info_state_get()` method, which could lead to the situation that postgres isn't demoted on time.
In order to make it reliable we will catch the exception and use the cached state of postgres (`is_running()` and `role`) to determine whether postgres is running as a primary.
Close https://github.com/zalando/patroni/issues/2073
Due to different reasons, it could happen that WAL archiving on the primary stuck or significantly delayed. If we try to do a switchover or shut it down, the shutdown will take forever and will not finish until the whole backlog of WALs is processed.
In the meantime, Patroni keeps updating the leader lock, which prevents other nodes from starting the leader race even if it is known that they received/applied all changes.
The `Database cluster state:` is changed to `"shut down"` after:
- all data is fsynced to disk and the latest checkpoint is written to WAL
- all streaming replicas confirmed that they received all changes (including the latest checkpoint)
- at the same time, the archiver process continues to do its job and the postmaster process is still running.
In order to solve this problem and make the switchover more reliable/fast in a case when `archive_command` is slow/failing, Patroni will remove the leader key immediately after `pg_controldata` started reporting PGDATA as `"shut down"` cleanly and it verified that there is at least one replica that received all changes. If there are no replicas that fulfill the condition the leader key isn't removed and the old behavior is retained, i.e. Patroni will keep updating it.
The #1527 introduced a feature of updating `/optime/leader` with the location of the last checkpoint after the Postgres was shutdown cleanly.
If wal archiving is enabled, Postgres always switching the WAL file before writing the checkpoint shutdown record. Normally it is not an issue, but for databases without too much write activity it could lead to the situation that the visible replication lag becomes equal to the size of a single WAL file. In fact, the previous WAL file is mostly empty and contains only a few records.
Therefore it should be safe to report the LSN of the SWITCH record before the shutdown checkpoint.
In order to do that, Patroni first gets the output of the pg_controldata and based on it calls pg_waldump two times:
* The first call reads the checkpoint record (and verifies that this is really the shutdown checkpoint).
* The next call reads the previous record and in case if it is the 'xlog switch' (for 9.3 and 9.4) or 'SWITCH' (for 9.5+), the LSN
of the SWITCH record is written to the `/optime/leader`.
In case of any mismatch, failure to call pg_waldump or parse its output, the old behavior is retained, i.e. `Latest checkpoint location` from the pg_controldata is used.
Close https://github.com/zalando/patroni/issues/1860
When joining already running Postgres, Patroni ensures that config files are set according to expectations.
With recovery parameters converted to GUCs in Postgres v12 it became a little problem, because when the `Postgresql` object is being created it is not yet known where the given replica is supposed to stream from.
It resulted in postgresql.conf first being written without recovery parameters, and on the next run of HA loop Patroni noticing inconsistencies and updating the config one more time.
For Postgres v12 it is not a big issue, but for v13+ it resulted in interruption of streaming replication.
PostgreSQL 14 changed the behavior of replicas when certain parameters (like for example `max_connections`) are changed (increased): https://github.com/postgres/postgres/commit/15251c0a.
Instead of immediately exiting Postgres 14 pauses replication and waits for actions from the operator.
Since the `pg_is_wal_replay_paused()` returning `True` is the only indicator of such a change, Patroni on the replica will call the `pg_wal_replay_resume()`, which would cause either continue replication or shutdown (like previously).
So far Patroni was never calling `pg_wal_replay_resume()` on its own, therefore, to remain backward compatible it will call it only for PostgreSQL 14+.
When the unix_socket_directories is not known Patroni was immediately going back to tcp connection via the localhost.
The bug was introduced in https://github.com/zalando/patroni/pull/1865
Effectively, this PR consists of a few changes:
1. The easy part:
In case of permanent logical slots are defined in the global configuration, Patroni on the primary will not only create them, but also periodically update DCS with the current values of `confirmed_flush_lsn` for all these slots.
In order to reduce the number of interactions with DCS the new `/status` key was introduced. It will contain the json object with `optime` and `slots` keys. For backward compatibility the `/optime/leader` will be updated if there are members with old Patroni in the cluster.
2. The tricky part:
On replicas that are eligible for a failover, Patroni creates the logical replication slot by copying the slot file from the primary and restarting the replica. In order to copy the slot file Patroni opens a connection to the primary with `rewind` or `superuser` credentials and calls `pg_read_binary_file()` function.
When the logical slot already exists on the replica Patroni periodically calls `pg_replication_slot_advance()` function, which allows moving the slot forward.
3. Additional requirements:
In order to ensure that primary doesn't cleanup tuples from pg_catalog that are required for logical decoding, Patroni enables `hot_standby_feedback` on replicas with logical slots and on cascading replicas if they are used for streaming by replicas with logical slots.
4. When logical slots are copied from to the replica there is a timeframe when it could be not safe to use them after promotion. Right now there is no protection from promoting such a replica. But, Patroni will show the warning with names of the slots that might be not safe to use.
Compatibility.
The `pg_replication_slot_advance()` function is only available starting from PostgreSQL 11. For older Postgres versions Patroni will refuse to create the logical slot on the primary.
The old "permanent slots" feature, which creates logical slots right after promotion and before allowing connections, was removed.
Close: https://github.com/zalando/patroni/issues/1749
This commit makes it possible to configure the maximum lag (`maximum_lag_on_syncnode`) after which Patroni will "demote" the node from synchronous and replace it with another node.
The previous implementation always tried to stick to the same synchronous nodes (even if they are not optimal ones).
When there is no config key in DCS Patroni shouldn't try accessing ignore_slots, otherwise an exception is raised.
In addition to that implement missing unit-tests and fix linting issues in behave tests.
* update release notes
* bump version
* change the default alignment in patronictl table output to `left`
* add missing tests
* add missing pieces to the documentation
So far Patroni was performing a comparison of the old value (in the `pg_settings`) with the new value (from Patroni configuration or from DCS) in order to figure out if reload or restart is required when the parameter has been changed. If the given parameter was missing in the `pg_settings` Patroni was ignoring it and not writing into the `postgresql.conf`.
In case if Postgres is not running, no validation has been performed and parameters and values were written into the config as it is.
It is not a very common mistake, but people tend to mistype parameter names or values.
Also, it happens that some parameters are removed in specific Postgres versions and some new are added (e.g. `checkpoint_segments` replaced with `min_wal_size` and `max_wal_size` in 9.5 or` wal_keep_segments` was replaced with `wal_keep_size` in 13).
Writing nonexistent parameters or invalid values into the `postgresql.conf` makes postgres unstartable.
This change doesn't solve the issue 100%, but at least approaching this goal very close.
Call a fencing script after acquiring the leader lock. If the script didn't finish successfully - don't promote but remove leader key
Close https://github.com/zalando/patroni/issues/1567
For init processes that use a symlinked WAL directory, or use custom scripts that create new tablespaces, these directories should also be renamed after a failed init attempt, as currently the following errors occur if the first init attempt failed, but a second one might succeed:
fixing permissions on existing directory /var/lib/postgresql/data ... ok
initdb: error: directory "/var/lib/postgresql/wal/pg_wal" exists but is not empty
[...]
File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1173, in post_bootstrap
self.cancel_initialization()
File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1168, in cancel_initialization
raise PatroniException('Failed to bootstrap cluster')
patroni.exceptions.PatroniException: 'Failed to bootstrap cluster'
In the remove_data_directory function the same happens for removing the data directory, it seems the same kind of thing should also happen when moving a data directory.
To ensure the data directory can still be used, the symlinks will point to the renamed directories.
The new parameter `synchronous_node_count` is used by Patroni to manage number of synchronous standby databases. It is set to 1 by default. It has no effect when synchronous_mode is set to off. When enabled, Patroni manages precise number of synchronous standby databases based on parameter synchronous_node_count and adjusts the state in DCS & synchronous_standby_names as members join and leave.
This functionality can be further extended to support Priority (FIRST n) based synchronous replication & Quorum (ANY n) based synchronous replication in future.
The only python-etcd3 client working directly via gRPC still supports only a single endpoint, which is not very nice for high-availability.
Since Patroni is already using a heavily hacked version of python-etcd with smart retries and auto-discovery out-of-the-box, I decided to enhance the existing code with limited support of v3 protocol via gRPC-gateway.
Unfortunately, watches via gRPC-gateway requires us to open and keep the second connection to the etcd.
Known limitations:
* The very minimal supported version is 3.0.4. On earlier versions transactions don't work due to bugs in grpc-gateway. Without transactions we can't do atomic operations, i.e. leader locks.
* Watches work only starting from 3.1.0
* Authentication works only starting from 3.3.0
* gRPC-gateway does not support authentication using TLS Common Name. This is because gRPC-proxy terminates TLS from its client so all the clients share a cert of the proxy: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/authentication.md#using-tls-common-name
So far Patroni was parsing `recovery.conf` or querying `pg_settings` in order to get the current values of recovery parameters. On PostgreSQL earlier than 12 it could easily happen that the value of `primary_conninfo` in the `recovery.conf` has nothing to do with reality. Luckily for us, on PostgreSQL 9.6+ there is a `pg_stat_wal_receiver` view, which contains current values of `primary_conninfo` and `primary_slot_name`. The password field is masked through, but this is fine, because authentication happens only during opening the connection. All other parameters we compare as usual.
Another advantage of `pg_stat_wal_recevier` - it contains the current timeline, therefore on 9.6+ we don't need to use the replication connection trick if walreceiver process is alive.
If there is no walreceiver process available or it is not streaming we will stick to old methods.
when Patroni is trying to figure out the necessity of pg_rewind it could write the content history file from the primary into the log. The history file is growing with every failover/switchover and eventually starts taking too many lines in the log, most of them are not so much useful.
Instead of showing the raw data, we will show only 3 lines before the current replica timeline and 2 lines after.
## Feature: Postgres stop timeout
Switchover/Failover operation hangs on signal_stop (or checkpoint) call when postmaster doesn't respond or hangs for some reason(Issue described in [1371](https://github.com/zalando/patroni/issues/1371)). This is leading to service loss for an extended period of time until the hung postmaster starts responding or it is killed by some other actor.
### master_stop_timeout
The number of seconds Patroni is allowed to wait when stopping Postgres and effective only when synchronous_mode is enabled. When set to > 0 and the synchronous_mode is enabled, Patroni sends SIGKILL to the postmaster if the stop operation is running for more than the value set by master_stop_timeout. Set the value according to your durability/availability tradeoff. If the parameter is not set or set <= 0, master_stop_timeout does not apply.
It is a regular issue that primary is recycling WALs when one of the replicas is down for a long time. So far there were only two solutions for such a problem and both of them are not perfect:
1. Increase `wal_keep_segments`, but it is hard to guess the good value.
2. Use continuous archiving and PITR, but it is not always possible.
This PR is introducing the way to solve the problem for static clusters, with a fixed number of nodes and names that never change. You just need to list the names of all nodes in the `slots` so the primary will not remove the slot when the node is down (not registered in DCS).
Of course, the primary will not create the permanent slot which is matching its own name.
Usage example: let's assume you have a cluster with nodes named *abc1*, *abc2*, and *abc3*.
You have to run `patronictl edit-config` and put the following snippet into the configuration:
```yaml
slots:
abc1:
type: physical
abc2:
type: physical
abc3:
type: physical
```
If the node *abc2* is the primary, it will always create slots for *abc1* and *abc3* even if they are not running, but will not create slot *abc2*.
Other nodes will behave the same.
Close#280
Starting from PostgreSQL 12 the following recovery parameters could be changed without restart, but Patroni didn't yet support it:
* archive_cleanup_command
* promote_trigger_file
* recovery_end_command
* recovery_min_apply_delay
In future postgres releases this list will be extended and Patroni will support it automatically.
Previously check_recovery_conf() function was only checking whether primary_conninfo has changed and never taking into account all other recovery parameters.
Fixes https://github.com/zalando/patroni/issues/1201
Not doing so makes it hard to implement callbacks in bash and eventually can lead to the situation when two callbacks are running at the same time. In case if we failed to kill the child process we will still wait for it to finish.
The same problem could happen with custom bootstrap, therefore if we happen to kill the custom bootstrap process we also kill all child subprocesses.
Closes https://github.com/zalando/patroni/issues/1238
Recently it has happened two times when people tried to deploy the new cluster but postgres data directory wasn't empty and also wasn't valid. In this case Patroni was still creating initialize key in DCS and trying to start the postgres up.
Now it will complain about non-empty invalid postgres data directory and exit.
Close https://github.com/zalando/patroni/issues/1216
In addition to that try to protect from the case when some recovery parameters are set in one of included files by explicitly setting their value to an empty string on postgres 12.
Simplifies https://github.com/zalando/patroni/pull/1208
It is possible that some config files are not controlled by Patroni and when somebody is doing reload via REST API or by sending SIGHUP to Patroni process the usual expectation is that postgres will also be reloaded, but it didn't happen when there were no changes in the postgresql section of Patroni config.
For example one might replace ssl_cert_file and ssl_key_file on the filesystem and starting from PostgreSQL 10 it just requires a reload, but Patroni wasn't doing it.
In addition to that fix the issue with handling of `wal_buffers`. The default value depends on `shared_buffers` and `wal_segment_size` and therefore Patroni was exposing pending_restart when the new value in the config was explicitly set to -1 (default).
Close https://github.com/zalando/patroni/issues/1198
* use `SHOW primary_conninfo` instead of parsing config file on pg12
* strip out standby and recovery parameters from postgresql.auto.conf before starting the postgres 12
Patroni config remains backward compatible.
Despite for example `restore_command` converted to a GUC starting from postgresql 12, in the Patroni configuration you can still keep it in the `postgresql.recovery_conf` section.
If you put it into `postgresql.parameters.restore_command`, that will also work, but it is important not to mix both ways:
```yaml
# is OK
postgresql:
parameters:
restore_command: my_restore_command
archive_cleanup_command: my_archive_cleanup_command
# is OK
postgresql:
recovery_conf:
restore_command: my_restore_command
archive_cleanup_command: my_archive_cleanup_command
# is NOT ok
postgresql:
parameters:
restore_command: my_restore_command
recovery_conf:
archive_cleanup_command: my_archive_cleanup_command
```
* Convert postgresql.py into a package
* Factor out cancellable process into a separate class
* Factor out connection handler into a separate class
* Move postmaster into postgresql package
* Factor out pg_rewind into a separate class
* Factor out bootstrap into a separate class
* Factor out slots handler into a separate class
* Factor out postgresql config handler into a separate class
* Move callback_executor into postgresql package
This is just a careful refactoring, without code changes.
* expose the current patroni version in DCS
* expose `checkpoint_after_promote` flag in DCS as an indicator that pg_rewind could be safely executed
* other nodes will wait until this flag is set instead of connecting as superuser and issuing the CHECKPOINT
* define `postgresql.authention.rewind` with credentials for pg_rewind in patroni configuration files.
* create user for pg_rewind if postgres is 11+
* grant execute on functions required for pg_rewind to rewind user
This functionality works similarly to the `pg_hba`:
If the `postgresql.pg_ident` is defined in the config file or DCS, Patroni will write its value to pg_ident.conf, however, if `postgresql.parameters.ident_file` is defined, Patroni will assume that pg_ident is managed from outside and not update the file.
First of all, this patch changes the behavior of `on_start`/`on_restart` callbacks, they will be called only when postgres is started or restarted without role changes. In case if the member is promoted or demoted only the `on_role_change` callback will be executed. `on_role_change` was never called for standby leader, only `on_start`/`on_restart` and with a wrong role argument.
Before that `on_role_change` was never called for standby leader, only `on_start`/`on_restart` and with a wrong role argument.
In addition to that, the REST API will return standby_leader role for the leader of the standby cluster.
Closes https://github.com/zalando/patroni/issues/988