High CPU load on Etcd nodes and K8s API servers created a very strange situation. A few clusters were running without a leader, and the pod that was ahead of the others was failing to take the leader lock because updates were failing with HTTP response code `409` (`resource_version` mismatch).
Effectively that means that TCP connections to K8s master nodes were alive (otherwise TCP keepalives would have resolved the problem), but no `UPDATE` events were arriving via these connections, resulting in a stale in-memory cache of the cluster state.
The only good way to prevent this situation is to intercept 409 HTTP responses and terminate existing TCP connections used for watches.
Now a few words about the implementation. Unfortunately, watch threads spend most of their time waiting in the read() call and there is no good way to interrupt them. However, `socket.shutdown()` seems to do the job; we already use this trick in the Etcd3 implementation.
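As a minimal standalone sketch (not the actual Patroni code), this is how shutting a socket down from another thread unblocks a reader stuck in a blocking read; a socketpair stands in for the watch connection to the API server:

```python
import socket
import threading
import time

# A pair of connected sockets stands in for the watch connection.
reader, writer = socket.socketpair()

def watch_loop(sock):
    while True:
        data = sock.recv(4096)   # watch threads spend most of their time blocked here
        if not data:             # shutdown() makes recv() return b'' immediately
            print('watch connection terminated, thread exits')
            return

thread = threading.Thread(target=watch_loop, args=(reader,))
thread.start()

time.sleep(1)
# Pretend we just intercepted a 409 response: force the blocked recv() to return.
reader.shutdown(socket.SHUT_RDWR)
thread.join()
```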
This approach helps to mitigate the issue of not having a leader, but replicas might still end up with stale cluster state cached and in the worst case will not stream from the leader. Non-streaming replicas are less dangerous: they can be covered by monitoring and partially mitigated by correctly configured `archive_command` and `restore_command`.
When starting as a replica, it may take some time before Postgres begins accepting new connections; meanwhile, the leader could have transitioned to a different member, requiring `primary_conninfo` to be updated.
Before v12, Patroni regularly checks `recovery.conf` in order to verify that recovery parameters match the expectation. Starting from v12, recovery parameters were converted to GUCs and Patroni gets their current values from the `pg_settings` view. The latter creates a problem when it takes Postgres more than a minute to start accepting new connections.
Since Patroni attempts to execute at least `pg_is_in_recovery()` on every HA loop, and that attempt raises an exception while Postgres isn't yet accepting connections, `check_recovery_conf()` was effectively unreachable until recovery finished; this changed when #2082 was introduced.
As a result of #2082 we got the following behavior:
1. Before v12 everything works as expected.
2. v12 and v13 - Patroni restarts Postgres after one minute of recovery.
3. v14+ - `check_recovery_conf()` is not executed because the `replay_paused()` method raises an exception.
In order to properly handle changes of recovery parameters, or a leader transition to a different node, on v12+ we will rely on the cached values of recovery parameters until Postgres becomes ready to execute queries.
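A hedged sketch of this fallback; the method and attribute names are illustrative, not the real Patroni API:

```python
RECOVERY_PARAMS = ('primary_conninfo', 'primary_slot_name',
                   'restore_command', 'recovery_target_timeline')

def effective_recovery_params(postgresql):
    if postgresql.accepts_connections():
        # Postgres is ready: read the authoritative values from pg_settings.
        return postgresql.query_pg_settings(RECOVERY_PARAMS)
    # Postgres is still starting up/recovering: fall back to the values
    # Patroni itself wrote to the configuration on the previous iteration.
    return postgresql.cached_recovery_params
```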
Close https://github.com/zalando/patroni/issues/2289
A logical slot on a replica is safe to use when the physical replication slot used by this replica on the primary:
1. has a nonzero/non-null `catalog_xmin`
2. has a `catalog_xmin` that is not newer (greater) than the `catalog_xmin` of any slot on the standby
3. has a `catalog_xmin` that is known to have overtaken the `catalog_xmin` of the logical slots on the primary observed during step 1
If condition 1 doesn't hold, Patroni runs an additional check of whether `hot_standby_feedback` is actually in effect and shows a warning if it is not.
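The conditions above can be summarized in a small sketch; the helper shape and argument names are illustrative, not the actual Patroni code:

```python
def logical_slot_safe(physical_catalog_xmin, standby_catalog_xmins,
                      primary_logical_catalog_xmin):
    # 1. the physical slot on the primary must have a usable catalog_xmin
    if not physical_catalog_xmin:
        return False
    # 2. it must not be newer (greater) than the catalog_xmin of any slot on the standby
    if any(physical_catalog_xmin > xmin for xmin in standby_catalog_xmins):
        return False
    # 3. it must have overtaken the catalog_xmin of the logical slots on the
    #    primary observed while condition 1 was checked
    return physical_catalog_xmin >= primary_logical_catalog_xmin
```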
Since Kubernetes v1.21, with the projected service account token feature, service account tokens expire in 1 hour. Kubernetes clients are expected to re-read the token file to refresh the token.
This patch re-reads the token file every minute for the in-cluster config.
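A minimal sketch of the idea, assuming the standard in-cluster token path; the class and constant names here are illustrative:

```python
import time

TOKEN_PATH = '/var/run/secrets/kubernetes.io/serviceaccount/token'
TOKEN_REFRESH_INTERVAL = 60  # seconds

class ServiceAccountToken:

    def __init__(self, path=TOKEN_PATH):
        self._path = path
        self._token = None
        self._loaded_at = 0

    def get(self):
        now = time.monotonic()
        if self._token is None or now - self._loaded_at >= TOKEN_REFRESH_INTERVAL:
            # Re-read the projected token file so a rotated token is picked up.
            with open(self._path) as f:
                self._token = f.read().strip()
            self._loaded_at = now
        return self._token
```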
Fixes #2286
Signed-off-by: Haitao Li <hli@atlassian.com>
A couple of times we have seen in the wild that the database for the permanent logical slots was changed in the Patroni config.
It resulted in the following situation.
On the primary:
1. The slot must be dropped before creating it in a different DB.
2. Patroni fails to drop it because the slot is in use.
On a replica:
1. Patroni notices that the slot exists in the wrong DB and successfully drops it.
2. Patroni copies the existing slot from the primary by its name, which involves a Postgres restart.
The loop repeats as long as the "wrong" slot exists on the primary.
Basically, replicas are continuously restarting, which badly affects availability.
In order to solve the problem, we will perform additional checks while copying replication slot files from the primary and discard them if `slot_type`, `database`, or `plugin` don't match our expectations.
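A hedged sketch of that validation; the helper is illustrative, while the field names mirror the replication slot attributes:

```python
def copied_slot_matches(copied_slot, expected_slot):
    """Return True only if the slot copied from the primary is the one we expect."""
    return all(copied_slot.get(key) == expected_slot.get(key)
               for key in ('slot_type', 'database', 'plugin'))

# Usage: when the check fails, the copied slot file is discarded rather than
# installed on the replica, which breaks the restart loop described above.
```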
In case of DCS unavailability, Patroni restarts Postgres in read-only mode.
This causes pg_control to be updated with `Database cluster state: in archive recovery` and could also set the `MinRecoveryPoint`.
The next time Patroni is started, it will assume that Postgres was running as a replica, decide that a rewind isn't required, and try to start Postgres up. In this situation there is a chance that the start will be aborted with a FATAL error message like `requested timeline 2 does not contain minimum recovery point 0/501E8B8 on timeline 1`.
On the next heartbeat Patroni will again notice that Postgres isn't running, which leads to another start-and-fail attempt.
This loop is endless.
In order to mitigate the problem we do the following (see the sketch after the list):
1. While figuring out whether a rewind is required, we consider `in archive recovery` along with `shut down in recovery`.
2. If pg_rewind is required and the cluster state is `in archive recovery`, we also perform recovery in single-user mode.
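A simplified sketch of the two rules (not the actual Patroni code):

```python
def rewind_check_needed(cluster_state):
    # Rule 1: "in archive recovery" is treated like "shut down in recovery".
    return cluster_state in ('shut down in recovery', 'in archive recovery')

def single_user_recovery_needed(cluster_state, rewind_required):
    # Rule 2: crash-recover in single-user mode first, so pg_rewind works
    # against a consistent data directory.
    return rewind_required and cluster_state == 'in archive recovery'
```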
Close https://github.com/zalando/patroni/issues/2242
Patroni was falsely assuming that timelines had diverged.
For pg_rewind this didn't create any problem, but if pg_rewind is not allowed and `remove_data_directory_on_diverged_timelines` is set, it resulted in reinitializing the former leader.
Close https://github.com/zalando/patroni/issues/2220
This makes it possible to have multiple hosts in a `standby_cluster` and ensures that the standby leader follows the main cluster's new leader after a switchover.
Partially addresses #2189
When switching certificates there is a race condition with concurrent API requests. If one is active during the replacement, the replacement errors out with a port-in-use error and Patroni gets stuck in a state without an active API server.
The fix is to call server_close() after shutdown(), which waits for already running requests to complete before returning.
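A minimal sketch of the restart sequence, with Python's `ThreadingHTTPServer` standing in for Patroni's REST API server (the real classes differ):

```python
import threading
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()

def restart_api_server(old_server, address=('127.0.0.1', 8008)):
    old_server.shutdown()       # stop the serve_forever() loop
    old_server.server_close()   # wait for in-flight requests and release the port
    # Only now is it safe to bind the replacement server to the same address.
    new_server = ThreadingHTTPServer(address, Handler)
    threading.Thread(target=new_server.serve_forever, daemon=True).start()
    return new_server

server = ThreadingHTTPServer(('127.0.0.1', 8008), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
server = restart_api_server(server)   # e.g. after new certificates were loaded
```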
Close #2184
- Simplify setup.py: remove unneeded features and get rid of deprecation warnings
- Compatibility with Python 3.10: handle `threading.Event.isSet()` deprecation
- Make sure setup.py can run without `six`: move the Patroni class and the main function to `__main__.py`. The `__init__.py` will keep only a few functions used by the Patroni class and by setup.py
When `restore_command` is configured, Postgres tries to fetch and apply all available WAL segments and also fetches history files in order to select the correct timeline. This could result in a situation where the new history file is missing some timelines.
Example:
- node1 demotes/crashes on timeline 1
- node2 promotes to timeline 2 and archives `00000002.history` and crashes
- node1 recovers as a replica, "replays" `00000002.history` and promotes to timeline 3
As a result, `00000003.history` will not have a line for timeline 2, because node1 never replayed any WAL segment from it.
The `pg_rewind` tool is supposed to correctly handle such a case when rewinding node2 from node1, but when deciding whether a rewind should happen, Patroni was searching for the exact timeline in the history file from the new primary.
The solution is to assume that a rewind is required if the current replica timeline is missing from that history file.
In addition, this PR makes sure that the primary isn't running in recovery before starting the rewind check.
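A hedged sketch of the new decision; the history-file parsing is simplified and the actual implementation is more involved (it also looks at switchpoint LSNs):

```python
def parse_history_file(content):
    """Map timeline id -> switchpoint LSN from a Postgres .history file."""
    timelines = {}
    for line in content.splitlines():
        fields = line.split('\t')
        if len(fields) >= 2 and fields[0].strip().isdigit():
            timelines[int(fields[0])] = fields[1].strip()
    return timelines

def rewind_required(replica_timeline, primary_history_content):
    timelines = parse_history_file(primary_history_content)
    # If the replica's current timeline does not appear in the history file of
    # the new primary at all, we cannot prove the histories are compatible,
    # so assume divergence and rewind.
    return replica_timeline not in timelines
```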
Close https://github.com/zalando/patroni/issues/2118 and https://github.com/zalando/patroni/issues/2124
1. Avoid doing CHECKPOINT if `pg_control` is already updated.
2. Explicitly call ensure_checkpoint_after_promote() right after the bootstrap finished successfully.
When deciding whether the ZNode should be updated we rely on the cached version of the cluster, which is updated only when member ZNodes are deleted/created or the `/status`, `/sync`, `/failover`, `/config`, or `/history` ZNodes are updated.
That is, after an update of the current member ZNode succeeds, the cache becomes stale and all further updates are always performed even if the value didn't change. To solve this, we introduce a new attribute in the Zookeeper class and use it to memorize the actual value for later comparison.
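An illustrative sketch of the comparison; the attribute and method names are hypothetical, not the real `patroni.dcs.zookeeper` code:

```python
class MemberDataCache:

    def __init__(self):
        self._last_member_data = None   # the new attribute memorizing the last written value

    def touch_member(self, client, path, data):
        if data == self._last_member_data:
            return                      # nothing changed since the last write, skip the update
        client.set(path, data)          # kazoo KazooClient.set()
        self._last_member_data = data
```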
In some extreme cases Postgres could be so slow that the normal monitoring query doesn't finish within a few seconds. This results in an exception being raised from the `Postgresql._cluster_info_state_get()` method, which could lead to a situation where Postgres isn't demoted on time.
To make this reliable we will catch the exception and use the cached state of Postgres (`is_running()` and `role`) to determine whether Postgres is running as a primary.
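A hedged sketch of the fallback; the attribute names are illustrative:

```python
def postgres_is_primary(postgresql):
    try:
        # The normal monitoring query; may time out when Postgres is overloaded.
        return not postgresql.is_in_recovery()
    except Exception:
        # Fall back to the cached state: the process is running and the
        # last known role was primary.
        return postgresql.is_running() and postgresql.role == 'master'
```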
Close https://github.com/zalando/patroni/issues/2073
While demoting due to a failure to update the leader lock, it could happen that the DCS goes completely down and the get_cluster() call raises an exception.
If not handled properly, this results in Postgres remaining stopped until the DCS recovers.
For various reasons, WAL archiving on the primary can get stuck or be significantly delayed. If we try to do a switchover or shut the primary down, the shutdown will take forever and will not finish until the whole backlog of WALs is processed.
In the meantime, Patroni keeps updating the leader lock, which prevents other nodes from starting the leader race even if it is known that they received/applied all changes.
The `Database cluster state:` is changed to `"shut down"` after:
- all data is fsynced to disk and the latest checkpoint is written to WAL
- all streaming replicas confirmed that they received all changes (including the latest checkpoint)
At the same time, the archiver process continues to do its job and the postmaster process is still running.
In order to solve this problem and make the switchover faster and more reliable when `archive_command` is slow or failing, Patroni will remove the leader key immediately after `pg_controldata` starts reporting PGDATA as cleanly `"shut down"` and it has verified that at least one replica received all changes. If no replica fulfills this condition, the leader key isn't removed and the old behavior is retained, i.e. Patroni keeps updating it.
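A simplified sketch of the shutdown-time decision; the helper and field names are illustrative:

```python
def may_remove_leader_key(controldata, shutdown_checkpoint_lsn, replicas):
    if controldata.get('Database cluster state') != 'shut down':
        return False   # not (yet) a clean shutdown
    # At least one replica must have confirmed receiving everything,
    # including the shutdown checkpoint; otherwise keep the old behavior
    # and continue updating the leader key.
    return any(replica.received_lsn >= shutdown_checkpoint_lsn for replica in replicas)
```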
The new `dcs_last_seen` field reports the last time (as a unix epoch) a cluster member successfully communicated with the DCS. This is useful to identify and/or analyze network partitions.
Also, expose `dcs_last_seen` in the `MemberStatus` class and its `from_api_response()` method.
Add support for the Etcd SRV name suffix as described in the Etcd docs:
> The -discovery-srv-name flag additionally configures a suffix to the SRV name that is queried during discovery. Use this flag to differentiate between multiple etcd clusters under the same domain. For example, if discovery-srv=example.com and -discovery-srv-name=foo are set, the following DNS SRV queries are made:
>
> _etcd-server-ssl-foo._tcp.example.com
> _etcd-server-foo._tcp.example.com
All tests pass, but this has not been tested on a live Etcd system yet... Please take a look and send feedback.
Resolves #2028
If configured, only IPs that match the rules will be allowed to call unsafe endpoints.
In addition, it is possible to automatically include the IPs of cluster members in the list.
If neither of the above is configured the old behavior is retained.
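A hedged sketch of the check, assuming the allowlist contains IPs or CIDR networks; the names here are illustrative, not the actual configuration keys:

```python
import ipaddress

def caller_is_allowed(remote_ip, allowlist, member_ips, include_members):
    if not allowlist and not include_members:
        return True    # nothing configured: keep the old behavior
    networks = [ipaddress.ip_network(entry) for entry in allowlist]
    if include_members:
        # A bare member IP is interpreted as a /32 (or /128) network.
        networks += [ipaddress.ip_network(ip) for ip in member_ips]
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in networks)
```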
Partially address https://github.com/zalando/patroni/issues/1734
- Resolve Node IP for every connection attempt
- Handle exception with connection failures due to failed resolve
- Set PySyncObj DNS Cache timeouts aligned with `loop_wait` and `ttl`
In addition to that, postpone the leader race for freshly started Raft nodes. This helps with the situation when the leader node was alone and demoted Postgres, and after that a replica arrives and quickly takes the leader lock without really performing the leader race.
Close https://github.com/zalando/patroni/issues/1930, https://github.com/zalando/patroni/issues/1931
#1527 introduced a feature of updating `/optime/leader` with the location of the last checkpoint after Postgres was shut down cleanly.
If WAL archiving is enabled, Postgres always switches the WAL file before writing the shutdown checkpoint record. Normally this is not an issue, but for databases without much write activity it could lead to a situation where the visible replication lag becomes equal to the size of a single WAL file, even though the previous WAL file is mostly empty and contains only a few records.
Therefore it should be safe to report the LSN of the SWITCH record before the shutdown checkpoint.
In order to do that, Patroni first gets the output of pg_controldata and, based on it, calls pg_waldump two times:
* The first call reads the checkpoint record (and verifies that this is really the shutdown checkpoint).
* The next call reads the previous record; if it is an 'xlog switch' (for 9.3 and 9.4) or 'SWITCH' (for 9.5+) record, the LSN of the SWITCH record is written to `/optime/leader`.
In case of any mismatch, or a failure to call pg_waldump or parse its output, the old behavior is retained, i.e. the `Latest checkpoint location` from pg_controldata is used.
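A strongly simplified sketch of the pg_waldump part (the tool is called pg_xlogdump before Postgres 10, the record descriptions in the comments vary between versions, and the real parsing is more involved):

```python
import re
import subprocess

def read_one_record(data_dir, timeline, lsn):
    """Return the pg_waldump text for the single record starting at the given LSN."""
    result = subprocess.run(
        ['pg_waldump', '-p', data_dir, '-t', str(timeline), '-s', lsn, '-n', '1'],
        capture_output=True, text=True, check=True)
    return result.stdout

def switch_lsn_before_shutdown_checkpoint(data_dir, timeline, checkpoint_lsn):
    checkpoint = read_one_record(data_dir, timeline, checkpoint_lsn)
    if 'CHECKPOINT_SHUTDOWN' not in checkpoint:   # not really a shutdown checkpoint
        return None
    prev = re.search(r'prev ([0-9A-Fa-f]+/[0-9A-Fa-f]+)', checkpoint)
    if not prev:
        return None
    previous_record = read_one_record(data_dir, timeline, prev.group(1))
    # 'xlog switch' on 9.3/9.4, 'SWITCH' on 9.5+
    if 'SWITCH' in previous_record or 'xlog switch' in previous_record:
        return prev.group(1)
    return None   # fall back to "Latest checkpoint location"
```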
Close https://github.com/zalando/patroni/issues/1860
Old versions of `kazoo` immediately discarded all requests to Zookeeper if the connection was in the `SUSPENDED` state. This was absolutely fine because Patroni handles retries on its own.
Starting from 2.7, kazoo started queueing such requests instead of discarding them; as a result the Patroni HA loop was getting stuck until the connection to Zookeeper was re-established, and Postgres was not demoted.
In order to return to the old behavior we override the `KazooClient._call()` method.
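A simplified sketch of the override (the class name is illustrative and the real code in Patroni differs in details):

```python
from kazoo.client import KazooClient
from kazoo.exceptions import SessionExpiredError
from kazoo.protocol.states import KazooState

class SuspendedAwareKazooClient(KazooClient):

    def _call(self, request, async_object):
        # kazoo >= 2.7 would queue the request while the connection is
        # SUSPENDED; fail fast instead and let Patroni's own retry logic decide.
        if self.state == KazooState.SUSPENDED:
            async_object.set_exception(SessionExpiredError())
            return False
        return super(SuspendedAwareKazooClient, self)._call(request, async_object)
```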
In addition to that, we ensure that the `Postgresql.reset_cluster_info_state()` method is called even if the DCS request failed (the order of calls was changed in #1820).
Close https://github.com/zalando/patroni/issues/1981
When joining an already running Postgres, Patroni ensures that the config files are set according to expectations.
With recovery parameters converted to GUCs in Postgres v12 this became a bit of a problem, because when the `Postgresql` object is created it is not yet known where the given replica is supposed to stream from.
As a result, postgresql.conf was first written without recovery parameters, and on the next run of the HA loop Patroni noticed the inconsistency and updated the config one more time.
For Postgres v12 this is not a big issue, but for v13+ it resulted in an interruption of streaming replication.
PostgreSQL 14 changed the behavior of replicas when certain parameters (for example `max_connections`) are changed (increased): https://github.com/postgres/postgres/commit/15251c0a.
Instead of immediately exiting, Postgres 14 pauses replication and waits for action from the operator.
Since `pg_is_wal_replay_paused()` returning `True` is the only indicator of such a change, Patroni on the replica will call `pg_wal_replay_resume()`, which causes either replication to continue or a shutdown (as before).
So far Patroni has never called `pg_wal_replay_resume()` on its own; therefore, to remain backward compatible, it will call it only for PostgreSQL 14+.
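A hedged sketch of the version-gated call, assuming a psycopg2-style connection (not the actual Patroni method):

```python
def resume_wal_replay_if_needed(connection, server_version):
    # Only PostgreSQL 14+ pauses replay on an incompatible parameter change,
    # so the resume call is issued only there to remain backward compatible.
    if server_version < 140000:
        return
    with connection.cursor() as cursor:
        cursor.execute('SELECT pg_is_wal_replay_paused()')
        if cursor.fetchone()[0]:
            # Either replication continues, or Postgres shuts down as it
            # did before version 14.
            cursor.execute('SELECT pg_wal_replay_resume()')
```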
1. When everything goes normally, only one line is written for every run of the HA loop (see the examples):
```
INFO: no action. I am (postgresql0) the leader with the lock
INFO: no action. I am a secondary (postgresql1) and following a leader (postgresql0)
```
2. The `does not have lock` message became a debug message.
3. The `Lock owner: postgresql0; I am postgresql1` line will be shown only when something doesn't look normal.
Promoting the standby cluster requires updating load-balancer health checks, which is inconvenient and easy to forget.
To solve this, we change the behavior of the `/leader` health-check endpoint: it will return 200 regardless of whether PostgreSQL is running as the primary or as the standby leader.
It could happen that the replica is, for some reason, missing the WAL file required by the replication slot.
The nature of this phenomenon is a bit unclear; it might be that the WAL was recycled shortly before we copied the slot file, but we still need a solution to this problem. If `pg_replication_slot_advance()` fails with the `UndefinedFile` exception ("requested WAL segment pg_wal/... has already been removed"), the logical slot on the replica must be recreated.
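A hedged sketch of the detection, assuming psycopg2; the real recreation logic (copying the slot from the primary again) is more involved:

```python
from psycopg2 import errors

def advance_copied_slot(cursor, slot_name, target_lsn):
    """Return True on success, False when the slot must be recreated on the replica."""
    try:
        cursor.execute('SELECT pg_replication_slot_advance(%s, %s)',
                       (slot_name, target_lsn))
        return True
    except errors.UndefinedFile:
        # "requested WAL segment pg_wal/... has already been removed":
        # the copied slot is unusable; signal the caller to recreate it,
        # i.e. copy the slot from the primary again.
        return False
```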
When `unix_socket_directories` is not known, Patroni was immediately falling back to a TCP connection via localhost.
The bug was introduced in https://github.com/zalando/patroni/pull/1865
Also run Raft behave tests with encryption enabled.
Using the new `pysyncobj` release allowed us to get rid of a lot of hacks that accessed private properties and methods of the parent class, and to reduce the size of `raft.py`.
Close https://github.com/zalando/patroni/issues/1746