High CPU load on the Etcd nodes and K8s API servers created a very strange situation. A few clusters were running without a leader, and the pod that was ahead of the others was failing to take the leader lock because updates were failing with HTTP response code `409` (`resource_version` mismatch).
Effectively that means the TCP connections to the K8s master nodes were alive (otherwise TCP keepalives would have resolved it), but no `UPDATE` events were arriving via these connections, resulting in a stale in-memory cache of the cluster.
The only good way to prevent this situation is to intercept 409 HTTP responses and terminate existing TCP connections used for watches.
Now a few words about the implementation. Unfortunately, watch threads spend most of their time waiting in the `read()` call and there is no good way to interrupt them. However, `socket.shutdown()` seems to do the job. We already used this trick in the Etcd3 implementation.
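For illustration, here is a minimal standalone sketch of the trick (not Patroni's actual watch code): a thread blocked in `recv()` is unblocked by calling `socket.shutdown()` from another thread.

```python
import socket
import threading

listener = socket.socket()
listener.bind(('127.0.0.1', 0))
listener.listen(1)

client = socket.create_connection(listener.getsockname())
server_side, _ = listener.accept()

def watch():
    # Blocks here, just like a watch thread stuck in read() on the response
    # stream; no data will ever arrive in this example.
    data = client.recv(4096)
    print('recv() returned %r, the watch thread can exit cleanly' % data)

thread = threading.Thread(target=watch)
thread.start()

# Interrupt the blocked read from another thread: recv() returns b'' immediately.
client.shutdown(socket.SHUT_RDWR)
thread.join()
```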
This approach helps to mitigate the issue of not having a leader, but replicas might still end up with a stale cluster state cached and in the worst case will not stream from the leader. Non-streaming replicas are less dangerous and can be covered by monitoring and partially mitigated by correctly configured `archive_command` and `restore_command`.
The current state of permanent logical replication slots on the primary is queried together with `pg_current_wal_lsn()`, and hence they "fail" simultaneously if Postgres isn't yet ready to accept connections; in this case we want to avoid updating the `/status` key altogether.
On K8s we don't use a dedicated object for the `/status` key, but use the same object (Endpoint or ConfigMap) as for the leader. If `last_lsn` isn't set we avoid patching the corresponding annotation, but the `slots` annotation was reset due to an oversight.
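A rough sketch of the intended behavior (the helper and annotation keys are placeholders, not Patroni's exact names): when `last_lsn` could not be obtained, the whole `/status` update is skipped instead of resetting individual annotations.

```python
import json

def build_status_annotations(last_lsn, slots):
    # Postgres isn't accepting connections yet: skip the whole update so that
    # neither the LSN nor the slots annotation gets wiped by accident.
    if last_lsn is None:
        return None
    annotations = {'last_lsn': str(last_lsn)}
    if slots is not None:
        annotations['slots'] = json.dumps(slots)
    return annotations

print(build_status_annotations(None, {'my_slot': 1234}))      # None -> nothing is patched
print(build_status_annotations(67108864, {'my_slot': 1234}))  # both annotations updated
```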
When starting as a replica it may take some time before Postgres starts accepting new connections, but in the meantime the leader could have transitioned to a different member, in which case `primary_conninfo` must be updated.
On Postgres prior to v12, Patroni regularly checks `recovery.conf` in order to verify that recovery parameters match the expectation. Starting from v12, recovery parameters were converted to GUCs and Patroni gets their current values from the `pg_settings` view. The latter creates a problem when it takes more than a minute for Postgres to start accepting new connections.
Since Patroni attempts to execute at least `pg_is_in_recovery()` in every HA loop, and that call raises an exception while Postgres isn't accepting connections, `check_recovery_conf()` effectively wasn't reachable until recovery finished; this changed when #2082 was introduced.
As a result of #2082 we got the following behavior:
1. Up to (but not including) v12 everything was working as expected
2. v12 and v13 - Patroni restarts Postgres after 1 minute of recovery
3. v14+ - `check_recovery_conf()` is not executed because the `replay_paused()` method raises an exception.
In order to properly handle changes of recovery parameters or the leader transitioning to a different node on v12+, we rely on the cached values of recovery parameters until Postgres becomes ready to execute queries.
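A rough sketch of the fallback (method and attribute names are assumptions, not Patroni's internals): refresh the values from `pg_settings` only when the query succeeds, otherwise keep the cached ones.

```python
class RecoveryParams:
    """Keeps the last known recovery parameters for comparison."""

    def __init__(self, params_from_config_files):
        self._cached = dict(params_from_config_files)

    def current(self, query):
        """`query` returns (name, setting) rows from pg_settings or raises."""
        try:
            rows = query("SELECT name, setting FROM pg_catalog.pg_settings"
                         " WHERE name IN ('primary_conninfo', 'primary_slot_name',"
                         " 'restore_command', 'recovery_target_timeline')")
        except Exception:
            # Postgres isn't ready to execute queries yet: keep the cached
            # values read from postgresql.conf/postgresql.auto.conf.
            return self._cached
        self._cached = dict(rows)
        return self._cached
```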
Close https://github.com/zalando/patroni/issues/2289
The logical slot on a replica is safe to use when the replica's physical slot on the primary:
1. has a nonzero/non-null `catalog_xmin`
2. has a `catalog_xmin` that is not newer (greater) than the `catalog_xmin` of any slot on the standby
3. has a `catalog_xmin` that is known to have overtaken the `catalog_xmin` of the logical slots on the primary observed during step 1
If condition 1 doesn't hold, Patroni runs an additional check of whether `hot_standby_feedback` is actually in effect and shows a warning if it is not.
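A heavily simplified sketch of the three conditions (made-up variable names, xids treated as plain increasing integers; the real check has to deal with xid wraparound):

```python
def logical_slots_safe(physical_catalog_xmin,
                       standby_catalog_xmins,
                       primary_logical_catalog_xmins_at_step_1):
    if not physical_catalog_xmin:                                      # condition 1
        return False
    if any(physical_catalog_xmin > x for x in standby_catalog_xmins):  # condition 2
        return False
    return all(physical_catalog_xmin >= x                              # condition 3
               for x in primary_logical_catalog_xmins_at_step_1)
```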
Since Kubernetes v1.21, with the projected service account token feature, service account tokens expire after 1 hour. Kubernetes clients are expected to re-read the token file to refresh the token.
This patch re-reads the token file every minute for the in-cluster config.
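A minimal sketch of the idea, assuming the standard in-cluster token path and a hard-coded one-minute interval (not the exact patch):

```python
import time

TOKEN_PATH = '/var/run/secrets/kubernetes.io/serviceaccount/token'
TOKEN_REFRESH_INTERVAL = 60  # seconds

class TokenCache:
    def __init__(self, path=TOKEN_PATH):
        self._path = path
        self._token = None
        self._read_at = 0.0

    @property
    def token(self):
        # Re-read the projected token file at most once per minute and use
        # the cached value in between.
        now = time.time()
        if self._token is None or now - self._read_at >= TOKEN_REFRESH_INTERVAL:
            with open(self._path) as f:
                self._token = f.read().strip()
            self._read_at = now
        return self._token
```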
Fixes #2286
Signed-off-by: Haitao Li <hli@atlassian.com>
A couple of times we have seen in the wild that the database for the permanent logical slots was changed in the Patroni config.
It resulted in the following situation.
On the primary:
1. The slot must be dropped before creating it in a different DB.
2. Patroni fails to drop it because the slot is in use.
On a replica:
1. Patroni notices that the slot exists in the wrong DB and successfully drops it.
2. Patroni copies the existing slot from the primary by its name, which requires a Postgres restart.
The loop repeats as long as the "wrong" slot exists on the primary.
Basically, replicas are continuously restarting, which badly affects availability.
In order to solve the problem, we will perform additional checks while copying replication slot files from the primary and discard them if `slot_type`, `database`, or `plugin` don't match our expectations.
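An illustrative sketch of the check (the helper is made up; the field names follow `pg_replication_slots`):

```python
def copied_slot_matches(copied, expected):
    """Return True if the slot file copied from the primary may be kept."""
    return all(copied.get(key) == expected.get(key)
               for key in ('slot_type', 'database', 'plugin'))

copied = {'slot_type': 'logical', 'database': 'olddb', 'plugin': 'pgoutput'}
expected = {'slot_type': 'logical', 'database': 'newdb', 'plugin': 'pgoutput'}
assert not copied_slot_matches(copied, expected)  # wrong database -> discard the copy
```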
On Debian/Ubuntu systems it is common to keep Postgres config files outside of the data directory.
It created a couple of problems for pg_rewind support in Patroni.
1. The `--config_file` argument must be supplied when figuring out the `restore_command` GUC value on Postgres v12+
2. With Postgres v13+, pg_rewind by itself can't find postgresql.conf in order to figure out `restore_command`, and therefore we have to use Patroni as a fallback for fetching missing WALs that are required for rewind.
This commit addresses both problems.
In case of DCS unavailability Patroni restarts Postgres in read-only mode.
This causes pg_control to be updated with `Database cluster state: in archive recovery` and could also set the `MinRecoveryPoint`.
The next time Patroni is started it will assume that Postgres was running as a replica and a rewind isn't required, and it will try to start Postgres. In this situation there is a chance that the start will be aborted with a FATAL error message like `requested timeline 2 does not contain minimum recovery point 0/501E8B8 on timeline 1`.
On the next heartbeat Patroni will again notice that Postgres isn't running, which leads to another failed start attempt.
This loop is endless.
In order to mitigate the problem we do the following (a minimal sketch follows the list):
1. While figuring out whether the rewind is required we consider `in archive recovery` along with `shut down in recovery`.
2. If pg_rewind is required and the cluster state is `in archive recovery`, we also perform recovery in single-user mode.
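A minimal sketch of the state check, using `pg_controldata` output for simplicity (simplified compared to Patroni's actual pg_control handling):

```python
import subprocess

def data_directory_state(data_dir):
    out = subprocess.check_output(['pg_controldata', data_dir]).decode()
    for line in out.splitlines():
        if line.startswith('Database cluster state:'):
            return line.split(':', 1)[1].strip()
    return None

state = data_directory_state('/var/lib/postgresql/data')
# Both states now trigger the rewind check; for 'in archive recovery' a clean
# single-user-mode recovery is performed first.
needs_rewind_check = state in ('shut down in recovery', 'in archive recovery')
```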
Close https://github.com/zalando/patroni/issues/2242
Patroni was falsely assuming that timelines had diverged.
For pg_rewind this didn't create any problem, but if pg_rewind is not allowed and `remove_data_directory_on_diverged_timelines` is set, it resulted in reinitializing the former leader.
Close https://github.com/zalando/patroni/issues/2220
This allows having multiple hosts in a `standby_cluster` and ensures that the standby leader follows the main cluster's new leader after a switchover.
Partially addresses #2189
This PR adds metrics for additional information:
- whether a node or cluster is pending restart,
- whether the cluster management is paused.
This may be useful for Prometheus/Grafana monitoring.
Close #2198
When switching certificates there is a race condition with a concurrent API request: if one is active during the replacement period, the replacement errors out with a "port in use" error and Patroni gets stuck in a state without an active API server.
The fix is to call `server_close()` after `shutdown()`, which waits for already running requests to complete before returning.
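A minimal sketch of the ordering with the stdlib threading HTTP server, which Patroni's REST API builds on (simplified; port and handler are placeholders):

```python
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler
import threading

server = ThreadingHTTPServer(('127.0.0.1', 0), BaseHTTPRequestHandler)
thread = threading.Thread(target=server.serve_forever)
thread.start()

# Certificate replacement: stop accepting new requests first, then wait for the
# in-flight ones and release the listening port before binding the new server.
server.shutdown()      # stops the serve_forever() loop
server.server_close()  # joins request threads and closes the listening socket
thread.join()
```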
Close #2184
It could be that `remove_data_directory_on_diverged_timelines` is set, but there is no `rewind_credentials` defined and superuser access between nodes is not allowed.
Close https://github.com/zalando/patroni/issues/2162
- Simplify setup.py: remove unneeded features and get rid of deprecation warnings
- Compatibility with Python 3.10: handle `threading.Event.isSet()` deprecation
- Make sure setup.py can run without `six`: move the Patroni class and the main function to `__main__.py`. The `__init__.py` keeps only a few functions used by the Patroni class and by setup.py
When figuring out which slots should be created on a cascading standby, we forgot to take into account that the leader might be absent.
Close: https://github.com/zalando/patroni/issues/2137
When starting Postgres after bootstrap of the standby leader, the `follow()` method used to always return `True`.
This behavior was changed in #2054 in order to avoid hammering the logs if Postgres fails to start.
Since the method now returns `None` if Postgres didn't start accepting connections within 60s, the change broke the standby-leader bootstrap code.
As a solution, we assume that the clone was successful if the `follow()` method returned anything other than `False`.
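A tiny sketch of the tri-state handling (names are illustrative):

```python
def clone_successful(follow_result):
    # follow() may return True (started), None (not accepting connections
    # within 60s yet), or False (hard failure); only False means the clone failed.
    return follow_result is not False

assert clone_successful(True)
assert clone_successful(None)        # still considered successful
assert not clone_successful(False)
```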
- Remove the Codacy steps: they removed legacy organizations and there seems to be no easy way of installing the Codacy app for the Zalando GitHub organization.
- Don't run behave on macOS: recently the workers became way too slow
- Disable behave for the combination of Kubernetes and Python 2.7
- Remove Python 3.5 (it will be removed by GH from the workers in January) and add 3.10
- Run behave with 3.6 and 3.9 instead of 3.5 and 3.8
When `restore_command` is configured, Postgres tries to fetch/apply all possible WAL segments and also fetches history files in order to select the correct timeline. This could result in a situation where the new history file is missing some timelines.
Example:
- node1 demotes/crashes on timeline 1
- node2 promotes to timeline 2 and archives `00000002.history` and crashes
- node1 recovers as a replica, "replays" `00000002.history` and promotes to timeline 3
As a result, `00000003.history` will not have a line for timeline 2, because node1 never replayed any WAL segment from it.
The `pg_rewind` tool is supposed to handle such a case correctly when rewinding node2 from node1, but when deciding whether a rewind should happen, Patroni was searching for the exact timeline in the history file from the new primary.
The solution is to assume that a rewind is required if the current replica timeline is missing from that history file.
In addition, this PR makes sure that the primary isn't running in recovery before starting the rewind check procedure.
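An illustrative sketch of the new decision (not Patroni's actual code; it assumes the replica's timeline is already known to be lower than the primary's):

```python
def rewind_required(history_file_content, replica_timeline):
    listed = set()
    for line in history_file_content.splitlines():
        fields = line.split('\t')  # "parentTLI <TAB> switchpoint <TAB> reason"
        if fields and fields[0].strip().isdigit():
            listed.add(int(fields[0]))
    # If our timeline isn't mentioned at all, we can't prove the histories
    # match, so assume a rewind is required.
    return replica_timeline not in listed

# The 00000003.history from the example above only mentions timeline 1:
history = '1\t0/5000000\tno recovery target specified\n'
assert rewind_required(history, 2)  # timeline 2 is missing -> rewind node2
```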
Close https://github.com/zalando/patroni/issues/2118 and https://github.com/zalando/patroni/issues/2124