If restarted in pause, Patroni was discarding `synchronous_standby_names` from `postgresql.conf` because in the internal cache this value was set to `None`. As a result synchronous replication transitioned to a broken state: no synchronous replicas according to `synchronous_standby_names`, and Patroni not selecting/setting new synchronous replicas (another bug).
To solve the problem of the broken initial state and to avoid similar issues with other GUCs we will read the GUC values if Patroni is joining a running Postgres.
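For illustration, a minimal sketch (not Patroni's actual code) of reading live GUC values from an already running Postgres; the connection string and the helper name are assumptions:

```python
import psycopg2


def read_gucs_from_running_postgres(dsn, names):
    """Return the current values of the given GUCs as reported by the server.

    The point is to seed the internal cache from the running instance
    instead of assuming the values are None.
    """
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute('SELECT name, setting FROM pg_catalog.pg_settings'
                        ' WHERE name = ANY(%s)', (list(names),))
            return dict(cur.fetchall())


# e.g.: read_gucs_from_running_postgres('dbname=postgres', ['synchronous_standby_names'])
```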
Patroni changes `synchronous_standby_names` and the `/sync` key in a very specific order: first we add nodes to `synchronous_standby_names`, and only after they are recognized as synchronous are they added to the `/sync` key. When removing nodes the order is reversed: they are first removed from the `/sync` key and only after that from `synchronous_standby_names`.
As a result Patroni expects that either the actual synchronous nodes match the nodes listed in the `/sync` key, or the new candidates for synchronous nodes do not match the nodes listed in the `/sync` key. If `synchronous_standby_names` was removed from `postgresql.conf`, either manually or due to the bug (#2876), the state becomes inconsistent because of the wrong order of updates.
To resolve the inconsistent state we introduce additional checks and update the `/sync` key with the actual names of synchronous nodes (usually an empty set).
Patroni doesn't watch all changes of member keys, in order to not create too much load on ZooKeeper, but only subscribes to changes (ZNodes added or deleted) in the `/member` directory. Therefore, when some important fields in the value are updated, we remove and recreate the ZNode in order to notify the leader or other members.
The leader should remove (and recreate) the member key only when the `checkpoint_after_promote` value changes, and replicas only when the `state` changes to/from `running`.
We don't care about the `version` field, because the Patroni version can't change without a restart, which will cause the ZooKeeper `session_id` to change anyway.
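A minimal sketch of this notification trick with `kazoo`; the function and parameter names are illustrative, not Patroni's API:

```python
from kazoo.client import KazooClient


def touch_member(zk: KazooClient, member_path: str, new_value: bytes) -> None:
    """Recreate the member ZNode so that watchers of the parent directory are notified.

    A plain set() would not fire the children watch on the member directory,
    so the ZNode is deleted and created again with the updated value.
    """
    if zk.exists(member_path):
        zk.delete(member_path)
    zk.create(member_path, new_value, ephemeral=True)
```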
This fix will hopefully reduce failures of behave tests on GH Actions.
Despite being validated by `IntValidator`, some GUCs couldn't be cast directly to `int` because they include a suffix, for example `128MB`.
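A minimal sketch of the kind of conversion that is needed; the helper name and the unit table are illustrative:

```python
import re

# Memory units understood by Postgres, expressed in bytes (illustrative subset).
MEMORY_UNITS = {'B': 1, 'kB': 1024, 'MB': 1024 ** 2, 'GB': 1024 ** 3, 'TB': 1024 ** 4}


def guc_to_int(value: str) -> int:
    """Convert a GUC value such as '128MB' or '16384' to a plain integer."""
    match = re.match(r'^\s*(\d+)\s*([A-Za-z]*)\s*$', value)
    if not match:
        raise ValueError('not an integer GUC value: {0!r}'.format(value))
    number, unit = match.groups()
    return int(number) * (MEMORY_UNITS[unit] if unit else 1)


assert guc_to_int('128MB') == 128 * 1024 * 1024
```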
Close https://github.com/zalando/patroni/issues/2879
WARNING messages are produced by `urllib3` if Patroni is quickly restarted.
Instead we will check that the node is listening on a given port. This is enough to detect name clashes, while an HTTP request could raise an exception in a few other cases, which might cause false negatives.
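A minimal sketch of such a port check using the standard `socket` module; the function name is illustrative:

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# If the address another member advertises under our name is already
# listening, we can assume that a node with a clashing name is running.
```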
Close https://github.com/zalando/patroni/issues/2881
1. Unit tests should not really try accessing any resources.
2. Accessing external resources results in significantly longer execution time of unit tests on Windows.
In addition to that, perform the request with a 3s timeout. Usually this is more than enough to figure out whether the resource is accessible.
Followup on #2724
As reported by @ants on Slack, it could happen that `received_tli` is ahead of the replayed timeline; therefore, when Postgres is running, we should stop using it when deciding on pg_rewind and rely only on `IDENTIFY_SYSTEM` via a replication connection.
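For illustration, a hedged sketch of obtaining the timeline with `IDENTIFY_SYSTEM` over a physical replication connection using `psycopg2` (the DSN and function name are placeholders; Patroni's own connection handling differs):

```python
import psycopg2
from psycopg2.extras import PhysicalReplicationConnection


def timeline_via_replication_connection(dsn: str) -> int:
    """Ask the walsender for the current timeline instead of trusting received_tli."""
    conn = psycopg2.connect(dsn, connection_factory=PhysicalReplicationConnection)
    try:
        with conn.cursor() as cur:
            # IDENTIFY_SYSTEM returns (systemid, timeline, xlogpos, dbname)
            cur.execute('IDENTIFY_SYSTEM')
            return int(cur.fetchone()[1])
    finally:
        conn.close()
```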
Prior to this commit `patronictl query` worked only if the `-r` argument was provided to the command. Otherwise it would face issues:
* If neither `-r` nor `-m` were provided:
```
$ PGPASSWORD=zalando patronictl -c postgres0.yml query -U postgres -c "SHOW PORT"
2023-09-12 17:45:38 No connection to role=None is available
```
* If only `-m` was provided:
```
$ PGPASSWORD=zalando patronictl -c postgres0.yml query -U postgres -c "SHOW PORT" -m postgresql0
2023-09-12 17:46:15 No connection to member postgresql0 is available
```
This issue was a regression introduced by `4c3e0b9382820524239d2aa4d6b95379ef1291db` through PR #2687.
Through that PR we decided to move the common logic used to check mutual exclusiveness of the `--role` and `--member` arguments to the `get_any_member` function.
However, prior to that change the `role` variable would assume the default value of `any` in the `query` method before `get_any_member` was called, which was no longer the case after the change.
This commit fixes that issue by adding handling for `role=None` in the `get_cursor` function. As `role` defaulting to `any` is handled in a sub-call to `get_any_member`, it is safe for `get_cursor` to return the cursor when `role=None`.
Unit tests were updated accordingly.
References: PAT-204.
Due to a race condition Patroni was falsely assuming that the standby should be restarted because some recovery parameters (primary_conninfo or similar) were changed.
Close https://github.com/zalando/patroni/issues/2834
When running with the leader lock Patroni was just setting the `role` label to `master`, so the `kubernetes.standby_leader_label_value` feature effectively never worked.
Now it is fixed, but in order to not introduce breaking changes we set the default value of `standby_leader_label_value` to `master`.
Cluster.get_replication_slots() didn't take into account that there cannot be logical replication slots on replicas of a standby cluster. It was only skipping logical slots for the standby_leader, while replicas were expecting that they would have to copy them over.
Also, on replicas in a standby cluster these logical slots were falsely added to the `_replication_slots` dict.
Postgres supports two permission modes for the data directory:
1. owner only
2. group readable
By default the first one is used because it provides better security. But sometimes people want to run a backup tool under a user different from postgres, and in this case the second option becomes very useful. Unfortunately it didn't work correctly because Patroni was creating files with owner-only access permissions.
This PR changes the behavior: permissions on files and directories created by Patroni will be calculated based on the permissions of PGDATA, i.e. they will get group read access when necessary.
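A simplified sketch of how such permission calculation could look; the helper names are illustrative and the real implementation handles more cases:

```python
import os
import stat


def file_mode_for_pgdata(pgdata: str) -> int:
    """Return 0o640 if PGDATA is group readable, otherwise 0o600."""
    pgdata_mode = stat.S_IMODE(os.stat(pgdata).st_mode)
    return 0o640 if pgdata_mode & stat.S_IRGRP else 0o600


def write_config_file(pgdata: str, path: str, data: bytes) -> None:
    """Create a file whose permissions follow the PGDATA permission mode."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, file_mode_for_pgdata(pgdata))
    with os.fdopen(fd, 'wb') as f:
        f.write(data)
```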
Close #1899
Close #1901
In issue #2735 it was discussed that there should be some warning around PostgreSQL parameters that do not pass validation.
This commit ensures something is logged for parameters that fail validation and therefore fall back to default values.
Close #2735
Close #2740
Consider the following situation:
1. node1 is stressed so much that the Patroni heartbeat can't run regularly and the leader lock expires.
2. node2 notices that there is no leader, takes the lock, promotes, and gets into a situation like the one described in 1.
3. Patroni on node1 finally wakes up, notices that Postgres is running as a primary, although without the leader lock, and "happily" acquires the lock.
That is, node1 discarded the promotion of node2, and after that it will not be possible to join node2 back to the cluster, because pg_rewind is not possible when the two nodes are on the same timeline.
To partially mitigate the problem we introduce an additional timeline check. If Postgres is running as a primary, Patroni will consider it a perfect candidate only if its timeline isn't behind the last known cluster timeline recorded in the `/history` key. If the Postgres timeline is behind the cluster timeline, Postgres will be demoted to read-only. Further behavior depends on the `maximum_lag_on_failover` and `check_timeline` settings.
Since the `/history` key isn't updated instantly after promotion, there is still a short period of time when the issue could happen, but it seems that it is close to impossible to make it more reliable.
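A schematic sketch of the extra check; it assumes the `/history` key parses into a list of entries that start with the abandoned timeline number (an illustrative assumption about the format):

```python
def primary_timeline_is_acceptable(postgres_timeline, history_lines):
    """Return False if a running primary is behind the last known cluster timeline."""
    if not history_lines:
        return True
    # The newest cluster timeline is the last abandoned timeline plus one.
    last_cluster_timeline = int(history_lines[-1][0]) + 1
    return postgres_timeline >= last_cluster_timeline
```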
Close https://github.com/zalando/patroni/issues/2779
It was reported in #2779 that the primary was constantly logging messages like `Synchronous replication key updated by someone else`.
It happened after Patroni was stuck due to resource starvation. Key updates are performed using the create_revision/mod_revision field, whose value is taken from the internal cache. Hence, this is a clear symptom of a stale cache.
Similar issues in the K8s implementation were addressed by invalidating the cache and restarting watcher connections every time an update failed due to a resource_version mismatch, so we do the same for Etcd3.
1. Take client certificates only from the `ctl` section. Motivation: sometimes there are server-only certificates that can't be used as client certificates. As a result neither Patroni nor patronictl works correctly, even if the `--insecure` option is used.
2. Document that if `restapi.verify_client` is set to `required` then client certificates **must** be provided in the `ctl` section.
3. Add support for `ctl.authentication` and prefer to use it over `restapi.authentication`.
4. Silence the annoying InsecureRequestWarning when `patronictl -k` is used, so that the behavior becomes similar to `curl -k`.
Includes:
* renaming of `_unready_logical_slots` to better represent that it is a processing queue which is emptied on successful completion.
* ensuring that return type is consistent.
* making logic variable names explicit to help explain how the decision of whether a slot is "ready" is made.
To do that we use the `pg_stat_get_wal_receiver()` function, which is available since 9.6. For older versions the `patronictl list` output and REST API responses remain as before.
If there is no WAL receiver process we check whether `restore_command` is set and show the state as `in archive recovery`.
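A sketch of how the state could be derived on the replica; the exact query and fallbacks in Patroni may differ:

```python
def replication_state(cursor, restore_command):
    """Derive the replication state shown in patronictl list and the REST API."""
    cursor.execute('SELECT status FROM pg_catalog.pg_stat_get_wal_receiver()')
    row = cursor.fetchone()
    if row and row[0] == 'streaming':
        return 'streaming'
    # No (active) wal receiver: report archive recovery only if restore_command is set.
    return 'in archive recovery' if restore_command else None
```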
Example of `patronictl list` output:
```bash
$ patronictl list
+ Cluster: batman -------------+---------+---------------------+----+-----------+
| Member      | Host           | Role    | State               | TL | Lag in MB |
+-------------+----------------+---------+---------------------+----+-----------+
| postgresql0 | 127.0.0.1:5432 | Leader  | running             | 12 |           |
| postgresql1 | 127.0.0.1:5433 | Replica | in archive recovery | 12 |         0 |
+-------------+----------------+---------+---------------------+----+-----------+
$ patronictl list
+ Cluster: batman -------------+---------+-----------+----+-----------+
| Member      | Host           | Role    | State     | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgresql0 | 127.0.0.1:5432 | Leader  | running   | 12 |           |
| postgresql1 | 127.0.0.1:5433 | Replica | streaming | 12 |         0 |
+-------------+----------------+---------+-----------+----+-----------+
```
Example of REST API response:
```bash
$ curl -s localhost:8009 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-07-06 13:12:00.595118+02:00",
  "role": "replica",
  "server_version": 150003,
  "xlog": {
    "received_location": 335544480,
    "replayed_location": 335544480,
    "replayed_timestamp": null,
    "paused": false
  },
  "timeline": 12,
  "replication_state": "in archive recovery",
  "dcs_last_seen": 1688642069,
  "database_system_identifier": "7252327498286490579",
  "patroni": {
    "version": "3.0.3",
    "scope": "batman"
  }
}
$ curl -s localhost:8009 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-07-06 13:12:00.595118+02:00",
  "role": "replica",
  "server_version": 150003,
  "xlog": {
    "received_location": 335544816,
    "replayed_location": 335544816,
    "replayed_timestamp": null,
    "paused": false
  },
  "timeline": 12,
  "replication_state": "streaming",
  "dcs_last_seen": 1688642089,
  "database_system_identifier": "7252327498286490579",
  "patroni": {
    "version": "3.0.3",
    "scope": "batman"
  }
}
```
- make sure that physical replication slots are created even before the promote has happened (when the async executor is busy with the promote).
- execute `txid_current()` with `synchronous_commit=off` so it doesn't accidentally wait for absent synchronous standbys when `synchronous_mode_strict` is enabled and `synchronous_standby_names=*`; these standbys can't connect because the replication slots weren't there yet (see the sketch after this list).
- `synchronous_standby_names` wasn't set to `*` after bootstrap with `synchronous_mode` and `synchronous_mode_strict`.
- add `-c statement_timeout=0` to `PGOPTIONS` when executing `post_bootstrap` script.
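A sketch of the `txid_current()` call from the second item, assuming a non-autocommit `psycopg2`-style cursor so that `SET LOCAL` covers the commit:

```python
def transaction_id_without_sync_wait(cursor):
    """Get a transaction id without waiting for (absent) synchronous standbys.

    With synchronous_mode_strict and synchronous_standby_names='*' the commit
    could otherwise hang while no standby is connected yet.
    """
    cursor.execute("SET LOCAL synchronous_commit TO 'off'")
    cursor.execute('SELECT pg_catalog.txid_current()')
    return cursor.fetchone()[0]
```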
Close https://github.com/zalando/patroni/issues/2738
If we know for sure that a few moments ago Postgres was still running as a primary, and we still hold the leader lock and can successfully update it, then we can safely start Postgres back not in recovery. That allows us to avoid bumping the timeline without a reason and hopefully improves reliability, because it addresses issues similar to #2720.
In addition to that, remove the `if self.state_handler.is_starting()` check from the `recover()` method. This branch could never be reached because the `starting` state is handled earlier in `_run_cycle()`. Besides that, remove the redundant `self._crash_recovery_executed`.
P.S. For now we do not cover cases when Patroni was killed along with Postgres.
Let's consider that we have just started Patroni, there is no leader, and `pg_controldata` reports `Database cluster state` as `shut down`. It feels logical to use `Latest checkpoint location` and `Latest checkpoint's TimeLineID` to do the usual leader race and start directly as a primary, but it could be totally wrong. The thing is that we run `postgres --single` before executing `pg_rewind` if the standby wasn't shut down cleanly. As a result the `Database cluster state` transitions from `in archive recovery` to `shut down`, but if such a node becomes the leader, the timeline must be increased.
When starting, check if a node with the same name is registered in DCS and try to query its REST API.
If the REST API is accessible, exit with an error.
Close #1804
Revert to using `ssl._ssl._test_decode_cert`
A change has been included as part of Patroni 3.0.3 release: use
public functions instead of `ssl._ssl._test_decode_cert` to get
serial number of certificates.
There was a slight bug in that implementation: it was only loading
the certificates through `load_verify_locations`, but missed getting
the certificates back through `get_ca_certs`. As a consequence
Patroni was no longer able to reload the REST API cert on SIGHUP.
An attempt to fix that issue was made through commit
`20f578f09f3aa604e5288710d4fd4e611152ed5f`. However, even with the
correct call of `get_ca_certs`, a corner case was detected where
`load_verify_locations` would skip loading a certificate: if it was
issued with `CA:FALSE`. That essentially means the implementation is
still buggy in that situation. See [CPython](c283a0cff5/Modules/_ssl.c (L4618C14-L4619))
for the underlying problem.
In order to get back a functional implementation again we are reverting
the code to use the private function `ssl._ssl._test_decode_cert`.
We can later study a possible more elegant alternative for solving this,
if any.
---------
Signed-off-by: Israel Barth Rubio <israel.barth@enterprisedb.com>
- the `_in_flight` attribute is accessed from multiple threads and must be protected with a mutex when it is changed (see the sketch after this list)
- allow adding tasks to `_in_flight.group` from the `sync_pg_dist_node()` method when the timeout is reached. Not doing so might result in an indefinite transaction if the REST API request from a worker node failed.
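A minimal sketch of the locking pattern; the class and attribute names are simplified placeholders, not the actual Citus handler:

```python
import threading


class InFlightTracker:
    """Toy illustration: every change of the shared attribute holds the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = None  # task (transaction) currently being executed, if any

    def set_in_flight(self, task):
        with self._lock:
            self._in_flight = task

    def get_in_flight(self):
        with self._lock:
            return self._in_flight
```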
* Use YAML files to validate Postgres GUCs through Patroni.
Patroni used to have a static list of Postgres GUCs validators in
`patroni.postgresql.validator`.
One problem with that approach, for example, is that it would not
allow GUCs from custom Postgres builds to be validated/accepted.
The idea that we had to work around that issue was to move the
validators from the source code to an external and extendable source.
With that Patroni will start reading the current validators from that
external source plus whatever custom validators are found.
From this commit onwards Patroni will read and parse all YAML files
that are found under the `patroni/postgresql/available_parameters`
directory to build its Postgres GUCs validation rules.
All the details about how this works can be found in the docstring
of the introduced function `_load_postgres_gucs_validators`.
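A simplified sketch of what reading such a directory could look like; the real `_load_postgres_gucs_validators` builds validator objects and performs more checks:

```python
import pathlib

import yaml


def load_validator_files(directory):
    """Merge all YAML files from the given directory into one mapping."""
    rules = {}
    for path in sorted(pathlib.Path(directory).glob('*.y*ml')):
        with path.open() as f:
            rules.update(yaml.safe_load(f) or {})
    return rules
```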
When using a custom Postgres distribution it may be the case that the Postgres binaries are compiled with names different from the ones used by the community Postgres distribution.
With that in mind we implemented a new set of settings for Patroni, so the user is able to override the default binary names with custom binary names through the new `postgresql.bin_name` section in the local configuration.
References: PAT-17.
- abstract_main only creates a Config object using the passed configfile
and instantiates the passed daemon class
- common args parser is extracted into a separate func that is called
from daemons' main funcs (specific args can be added afterwards)
Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
Pass a reference to the last known leader object in order to avoid obtaining it from the `AbstractDCS.cluster` cache.
This change is useful for Consul, Etcd3 and Zookeeper implementations.
It didn't take into account the fact that we can get a new lease after changing the TTL. In this case we have to update the existing leader key with the new lease.
To solve it we introduce an 'on transaction failure' callback. Schematically, the whole workflow looks like this:
```python
txn(
    compare=(old_value == self._name),
    success=put(self.leader_path, self._name, self._lease),
    failure=txn(
        compare=(create_revision == '0'),
        success=put(self.leader_path, self._name, self._lease)
    )
)
```
The problem was introduced in d98d6d0b02c9cc67464a7b1b31b1c5570c26e12d
- Always pass etcd3 key revision as a string
- Make sure the leader key isn't unconditionally overwritten. It may happen that the leader heart-beat loop didn't run properly and the session has expired. In this case the leader may create a new session and a new leader key. But, there are chances that the other node already created a leader key and we don't want to overwrite it.
- Try to sync HA loops between nodes by adding 0.5 seconds to timeout on non-leader nodes
- fix --validate-config not to error out if bin_dir is an empty string in the yaml config
- mention bin_dir optionality in the docs
- validate bin_dir even if it is not in the yaml config (add optional
default value for Optional config params in validator)
- make rewind user optional
Make it return the new `SyncState` object in order to avoid reading the new cluster state in `Ha.process_sync_replication()`.
Now it is a small optimization, but it will become very handy in the quorum commit feature.
- added pyrightconfig.json with typeCheckingMode=strict
- added type hints to all files except api.py
- added type stubs for dns, etcd, consul, kazoo, pysyncobj and other modules
- added type stubs for psycopg2 and urllib3 with some little fixes
- fixed most of the issues reported by pyright
- remaining issues will be addressed later, along with enabling CI linting task
The two cases we have in mind are:
* In spite of following all best practices client-side, logical replication connections can sometimes hang the Postgres shutdown sequence. We'd like to SIGTERM any misbehaving logical replication connections which remain after X seconds; these will inevitably get killed anyway on master stop timeout (see the sketch after this list).
* remove the "role=master" label on the current primary when not using k8s as DCS. Waiting until after Postgres fully stops can sometimes be too long for this.
* Pause pgbouncer connections before switchover
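For the first use case, a hedged sketch of what such a callback script could do; the query is illustrative and identifying "misbehaving" connections precisely is left out:

```python
import psycopg2

# Illustrative only: terminate walsenders that serve active logical replication slots.
TERMINATE_LOGICAL_WALSENDERS = """
SELECT pg_terminate_backend(active_pid)
  FROM pg_catalog.pg_replication_slots
 WHERE slot_type = 'logical' AND active_pid IS NOT NULL
"""


def terminate_logical_replication(dsn):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            cur.execute(TERMINATE_LOGICAL_WALSENDERS)
    finally:
        conn.close()
```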
Close #2596