1. Unit tests should not really try to access any external resources.
2. Accessing them results in significantly longer execution time of unit tests on Windows.
In addition to that, perform requests with a 3s timeout; usually this is more than enough to figure out whether a resource is accessible.
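A minimal sketch of such a check, assuming a plain `urllib.request` helper is acceptable (the URL and helper name are placeholders):
```python
import urllib.request


def resource_accessible(url: str, timeout: float = 3.0) -> bool:
    """Return True if *url* answers within *timeout* seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:  # URLError is a subclass of OSError
        return False


# Example: skip network-dependent checks when the resource is unreachable.
RUN_NETWORK_CHECKS = resource_accessible('https://example.com')
```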
Followup on #2724
Expanding on the addition of docstrings in the code, this adds Python module API docs to the Sphinx documentation.
A developer can preview what this might look like by running this locally:
```
tox -m docs
```
The option `-W` is added to the tox env so that warning messages are considered errors.
Adds doc generation using the above method to the test GitHub workflow to catch documentation problems on PRs.
Some docstrings have been reformatted and fixed to resolve errors generated with the above setup.
`patronictl` is implemented using the `click` module, and that module uses the functions' docstrings to build the help text.
As a consequence, the docstring of the `ctl` function was being shown to the user, which doesn't make sense.
This PR fixes that issue by adding a user-friendly description to be shown on `patronictl --help`. We use a `\f` marker to tell `click` where to stop capturing text for the help output.
Note that `patronictl` commands are implemented using the `@ctl.command` decorator and we always provide them with a `help` argument. Therefore none of the subcommands are affected by the aforementioned issue, only the entry point of the CLI.
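A minimal sketch of the technique (the group and command below are illustrative, not the actual Patroni code):
```python
import click


@click.group()
def ctl():
    """A short, user-friendly description shown by ``patronictl --help``.

    \f
    Anything after the form-feed marker above stays in the Python docstring
    but is not rendered by click in the help output.
    """


@ctl.command(help='Help for subcommands is passed explicitly, as in Patroni.')
def status():
    """Internal docstring that click ignores because help= is provided."""


if __name__ == '__main__':
    ctl()
```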
References: PAT-201.
As reported by @ants on Slack, it could happen that `received_tli` is ahead of the replayed timeline. Therefore we should stop using it when deciding on pg_rewind while Postgres is running, and instead rely only on `IDENTIFY_SYSTEM` via a replication connection.
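For reference, the primary timeline can be fetched over a replication connection roughly like this (a sketch using psycopg2; the connection parameters are placeholders):
```python
import psycopg2
from psycopg2.extras import PhysicalReplicationConnection

# Placeholder DSN; a real check would use the member's replication credentials.
conn = psycopg2.connect('host=127.0.0.1 port=5432 user=replicator',
                        connection_factory=PhysicalReplicationConnection)
cur = conn.cursor()
cur.execute('IDENTIFY_SYSTEM')
systemid, timeline, xlogpos, dbname = cur.fetchone()
print('timeline reported by the primary:', timeline)
```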
Prior to this commit `patronictl query` was working only if the `-r` argument was provided to the command. Otherwise it would face issues:
* If neither `-r` nor `-m` were provided:
```
$ PGPASSWORD=zalando patronictl -c postgres0.yml query -U postgres -c "SHOW PORT"
2023-09-12 17:45:38 No connection to role=None is available
```
* If only `-m` was provided:
```
$ PGPASSWORD=zalando patronictl -c postgres0.yml query -U postgres -c "SHOW PORT" -m postgresql0
2023-09-12 17:46:15 No connection to member postgresql0 is available
```
This issue was a regression introduced by commit `4c3e0b9382820524239d2aa4d6b95379ef1291db` through PR #2687.
Through that PR we decided to move the common logic that checks the mutual exclusiveness of the `--role` and `--member` arguments to the `get_any_member` function.
However, before that change the `role` variable would assume the default value of `any` in the `query` method before `get_any_member` was called, which was no longer the case after the change.
This commit fixes that issue by handling `role=None` in the `get_cursor` function. As the defaulting of `role` to `any` is handled in a sub-call to `get_any_member`, it is safe for `get_cursor` to return the cursor when `role=None`.
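A rough sketch of the idea (simplified and with illustrative helpers, not the actual implementation):
```python
def get_cursor(cluster, connect_parameters, role=None, member_name=None):
    """Simplified illustration of the fix, not the actual implementation."""
    member = get_any_member(cluster, role=role, member_name=member_name)
    if member is None:
        raise PatroniCtlException('No connection to role={0} is available'.format(role))

    cursor = connect_to_member(member, connect_parameters)  # hypothetical helper

    # role=None is safe here: get_any_member() already treated it as 'any',
    # so any reachable member satisfies the request.
    if role is None or role == 'any':
        return cursor

    # For an explicit role, additionally verify the member's recovery state.
    ...
```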
Unit tests were updated accordingly.
References: PAT-204.
Due to a race condition Patroni was falsely assuming that the standby should be restarted because some recovery parameters (primary_conninfo or similar) were changed.
Close https://github.com/zalando/patroni/issues/2834
When running with the leader lock, Patroni was just setting the `role` label to `master`, so effectively the `kubernetes.standby_leader_label_value` feature never worked.
Now it is fixed, but in order not to introduce breaking changes, the default value of `standby_leader_label_value` is set to `master`.
Cluster.get_replication_slots() didn't take into account that there cannot be logical replication slots on the replicas of a standby cluster. It was only skipping logical slots for the standby leader, while replicas were still expecting that they would have to copy them over.
Also, on replicas in a standby cluster these logical slots were falsely added to the `_replication_slots` dict.
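Conceptually the fix means filtering logical slots for every member of a standby cluster, not just for the standby leader; a simplified sketch (names are illustrative):
```python
def get_replication_slots(permanent_slots, is_standby_cluster):
    """Return the slots a member should own; drop logical slots in a standby cluster."""
    if is_standby_cluster:
        # Logical slots can't exist anywhere in a standby cluster, neither on the
        # standby leader nor on its replicas, so never schedule copying them over.
        return {name: value for name, value in permanent_slots.items()
                if value.get('type') != 'logical'}
    return dict(permanent_slots)
```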
Prior to this commit `IntValidator` would always consider the value `0` invalid, even if it was within the allowed range.
The problem was that `parse_int` was correctly returning `0` in the following line:
```python
value = parse_int(value, self.base_unit) or ""
```
However, since `0` is falsy, the `or ""` part made the whole expression evaluate to an empty string.
As `parse_int` returns either an `int` when it is able to parse the value, or `None` otherwise, an `isinstance(value, int)` check is enough to error out when the value is not a valid `int`.
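A condensed sketch of the corrected check (simplified; `min_val`/`max_val` are illustrative attribute names, not necessarily the real ones):
```python
def validate(self, value) -> bool:
    """Return True if *value* parses to an int and lies within the allowed range."""
    value = parse_int(value, self.base_unit)
    # parse_int() returns an int (possibly 0) on success and None on failure,
    # so isinstance() keeps 0 valid instead of coercing it to "".
    return isinstance(value, int) \
        and (self.min_val is None or value >= self.min_val) \
        and (self.max_val is None or value <= self.max_val)
```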
Closes #2817
Consider the following scenario (with failsafe_mode enabled):
0. node1 (primary) - node2 (replica)
1. Stop all etcd nodes; wait `ttl` seconds; start all etcd nodes (node2's failsafe will contain the info about node1).
2. Switchover to node2 (node2's failsafe still contains the info about node1).
3. Stop all etcd nodes; wait `ttl` seconds; start all etcd nodes.
4. node2 will demote because it considers node1 to be the primary.
Resetting the failsafe state when running as a primary fixes the issue.
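In essence the fix amounts to something like this in the primary code path (the names are hypothetical):
```python
def process_healthy_cluster(self):
    if self.has_lock() and self.state_handler.is_primary():
        # A current primary must not keep the failsafe membership that was
        # captured while it was still a replica; otherwise a later DCS outage
        # makes it believe the old primary is still around.
        self._failsafe.reset()  # hypothetical method clearing cached members
```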
This PR is an attempt at refactoring the docs about migrating to Patroni.
These are a few enhancements that we propose through this PR:
* The docs used to mention that the procedure can only be performed on a single-node cluster. We changed that so the procedure considers a cluster composed of a primary and standbys;
* Teach how to deal with pre-existing replication slots;
* Explain how to create the user for `pg_rewind`, if the user intends to enable `use_pg_rewind`.
References: PAT-143.
Postgres supports two types of permissions:
1. owner only
2. group readable
By default the first one is used because it provides better security. But sometimes people want to run a backup tool as a user different from postgres, and in this case the second option becomes very useful. Unfortunately it didn't work correctly, because Patroni was creating files with owner-only permissions.
This PR changes the behavior: permissions on files and directories created by Patroni will be calculated based on the permissions of PGDATA, i.e. they will get group-readable access when necessary.
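The gist of the calculation as a sketch (the helper name is illustrative; only the file case is shown):
```python
import os
import stat


def file_mode_for_pgdata(pgdata: str) -> int:
    """Derive the mode of files created by Patroni from the PGDATA permissions."""
    pgdata_mode = stat.S_IMODE(os.stat(pgdata).st_mode)
    if pgdata_mode & stat.S_IRWXG:
        # PGDATA was initialized group-readable (e.g. initdb --allow-group-access),
        # so new files get 0640 (and directories would get 0750).
        return 0o640
    return 0o600  # default: owner-only access


print(oct(file_mode_for_pgdata(os.environ.get('PGDATA', '/var/lib/postgresql/data'))))
```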
Close #1899, close #1901
The docs of `slots` configuration used to have this mention:
```
my_slot_name: the name of replication slot. If the permanent slot name
matches with the name of the current primary it will not be created.
Everything else is the responsibility of the operator to make sure that
there are no clashes in names between replication slots automatically
created by Patroni for members and permanent replication slots.
```
However, that is not entirely accurate: Patroni does not check for
clashes between `my_slot_name` and the names of replication slots created
for replicating changes among members. If you specify a slot name that
clashes with the name of a replication slot used by a member, it turns
out Patroni will keep the slot permanent on the primary even if the member
key expires from the DCS.
Through this commit we also enhance the docs to explain that permanent
physical slots are maintained only on the primary, while logical
replication slots are copied from the primary to standbys.
Signed-off-by: Israel Barth Rubio <israel.barth@enterprisedb.com>
The node `postgresql2` in the repository is apparently supposed
to be a cascading standby.
However, if that is the case, there is a typo in the name of its
upstream node.
This commit fixes that issue.
In issue #2735 it was discussed that there should be some warning around PostgreSQL parameters that do not pass validation.
This commit ensures something is logged for parameters that fail validation and therefore fall back to default values.
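The logging side of it could look roughly like this (a sketch; the helper, message, and validator interface are illustrative):
```python
import logging

logger = logging.getLogger(__name__)


def effective_parameter_value(name, value, validator):
    """Return *value* if it passes validation, otherwise warn and use the default."""
    if validator.validate(value):  # hypothetical validator object
        return value
    logger.warning('Value %r of parameter %s does not pass validation, '
                   'falling back to the default', value, name)
    return validator.default  # hypothetical default attribute
```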
Close #2735, close #2740
Consider the following situation:
1. node1 is stressed so much that the Patroni heartbeat can't run regularly and the leader lock expires.
2. node2 notices that there is no leader, gets the lock, promotes, and ends up in the same situation as described in 1.
3. Patroni on node1 finally wakes up, notices that Postgres is running as a primary but without a leader lock, and "happily" acquires the lock.
That is, node1 discarded the promotion of node2, and after that it will not be possible to join node2 back to the cluster, because pg_rewind is not possible when the two nodes are on the same timeline.
To partially mitigate the problem we introduce an additional timeline check. If Postgres is running as a primary, Patroni will consider it a perfect candidate only if its timeline isn't behind the last known cluster timeline recorded in the `/history` key. If the Postgres timeline is behind the cluster timeline, Postgres will be demoted to read-only. Further behavior depends on the `maximum_lag_on_failover` and `check_timeline` settings.
Since the `/history` key isn't updated instantly after promotion, there is still a short period of time when the issue could happen, but it seems that it is close to impossible to make it more reliable.
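Schematically, the additional check amounts to something like this (simplified, with illustrative names):
```python
def primary_timeline_is_current(self, cluster):
    """Illustrative check: a running primary must not be behind the cluster timeline."""
    my_timeline = self.state_handler.get_current_timeline()   # hypothetical helpers
    cluster_timeline = cluster.timeline_from_history_key()    # last TL from /history
    if cluster_timeline and my_timeline < cluster_timeline:
        # Somebody else already promoted to a newer timeline: this node must not
        # "happily" grab the leader lock; it will be demoted to read-only instead.
        return False
    return True
```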
Close https://github.com/zalando/patroni/issues/2779
It was reported in #2779 that the primary was constantly logging messages like `Synchronous replication key updated by someone else`.
It happened after Patroni got stuck due to resource starvation. Key updates are performed using the create_revision/mod_revision fields, whose values are taken from the internal cache. Hence, it is a clear symptom of a stale cache.
Similar issues in the K8s implementation were addressed by invalidating the cache and restarting watcher connections every time an update failed due to a resource_version mismatch, so we do the same for Etcd3.
On Slack a user reported that Patroni didn't run pg_rewind on one of the nodes after coming out of maintenance mode.
Steps that were executed:
0. The initial state: node1 - primary, node2 - replica
1. `patronictl pause`
2. On node2: pg_ctl promote
3. On node1: pg_ctl stop
4. Patroni on node1 notices that Postgres isn't running and removes the leader lock.
5. Patroni on node2 notices that Postgres is running as a primary and takes the leader lock.
6. `patronictl resume`.
After that node1 started saying in logs:
`Waiting for checkpoint on node2 before rewind`.
Repeating these steps may not necessarily reproduce the problem, because presumably pg_rewind failed earlier on node1.
Such a situation was possible because the promote wasn't executed by Patroni, and therefore `Rewind._state` wasn't explicitly reset and the code that ensures a CHECKPOINT is run after promote wasn't triggered.
As a mitigation, the following changes have been made:
1. Retrigger the pg_rewind checks after coming out of maintenance mode.
2. Run the "ensure CHECKPOINT after promote" checks using the `Rewind._state != REWIND_STATUS.CHECKPOINT` condition (see the sketch below). This allowed removing a useless hook from `Postgresql.promote()`.
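In rough terms the second change means something like the following (illustrative, not the actual code):
```python
# Regardless of who executed the promote, run the "ensure CHECKPOINT after
# promote" logic until the rewind state confirms the checkpoint happened.
if postgresql.is_primary() and rewind.state != REWIND_STATUS.CHECKPOINT:
    rewind.ensure_checkpoint_after_promote()  # hypothetical helper name
```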
When there are no permanent logical slots Patroni was updating the `/status` key on every heartbeat loop, even when the LSN on the primary wasn't moving forward.
The issue was introduced in #2485.
1. Take client certificates only from the `ctl` section. Motivation: sometimes there are server-only certificates that can't be used as client certificates. As a result neither Patroni nor patronictl works correctly, even if the `--insecure` option is used.
2. Document that if `restapi.verify_client` is set to `required` then client certificates **must** be provided in the `ctl` section.
3. Add support for `ctl.authentication` and prefer it over `restapi.authentication` (see the sketch after this list).
4. Silence the annoying InsecureRequestWarning when `patronictl -k` is used, so that the behavior becomes similar to `curl -k`.
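A sketch of the intended precedence (the helper name is illustrative):
```python
def ctl_auth_config(config: dict) -> dict:
    """Pick the authentication settings for patronictl, preferring the ctl section."""
    ctl = config.get('ctl') or {}
    restapi = config.get('restapi') or {}
    # ctl.authentication wins; restapi.authentication is only a fallback.
    return ctl.get('authentication') or restapi.get('authentication') or {}
```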
Includes:
* renaming of `_unready_logical_slots` to better represent that it is a processing queue which is emptied on successful completion.
* ensuring that the return type is consistent.
* making logic variable names explicit to help explain how the decision of whether a slot is "ready" is made.
If one of the backends crashes, the postmaster stops all backends and performs crash recovery, because shared memory could be corrupted. If something happens during this phase, Postgres may crash completely, leaving pg_control in the "in crash recovery" state.
Followup on #2726