* Ensure that `nofailover` is always used if both the `nofailover` and
`failover_priority` tags are provided
* Call `_validate_failover_tags()` from `reload_local_configuration()` as well
* Properly check values in `_validate_failover_tags()`: the `nofailover` value should be cast to boolean, just like it is when the tag is accessed elsewhere (see the sketch below)
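For illustration only, a minimal sketch of the kind of check described above. The helper name matches the method mentioned, but the structure and the warning message are simplified assumptions, not Patroni's actual implementation:
```python
import logging

logger = logging.getLogger(__name__)


def _validate_failover_tags(tags: dict) -> None:
    """Warn when the ``nofailover`` and ``failover_priority`` tags contradict each other."""
    if 'nofailover' not in tags or 'failover_priority' not in tags:
        return

    # cast nofailover to boolean, exactly like it is cast when the tag is read elsewhere
    nofailover = bool(tags['nofailover'])
    failover_priority = int(tags['failover_priority'])

    # nofailover=True should correspond to failover_priority=0 and vice versa
    if nofailover != (failover_priority <= 0):
        logger.warning('Conflicting tags nofailover=%r and failover_priority=%r: '
                       'nofailover takes precedence', tags['nofailover'], failover_priority)
```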
Prior to this commit, if a user wanted to pass parameters to the custom bootstrap script call, they had to configure Patroni like this:
```
bootstrap:
  method: custom_method_name
  custom_method_name:
    command: /path/to/my/custom_script --arg1=value1 --arg2=value2 ...
```
This commit extends that to mirror the behavior of `create_replica_methods`, i.e., the following syntax is now also allowed:
```
bootstrap:
  method: custom_method_name
  custom_method_name:
    command: /path/to/my/custom_script
    arg1: value1
    arg2: value2
```
All keys in the mapping that are not recognized by Patroni are treated as additional named arguments and passed down to the `command` call.
References: PAT-218.
The priority is configured with the `failover_priority` tag. Possible values range from `0` upwards, where `0` means the node will never become the leader, the same as setting the `nofailover` tag to `true`. Consequently, only one of the `failover_priority` or `nofailover` tags should be set in the configuration file.
The failover priority only kicks in when more than one node has the same receive/replay LSN and is ahead of the other nodes in the cluster. In that case the node with the higher `failover_priority` value is preferred. A node with a higher receive/replay LSN will become the new leader even if it has a lower `failover_priority` (except when the priority is set to `0`).
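For illustration only, a rough sketch of the tie-break described above (this is not Patroni's leader-race code, which weighs many more factors):
```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    wal_position: int        # receive/replay LSN
    failover_priority: int   # 0 means the node may never become the leader


def pick_leader(candidates: List[Candidate]) -> Optional[Candidate]:
    # nodes with priority 0 (or nofailover: true) never take part in the race
    eligible = [c for c in candidates if c.failover_priority > 0]
    if not eligible:
        return None
    # the LSN is compared first, so a node that is further ahead always wins;
    # failover_priority only breaks ties between equally up-to-date nodes
    return max(eligible, key=lambda c: (c.wal_position, c.failover_priority))
```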
Close https://github.com/zalando/patroni/issues/2759
- Fixed issues with the `has_permanent_slots()` method. It didn't take into account permanent physical slots for members, falsely concluding that there are no permanent slots.
- Write to the status key only the LSNs of permanent slots (rather than of all slots that exist on the primary).
- Include `pg_current_wal_flush_lsn()` in the slots feedback, so that slots on standby nodes can be advanced
- Improved behave tests:
  - Verify that permanent slots are properly created on standby nodes
  - Verify that permanent slots are properly advanced, including DCS failsafe mode
  - Verify that only permanent slots are written to the `/status` key
The cache creates a lot of problems and prevents implementing automatic retention of physical replication slots for members with a configurable retention policy.
Instead, just read the entire cluster from ZooKeeper and use watchers only for the `/leader` and `/config` keys.
1. Introduce DEBUG logs for callbacks
2. Configure the log format in behave tests to include the filename, line, and method name that triggered the callback, and enable DEBUG logs for the `patroni.postgresql.callback_executor` module.
P.S. Unfortunately this works only starting from Python 3.8, but it should be good enough for debugging purposes because 3.7 is already EOL (see the sketch below).
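The Python 3.8 requirement presumably comes from the `stacklevel` argument of the logging methods, which lets a log record point at the code that triggered the callback rather than at the logging wrapper itself. A minimal standalone illustration; the format string, logger name usage, and function names are examples, not the exact behave configuration:
```python
import logging

# example format: filename, line number and function name of the caller
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s %(filename)s:%(lineno)d %(funcName)s() %(levelname)s %(message)s',
)

logger = logging.getLogger('patroni.postgresql.callback_executor')


def run_callback(action: str) -> None:
    # stacklevel=2 is only supported since Python 3.8; it makes the record
    # describe the caller of run_callback() instead of run_callback() itself
    logger.debug('callback triggered: %s', action, stacklevel=2)


def on_role_change() -> None:
    run_callback('on_role_change')


on_role_change()
```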
- get rid of sleeps
- reduce retry_timeout
- avoid gracefully shutting down Patroni while DCS is "paused"; just kill Patroni and then gracefully stop postgres
- don't try to delete the Pod when Patroni is killed; if the K8s API is paused it takes ages
The run time on my laptop is reduced from 2m to 1m28s.
To avoid creating too much load on ZooKeeper, Patroni doesn't watch for all changes of member keys, but only subscribes to changes (ZNodes added or deleted) in the `/member` directory. Therefore, when some important fields in the value are updated, we remove and recreate the ZNode in order to notify the leader or other members.
The leader should remove (and recreate) the member key only when the `checkpoint_after_promote` value changes, and replicas should do so only when the `state` changes to/from `running`.
We don't care about the `version` field, because the Patroni version can't change without a restart, which will cause the ZooKeeper `session_id` to change anyway.
This fix hopefully will reduce failures of behave tests on GH Actions.
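A small illustration of the watching mechanism described above, using kazoo's `ChildrenWatch`; the host and the members path are placeholders. Only additions and deletions of ZNodes are reported, which is why updating a member's value in place is not enough to notify the others:
```python
from kazoo.client import KazooClient
from kazoo.recipe.watchers import ChildrenWatch

client = KazooClient(hosts='127.0.0.1:2181')
client.start()


def members_changed(children):
    # fires only when a ZNode under the members directory is added or removed,
    # not when an existing member's value is updated in place
    print('member set changed:', sorted(children))
    return True  # keep the watch active


ChildrenWatch(client, '/service/batman/members', members_changed)
```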
Create permanent physical replication slots on standby nodes and use `pg_replication_slot_advance()` function to move them forward.
The `restart_lsn` is advanced based on values stored in the `/status` key by the primary node.
When a slot is created on a replica it could be ahead of the same slot on the primary, and therefore there is a period of time when it doesn't protect WAL files from being recycled.
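For reference, a minimal sketch of how a permanent physical slot can be moved forward with that function; the connection string, slot name, and target LSN are placeholders (Patroni takes the LSN from the `/status` key):
```python
import psycopg2

SLOT_NAME = 'my_permanent_slot'   # placeholder slot name
TARGET_LSN = '0/3000060'          # placeholder LSN published by the primary

conn = psycopg2.connect('dbname=postgres')
conn.autocommit = True
with conn.cursor() as cur:
    # advance restart_lsn of the slot on the standby up to the primary's position
    cur.execute('SELECT pg_replication_slot_advance(%s, %s::pg_lsn)',
                (SLOT_NAME, TARGET_LSN))
conn.close()
```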
* Use virtualenv to install tox in behave Dockerfile
An upstream change in the postgres docker image enforces Debian's restriction on
installing system-wide non-Debian Python packages. Debian doesn't
provide tox>=4, so we need to install it with pip.
* Exclude all output directories generated using `tox-wrapper.sh`
The `tox-wrapper.sh` script created by `features/Dockerfile` creates
directories like `features/output-tox-pg14-docker-behave-etcd-lin-973719674/`.
* Reduce footprint of tox behave docker image
1. Take client certificates only from the `ctl` section. Motivation: sometimes there are server-only certificates that can't be used as client certificates. As a result neither Patroni nor patronictl works correctly even if the `--insecure` option is used.
2. Document that if `restapi.verify_client` is set to `required` then client certificates **must** be provided in the `ctl` section.
3. Add support for `ctl.authentication` and prefer to use it over `restapi.authentication`.
4. Silence the annoying InsecureRequestWarning when `patronictl -k` is used, so that the behavior is similar to `curl -k` (see the sketch below).
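For point 4, a small standalone illustration of the underlying mechanism; the host, port, and endpoint are placeholders. Skipping certificate verification the way `-k` does normally makes urllib3 emit an InsecureRequestWarning, which is suppressed here:
```python
import urllib3

# equivalent of what curl -k does: no certificate verification, no warning noise
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

http = urllib3.PoolManager(cert_reqs='CERT_NONE')
response = http.request('GET', 'https://127.0.0.1:8008/cluster')
print(response.status, response.data.decode())
```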
To do that we use the `pg_stat_get_wal_receiver()` function, which is available since 9.6. For older versions the `patronictl list` output and REST API responses remain as before.
If there is no WAL receiver process, we check whether `restore_command` is set and show the state as `in archive recovery` (a rough sketch of this logic follows).
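A rough sketch of how the state could be derived, following the description above; this is not Patroni's actual code, the cursor is assumed to be connected to the replica, and the `current_setting()` lookup assumes PostgreSQL 12+ where `restore_command` is a regular GUC:
```python
def replication_state(cur) -> str:
    # pg_stat_get_wal_receiver() is available since 9.6 and reports the WAL
    # receiver status, e.g. 'streaming'; all columns are NULL if there is none
    cur.execute('SELECT status FROM pg_stat_get_wal_receiver()')
    row = cur.fetchone()
    if row and row[0]:
        return row[0]

    # no WAL receiver: report archive recovery if a restore_command is set
    cur.execute("SELECT current_setting('restore_command', true)")
    restore_command = cur.fetchone()[0]
    return 'in archive recovery' if restore_command else ''
```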
Example of `patronictl list` output:
```bash
$ patronictl list
+ Cluster: batman -------------+---------+---------------------+----+-----------+
|   Member    |      Host      |  Role   |        State        | TL | Lag in MB |
+-------------+----------------+---------+---------------------+----+-----------+
| postgresql0 | 127.0.0.1:5432 | Leader  | running             | 12 |           |
| postgresql1 | 127.0.0.1:5433 | Replica | in archive recovery | 12 |         0 |
+-------------+----------------+---------+---------------------+----+-----------+
$ patronictl list
+ Cluster: batman -------------+---------+-----------+----+-----------+
|   Member    |      Host      |  Role   |   State   | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgresql0 | 127.0.0.1:5432 | Leader  | running   | 12 |           |
| postgresql1 | 127.0.0.1:5433 | Replica | streaming | 12 |         0 |
+-------------+----------------+---------+-----------+----+-----------+
```
Example of REST API response:
```bash
$ curl -s localhost:8009 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-07-06 13:12:00.595118+02:00",
  "role": "replica",
  "server_version": 150003,
  "xlog": {
    "received_location": 335544480,
    "replayed_location": 335544480,
    "replayed_timestamp": null,
    "paused": false
  },
  "timeline": 12,
  "replication_state": "in archive recovery",
  "dcs_last_seen": 1688642069,
  "database_system_identifier": "7252327498286490579",
  "patroni": {
    "version": "3.0.3",
    "scope": "batman"
  }
}
$ curl -s localhost:8009 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-07-06 13:12:00.595118+02:00",
  "role": "replica",
  "server_version": 150003,
  "xlog": {
    "received_location": 335544816,
    "replayed_location": 335544816,
    "replayed_timestamp": null,
    "paused": false
  },
  "timeline": 12,
  "replication_state": "streaming",
  "dcs_last_seen": 1688642089,
  "database_system_identifier": "7252327498286490579",
  "patroni": {
    "version": "3.0.3",
    "scope": "batman"
  }
}
```
If we know for sure that a few moments ago postgres was still running as a primary, and we still hold the leader lock and can successfully update it, then we can safely start postgres back up not in recovery. This avoids bumping the timeline without a reason and hopefully improves reliability, because it addresses issues similar to #2720.
In addition, remove the `if self.state_handler.is_starting()` check from the `recover()` method. This branch could never be reached because the `starting` state is handled earlier in `_run_cycle()`. Also remove the redundant `self._crash_recovery_executed`.
P.S. We now do not cover cases where Patroni was killed along with Postgres.
Let's consider that we just started Patroni, there is no leader, and `pg_controldata` reports `Database cluster state` as `shut down`. It feels logical to use `Latest checkpoint location` and `Latest checkpoint's TimeLineID` to run the usual leader race and start directly as a primary, but that could be totally wrong. The thing is that we run `postgres --single` if a standby wasn't shut down cleanly before executing `pg_rewind`. As a result the `Database cluster state` transitions from `in archive recovery` to `shut down`, but if such a node becomes a leader the timeline must be increased.
Reorder some checks and verify that the old primary is already in the `running` state before checking replication. This check eliminates the race condition where replication has started to work but the node name is removed from `synchronous_standby_names` because the state isn't `running`.
When starting, check whether a node with the same name is registered in DCS and try to query its REST API.
If the REST API is accessible, exit with an error (a hedged sketch of this check follows).
Close #1804
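A hedged sketch of the startup check described above; the member attributes, timeout, and error handling are simplified assumptions, not Patroni's actual startup code:
```python
import sys
import urllib3


def ensure_unique_name(my_name: str, members) -> None:
    """Exit if another running instance is registered in DCS under the same name."""
    http = urllib3.PoolManager(cert_reqs='CERT_NONE')
    for member in members:
        if member.name != my_name or not member.api_url:
            continue
        try:
            http.request('GET', member.api_url, timeout=3.0)
        except Exception:
            continue  # stale member key: the other instance does not answer
        sys.exit('Node {0} is already running at {1}'.format(my_name, member.api_url))
```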
Previously, integer GUCs that have no unit, for example `max_connections`, would fail validation, but only when they were specified as a string.
This causes problems if `max_connections` is configured in `patroni.yaml` as a string; for example, the following configuration would not result in the right `max_connections` setting, because `max_connections` is given as a string:
bootstrap:
  dcs:
    postgresql:
      parameters:
        log_checkpoints: "on"
        log_connections: "off"
        max_connections: "57"
Allowing a user to specify *all* parameters as strings was accepted in Patroni before and also seems very useful, as many of us use Ansible/Helm/Golang to build a Patroni configuration, where creating a `map[string]string` is easier than dealing with data types.
Attempts to address issue #2735
The regression was introduced in 76b3b99de2
* Reduce flakiness of citus behave tests
- make a few attempts with a timeout when checking registered nodes
- get rid of the artificial sleep
- allow the check_registration() function to check secondaries
These changes are useful for Quorum-based failover (#2668) and a future PR
that enhances Citus support by registering secondaries in `pg_dist_node`.
Starting from 1.27 there is a containerd process that also uses the k3s binary and is detected by pidof. Therefore we search for the "k3s server" string in the process list instead of just "k3s".
Allows running behave tests with an alternative base image
instead of the official postgres image.
Also allows providing PG_USER/PG_GROUP should they differ from the
default `postgres`.