213 Commits

Author SHA1 Message Date
Polina Bungina
5dbfc9401b Implement kubernetes.bootstrap_labels (#3257)
Allow defining labels that will be assigned to a postgres instance pod while it is in the 'initializing new cluster', 'running custom bootstrap script', 'starting after custom bootstrap', or 'creating replica' state.
2025-02-18 09:37:22 +01:00
Alexander Kukushkin
0d87270897 Don't touch logical failover slots (#3245)
If a logical replication slot is created with the `failover => true` option, the
respective field is set to true in the `pg_replication_slots` view.

By avoiding interacting with such slots we make logical failover slots
feature fully functional in PG17.
2025-02-14 08:35:37 +01:00
Alexander Kukushkin
8de904e556 Improve replication_state=streaming check in behave (#3269)
it was somewhat flaky
2025-02-10 11:04:58 +01:00
Alexander Kukushkin
0bb12473fb Fix bug with slot for former leader not retained on failover (#3261)
The problem existed because the _build_retain_slots() method falsely relied on members being present in DCS, while on failover the member key for the former leader expires at exactly the same time.
2025-02-04 13:39:19 +01:00
Alexander Kukushkin
34b2a77294 Fix race condition in priority sync behave tests (#3263)
don't try patching /config key before leader managed to create it.
2025-01-31 16:45:26 +01:00
Polina Bungina
39f5de2e77 Implement sync_priority tag (#3223) 2024-12-10 14:57:47 +01:00
Kian-Meng Ang
4ce0f99cfb Fix typos (#3204)
Found via `codespell -H` and `typos --hidden --format brief`
2024-11-12 10:06:53 +01:00
Polina Bungina
7dcb9b9840 Run on_role_change cb after a failed primary recovery (#3198)
Additionally run on_role_change callback in post_recover() for a primary
that failed to start after a crash to increase chances the callback is executed,
even if the further start as a replica fails

---------

Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
2024-10-31 09:22:51 +01:00
Alexander Kukushkin
d7e172c20a Don't retain member slots on nodes with nofailover tag (#3169)
Followup on #3142
2024-09-17 11:21:54 +02:00
Alexander Kukushkin
416a0f7c8b Use names with "unusual" symbols in behave tests (#3162)
It'll hopefully prevent problems like #3142 in future.
2024-09-16 09:35:22 +02:00
Alexander Kukushkin
d5d6a51e2c Make sure inactive hot physical replication slots don't hold xmin (#3148)
Since `3.2.0` Patroni is able to create physical replication slots on replica nodes, in case such a node becomes the primary at some point.
There are two potential problems with having such slots:
1. They prevent recycling of WAL files.
2. They may affect vacuum on the primary if hot_standby_feedback is enabled.

The first class of issues is already addressed by periodically calling the pg_replication_slot_advance() function.
However, the second class of issues doesn't happen instantly, but only when the old primary is switched to a replica. In this case physical replication slots that were active at some moment will hold a NOT NULL value of `xmin`, which will be propagated to the primary via the hot_standby_feedback mechanism.

To address the second problem we detect that a physical replication slot is not supposed to be active but has a NOT NULL `xmin`, and drop/recreate it.
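
The detection rule described above can be sketched roughly as follows. This is only an illustration, not the code from the commit: the function name `slots_to_recreate` and the `expected_active` argument are hypothetical, and the input is assumed to be rows shaped like `pg_replication_slots` (`slot_name`, `slot_type`, `xmin`).

```python
from typing import Any, Dict, List, Set


def slots_to_recreate(slots: List[Dict[str, Any]], expected_active: Set[str]) -> List[str]:
    """Return names of physical slots that hold xmin although they should be idle.

    `slots` mimics rows of pg_replication_slots; `expected_active` is a
    hypothetical set of slot names that are legitimately in use right now.
    """
    stale = []
    for slot in slots:
        is_physical = slot["slot_type"] == "physical"
        should_be_idle = slot["slot_name"] not in expected_active
        # A NOT NULL xmin on an idle physical slot would be propagated to the
        # primary via hot_standby_feedback, so the slot is a drop/recreate candidate.
        if is_physical and should_be_idle and slot["xmin"] is not None:
            stale.append(slot["slot_name"])
    return stale
```

Logical slots are skipped in this sketch because the message only talks about physical slots.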

Close #3146
Close #3153

Co-authored-by: Polina Bungina <27892524+hughcapet@users.noreply.github.com>
2024-09-10 08:24:26 +02:00
Alexander Kukushkin
b470ade20e Change master->primary, take two (#3127)
This commit is a breaking change:
1. `role` in DCS is written as "primary" instead of "master".
2. `role` in REST API responses is also written as "primary".
3. REST API no longer accepts role=master in requests (for example switchover/failover/restart endpoints).
4. `/metrics` REST API endpoint will no longer report `patroni_master`.
5. `patronictl` no longer accepts `--master` argument.
6. `no_master` option in declarative configuration of custom replica creation methods is no longer treated as a special option, please use `no_leader` instead.
7. `patroni_wale_restore` doesn't accept `--no_master` anymore.
8. `patroni_barman` doesn't accept `--role=master` anymore.
9. callback scripts will be executed with role=primary instead of role=master
10. On Kubernetes Patroni by default will set role label to primary. If you want to keep the old behavior and avoid downtime or lengthy complex migrations, you can configure `kubernetes.leader_label_value` and `kubernetes.standby_leader_label_value` to `master`.

However, a few exceptions regarding master are still in place:
1. `GET /master` REST API endpoint will continue to work.
2. `master_start_timeout` and `master_stop_timeout` in global configuration are still accepted.
3. `master` tag is still preserved in Consul services in addition to `primary`.

Rationale for these exceptions: DBA doesn't always 100% control the infrastructure and can't adjust the configuration.
2024-08-28 17:19:00 +02:00
Alexander Kukushkin
6d65aa311a Configurable retention of members replication slots (#3108)
A current problem of Patroni that strikes many people is that it removes the replication slot for a member whose key has expired from DCS. As a result, when the replica comes back from scheduled maintenance, WAL segments could already be absent, and it can't continue streaming without pulling files from the archive.
With PostgreSQL 16 and newer we get another problem: a logical slot on a standby node could be invalidated if the physical replication slot on the primary was removed (and `pg_catalog` vacuumed).
The most problematic environment is Kubernetes, where the slot is removed nearly instantly when the member Pod is deleted.

So far, one of the recommended solutions was to configure permanent physical slots with names that match member names to avoid removal of replication slots. It works, but depending on environment might be non-trivial to implement (when for example members may change their names).

This PR implements support of `member_slots_ttl` global configuration parameter, that controls for how long member replication slots should be kept when the member key is absent. Default value is set to `30min`.
The feature is supported only starting from PostgreSQL 11 and newer, because we want to retain slots not only on the leader node, but on all nodes that could potentially become the new leader, and they should be moved forward using `pg_replication_slot_advance()` function.

One can disable the feature and get back to the old behavior by setting `member_slots_ttl` to `0`.
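
A minimal sketch of the retention rule, assuming timestamps in seconds and a hypothetical `last_seen` mapping of member name to the time its key was last present in DCS. The function name and arguments are illustrative and not Patroni's internal API.

```python
from typing import Dict, Set


def members_to_retain(present: Set[str], last_seen: Dict[str, float],
                      now: float, member_slots_ttl: float) -> Set[str]:
    """Return member names whose replication slots should be kept.

    Slots of members currently present in DCS are always kept. Slots of absent
    members are kept only while they were seen within `member_slots_ttl`
    seconds; a TTL of 0 restores the old behavior (drop immediately).
    """
    keep = set(present)
    if member_slots_ttl <= 0:
        return keep
    for name, seen_at in last_seen.items():
        if now - seen_at <= member_slots_ttl:
            keep.add(name)
    return keep
```

With the default of `30min` (1800 seconds), a member that disappeared 500 seconds ago keeps its slot, while one that disappeared 1900 seconds ago does not.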
2024-08-23 14:50:36 +02:00
Alexander Kukushkin
93eb4edbe6 Reformat imports with isort (#3123)
Besides that:
1. Introduce `setup.py isort` for quick check
2. Introduce GH actions to check imports
2024-08-13 17:53:59 +02:00
Alexander Kukushkin
0fa41502f1 Register Citus secondaries in pg_dist_node (#2755)
1. All nodes with role == 'replica' and state == 'running' are registered. If the state isn't running, the node is removed.
2. In case of failover/switchover we always first update the primary
3. When switching to a registered secondary we call citus_update_node() three times: rename primary to primary-demoted, put the primary name to a promoted secondary row and put the promoted secondary name to the primary row

State transitions are produced by the transition() method. First of all the method makes sure that the actual primary is registered in the metadata. If the primary didn't change for a given group, the method registers new secondaries and removes secondaries that are gone. It prefers to use the citus_update_node() UDF to replace gone secondaries with added ones.

Communication protocol between primary nodes remains the same and all old features work without any changes.
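
The three `citus_update_node()` calls from item 3 can be sketched as a plan of `(nodeid, new_name, new_port)` tuples. This is an illustration under assumptions: the helper name `swap_calls` is hypothetical, and the exact form of the temporary name (a `-demoted` suffix) is inferred from the wording "primary-demoted".

```python
from typing import List, Tuple

# (nodeid, new node name, new node port), i.e. arguments for citus_update_node()
Call = Tuple[int, str, int]


def swap_calls(primary_id: int, primary_host: str, primary_port: int,
               secondary_id: int, secondary_host: str, secondary_port: int) -> List[Call]:
    """Plan the three updates that swap a primary row with a promoted secondary row."""
    return [
        # 1. Park the old primary under a temporary name.
        (primary_id, primary_host + "-demoted", primary_port),
        # 2. Put the old primary's name into the promoted secondary's row.
        (secondary_id, primary_host, primary_port),
        # 3. Put the promoted secondary's name into the primary row.
        (primary_id, secondary_host, secondary_port),
    ]


SQL = "SELECT pg_catalog.citus_update_node(%s, %s, %s)"
```

Each tuple could then be passed as parameters to `SQL` on the coordinator. The temporary rename in step 1 presumably keeps two rows from sharing the same name and port at any point during the swap.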
2024-08-13 09:12:03 +02:00
Alexander Kukushkin
384705ad97 Quorum based failover (#2668)
To enable quorum commit:
```diff
$ patronictl.py edit-config
--- 
+++ 
@@ -5,3 +5,4 @@
   use_pg_rewind: true
 retry_timeout: 10
 ttl: 30
+synchronous_mode: quorum

Apply these changes? [y/N]: y
Configuration changed
```

By default Patroni will use `ANY 1(list,of,standbys)` in `synchronous_standby_names`. That is, only one node out of the listed replicas will be used for quorum.
If you want to increase the number of quorum nodes it is possible to do it with:
```diff
$ patronictl edit-config
--- 
+++ 
@@ -6,3 +6,4 @@
 retry_timeout: 10
 synchronous_mode: quorum
 ttl: 30
+synchronous_node_count: 2

Apply these changes? [y/N]: y
Configuration changed
```

Good old `synchronous_mode: on` is still supported.
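
For illustration, the `ANY n (...)` value described above could be assembled like this. The helper is hypothetical; sorting the names, quoting them, and capping `n` at the number of listed standbys are assumptions of this sketch rather than confirmed Patroni behavior.

```python
from typing import Iterable


def quorum_standby_names(standbys: Iterable[str], synchronous_node_count: int = 1) -> str:
    """Build a quorum-style synchronous_standby_names value, e.g. ANY 1 ("a","b")."""
    names = sorted(standbys)
    if not names:
        return ""
    # Assumption: never ask for more confirmations than there are standbys.
    num = max(1, min(synchronous_node_count, len(names)))
    quoted = ",".join('"{0}"'.format(name) for name in names)
    return "ANY {0} ({1})".format(num, quoted)
```

With `synchronous_node_count: 2` and three standbys the result is `ANY 2 ("a","b","c")`, meaning a commit waits for any two of the three.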

Close https://github.com/patroni/patroni/issues/664
Close https://github.com/zalando/patroni/pull/672
2024-08-13 08:51:01 +02:00
Alexander Kukushkin
b1d442e7a4 Advance permanent slots for cascading nodes while in failsafe (#3100)
Let's consider the following replication setup:
```
primary->standby1->standby2(replicatefrom: standby1)
```

In this case the `primary` will not create a physical replication slot for standby2, because it is streaming from the `standby1`.

Things look different if we have the following dynamic configuration:
```yaml
slots:
    primary:
        type: physical
    standby1:
        type: physical
    standby2:
        type: physical
```

In this case `primary` will also have `standby2` physical replication slot, which periodically must be advanced. So far it was working by taking value of `xlog_location` from the `/members/standby2` key in DCS.

But when DCS is down and failsafe mode is active, the `standby2` physical slot on the `primary` will not be moved, because there was no way to get the latest value of `xlog_location`.

This PR addresses the problem by making replica nodes return their `xlog_location` as the `lsn` header in the response to the `POST /failsafe` REST API request. The current primary will use these values to advance replication slots for nodes with the `replicatefrom` tag.
2024-07-17 16:28:30 +02:00
Polina Bungina
6e1f9f7a6e Prepare repo migration (#3085) 2024-06-17 09:04:43 +02:00
Polina Bungina
14a44e14ba Re-enable SSL for MacOS GH action runners (#3005) 2024-06-12 13:28:01 +02:00
Polina Bungina
ae53260030 Extend behave tests with nostream feature (#3036)
Check state and permanent logical replication slots behaviour
2024-03-29 12:54:40 +01:00
Alexander Kukushkin
2ac1efea54 Optimize priority failover behave tests (#3004)
1. get rid of useless sleep calls
2. call `POST /failover` on the node where we want to failover to
2024-01-15 12:03:14 +01:00
Polina Bungina
71ccf91e36 Don't filter out contradictory nofailover tag (#2992)
* Ensure that nofailover will always be used if both nofailover and
failover_priority tags are provided
* Call _validate_failover_tags from reload_local_configuration() as well
* Properly check values in the _validate_failover_tags(): nofailover value should be casted to boolean like it is done when accessed in other places
2024-01-02 09:30:18 +01:00
Alexander Kukushkin
a4e0a2220d Disable SSL for MacOS GH action runners (#2976)
The latest runners release (20231127.1) somehow broke our tests. Connections to postgres fail with a strange error:
```
could not accept SSL connection: Socket operation on non-socket
```
2023-12-06 15:28:03 +01:00
Alexander Kukushkin
7c3ce78231 Fix Citus transaction rollback condition check (#2964)
It seems that sometimes we get an exact match, which makes behave tests fail.
2023-11-29 08:44:35 +01:00
Israel
bb90feb393 Add support for additional parameters on custom bootstrap (#2927)
Before this commit, a user who wanted to add parameters to the custom bootstrap script call had to configure Patroni like this:

```
bootstrap:
  method: custom_method_name
  custom_method_name:
    command: /path/to/my/custom_script --arg1=value1 --arg2=value2 ...
```

This commit extends that so we achieve a similar behavior that is seen when using `create_replica_methods`, i.e., we also allow the following syntax:

```
bootstrap:
  method: custom_method_name
  custom_method_name:
    command: /path/to/my/custom_script
    arg1: value1
    arg2: value2
```

All keys in the mapping which are not recognized by Patroni, will be dealt with as if they were additional named arguments to be passed down to the `command` call.
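
A rough sketch of how such a mapping could be turned into a command line. The set of recognized keys below is an assumption made for this illustration, and `build_command` is a hypothetical helper, not the function added by the commit.

```python
import shlex
from typing import Any, Dict, List

# Assumption: keys that Patroni itself interprets and therefore does not forward.
RECOGNIZED_KEYS = {"command", "keep_existing_recovery_conf", "no_params", "recovery_conf"}


def build_command(method_config: Dict[str, Any]) -> List[str]:
    """Turn a custom bootstrap method section into an argv list.

    Every key that is not recognized is appended as a `--key=value` argument,
    matching the `--arg1=value1` style of the first example.
    """
    argv = shlex.split(method_config["command"])
    for key, value in method_config.items():
        if key not in RECOGNIZED_KEYS:
            argv.append("--{0}={1}".format(key, value))
    return argv
```

For the second configuration above this yields `/path/to/my/custom_script --arg1=value1 --arg2=value2`, the same call as the first configuration spells out by hand.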

References: PAT-218.
2023-10-25 15:01:08 +02:00
Mark Pekala
f5ee67fa1c Feature: failover priority (#2780)
The priority is configured with the `failover_priority` tag. Possible values are from `0` to infinity, where `0` means that the node will never become the leader, which is the same as the `nofailover` tag set to `true`. As a result, in the configuration file one should set only one of the `failover_priority` or `nofailover` tags.

The failover priority kicks in only when more than one node has the same receive/replay LSN and these nodes are ahead of the other nodes in the cluster. In this case the node with the higher value of `failover_priority` is preferred. If there is a node with a higher value of receive/replay LSN, it will become the new leader even if it has a lower value of `failover_priority` (except when priority is set to 0).
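
The ordering described above can be sketched as a comparison on the pair (LSN, priority). The `Candidate` type, the `pick_leader` function, and the default priority of `1` are assumptions made for this illustration; the real leader race is distributed and more involved.

```python
from typing import Iterable, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    lsn: int                    # receive/replay LSN as an integer
    failover_priority: int = 1  # assumed default; 0 means "never promote"


def pick_leader(candidates: Iterable[Candidate]) -> Optional[str]:
    """Pick the node with the highest LSN, using failover_priority only as a tie-breaker."""
    eligible = [c for c in candidates if c.failover_priority > 0]
    if not eligible:
        return None
    # Tuples compare element by element: LSN first, priority second.
    best = max(eligible, key=lambda c: (c.lsn, c.failover_priority))
    return best.name
```

A node with a higher LSN wins regardless of priority, and a node with priority `0` is never chosen even when it is furthest ahead.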

Close https://github.com/zalando/patroni/issues/2759
2023-10-24 12:22:48 +02:00
Alexander Kukushkin
c5fffb3c97 Further work on permanent physical slots (#2891)
- Fixed issues with the has_permanent_slots() method. It didn't take into account the case of permanent physical slots for members, falsely concluding that there are no permanent slots.
- Write to the status key only LSNs for permanent slots (not just for slots that exist on the primary).
  - Include pg_current_wal_flush_lsn() to slots feedback, so that slots on standby nodes could be advanced
- Improved behave tests:
  - Verify that permanent slots are properly created on standby nodes
  - Verify that permanent slots are properly advanced, including DCS failsafe mode
  - Verify that only permanent slots are written to the `/status`
2023-10-23 08:24:28 +02:00
Alexander Kukushkin
e513f7f127 Attempt to reduce flakiness for recovery behave test on K8s (#2917)
wait until Postgres is properly started after the first crash before changing `primary_start_timeout` and killing it once again.
2023-10-17 11:27:41 +02:00
Alexander Kukushkin
aa3ebe0af8 Don't cache anything in Zookeeper implementation (#2909)
Cache creates a lot of problems and prevents implementing a feature of automatic retention of physical replication slots for members with configurable retention policy.

Just read the entire cluster from Zookeeper instead and use watchers only for the `/leader` and `/config` keys.
2023-10-17 08:56:31 +02:00
Alexander Kukushkin
c96e35c807 Enable Citus behave tests for Postgres v16 (#2914)
and reduce flakiness
2023-10-16 16:05:27 +02:00
Alexander Kukushkin
42976df86f Make it easier to debug callbacks (#2902)
1. Introduce DEBUG logs for callbacks
2. Configure log format in behave tests to include filename, line, and method name that triggered the callback and enable DEBUG logs for `patroni.postgresql.callback_executor` module.

P.S. unfortunately it works only starting from python 3.8, but it should be good enough for debug purpose because 3.7 is already EOL.
2023-10-16 08:55:07 +02:00
Polina Bungina
fb367cd73e Change cb checks in standby cluster behave test (#2899)
fix and extend callback content checks
2023-10-10 13:49:52 +02:00
Polina Bungina
efacc6c16b Ignore synchronous_mode setting in a standby cluster (#2896)
is_synchronous_mode() should always return False in standby clusters
2023-10-06 10:21:37 +02:00
Alexander Kukushkin
f77073c8e1 Speed up dcs failsafe behave tests (#2890)
- get rid of sleeps
- reduce retry_timeout
- avoid graceful Patroni shut down while DCS is "paused", just kill
  Patroni and after that gracefully stop postgres
- don't try to delete Pod when Patroni is killed. If K8s API is paused it takes ages

The run time on my laptop is reduced from 2m to 1m28s.
2023-09-28 10:44:11 +02:00
Alexander Kukushkin
48514db84b Take into account current role when deciding on removal of member ZNode (#2884)
Patroni doesn't watch on all changes of member keys in order to not create too much load on ZooKeeper, but only subscribes to changes (ZNodes added or deleted) in the `/member` directory. Therefore when some important fields in the value are updated we remove and recreate ZNode in order to notify the leader or other members.

The leader should remove the member key only when the `checkpoint_after_promote` value is changed and replicas when the `state` is changed to/from `running`.
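
The rule above, sketched as a predicate. The function name and the dict-based representation of the member value are hypothetical; only the two conditions come from the message.

```python
from typing import Any, Dict


def should_recreate_member_znode(is_leader: bool, old: Dict[str, Any], new: Dict[str, Any]) -> bool:
    """Decide whether the member ZNode must be removed and recreated to notify watchers."""
    if is_leader:
        # The leader only signals a change of checkpoint_after_promote.
        return old.get("checkpoint_after_promote") != new.get("checkpoint_after_promote")
    # Replicas only signal transitions to or from the `running` state.
    return (old.get("state") == "running") != (new.get("state") == "running")
```

A replica going from `starting` to `running` triggers a recreate, while a change between two non-running states does not.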

We don't care about the `version` field, because the Patroni version can't be changed without a restart, which will cause the ZooKeeper `session_id` to change anyway.

This fix hopefully will reduce failures of behave tests on GH Actions.
2023-09-26 09:12:31 +02:00
Alexander Kukushkin
bc15813de0 Permanent physical slots on standby nodes (#2852)
Create permanent physical replication slots on standby nodes and use `pg_replication_slot_advance()` function to move them forward.

The `restart_lsn` is advanced based on values stored in the `/status` key by the primary node.

When a slot is created on a replica it could be ahead of the same slot on the primary, and therefore there is some period of time when it doesn't protect WAL files from being recycled.
2023-09-20 16:50:37 +02:00
Matt Baker
d2603402ea Debian docker image pip error (#2849)
* Use virtualenv to install tox in behave Dockerfile

Upstream change in postgres docker image uses debian restriction on
installing system-wide non-debian python packages. Debian doesn't
provide a tox>=4, so we need to install with pip.

* Exclude all output directories generated using `tox-wrapper.sh`

The `tox-wrapper.sh` script created by `features/Dockerfile` creates
directories like features/output-tox-pg14-docker-behave-etcd-lin-973719674/

* Reduce footprint of tox behave docker image
2023-09-04 21:24:26 +02:00
Alexander Kukushkin
7e89583ec7 Please new flake8 (#2789)
it stopped liking lack of space character between `,` and `\`
```python
foo,\
    bar
```
2023-07-31 09:08:46 +02:00
Alexander Kukushkin
06db296612 Fixes in patroni.request (#2768)
1. Take client certificates only from the `ctl` section. Motivation: sometimes there are server-only certificates that can't be used as client certificates. As a result neither Patroni nor patronictl works correctly even if the `--insecure` option is used.
2. Document that if `restapi.verify_client` is set to `required` then client certificates **must** be provided in the `ctl` section.
3. Add support for `ctl.authentication` and prefer to use it over `restapi.authentication`.
4. Silence annoying InsecureRequestWarning when `patronictl -k` is used, so that the behavior becomes similar to `curl -k`.
2023-07-25 08:48:18 +02:00
Alexander Kukushkin
0a8fb0860e Skip flaky scenario when running with Raft (#2771)
Sometimes Patroni doesn't see the latest Raft data on start.
2023-07-21 16:09:34 +02:00
Alexander Kukushkin
d46ca88e6b Make it visible replication state on standbys (#2733)
To do that we use `pg_stat_get_wal_receiver()` function, which is available since 9.6. For older versions the `patronictl list` output and REST API responses remain as before.

If there is no WAL receiver process, we check whether `restore_command` is set and show the state as `in archive recovery`.
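
A small sketch of that decision, under assumptions: `wal_receiver_status` stands for the status reported by `pg_stat_get_wal_receiver()` (or `None` when there is no WAL receiver process), and returning `None` for every other case is a simplification made here, not confirmed behavior.

```python
from typing import Optional


def replication_state(wal_receiver_status: Optional[str],
                      restore_command: Optional[str]) -> Optional[str]:
    """Derive the replication state shown for a standby node."""
    if wal_receiver_status == "streaming":
        return "streaming"
    if wal_receiver_status is None and restore_command:
        # No WAL receiver, but WAL can still be fetched from the archive.
        return "in archive recovery"
    # Assumption of this sketch: nothing is reported in any other case.
    return None
```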

Example of `patronictl list` output:
```bash
$ patronictl list
+ Cluster: batman -------------+---------+---------------------+----+-----------+
| Member      | Host           | Role    | State               | TL | Lag in MB |
+-------------+----------------+---------+---------------------+----+-----------+
| postgresql0 | 127.0.0.1:5432 | Leader  | running             | 12 |           |
| postgresql1 | 127.0.0.1:5433 | Replica | in archive recovery | 12 |         0 |
+-------------+----------------+---------+---------------------+----+-----------+

$ patronictl list
+ Cluster: batman -------------+---------+-----------+----+-----------+
| Member      | Host           | Role    | State     | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgresql0 | 127.0.0.1:5432 | Leader  | running   | 12 |           |
| postgresql1 | 127.0.0.1:5433 | Replica | streaming | 12 |         0 |
+-------------+----------------+---------+-----------+----+-----------+
```

Example of REST API response:
```bash
$ curl -s localhost:8009 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-07-06 13:12:00.595118+02:00",
  "role": "replica",
  "server_version": 150003,
  "xlog": {
    "received_location": 335544480,
    "replayed_location": 335544480,
    "replayed_timestamp": null,
    "paused": false
  },
  "timeline": 12,
  "replication_state": "in archive recovery",
  "dcs_last_seen": 1688642069,
  "database_system_identifier": "7252327498286490579",
  "patroni": {
    "version": "3.0.3",
    "scope": "batman"
  }
}

$ curl -s localhost:8009 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-07-06 13:12:00.595118+02:00",
  "role": "replica",
  "server_version": 150003,
  "xlog": {
    "received_location": 335544816,
    "replayed_location": 335544816,
    "replayed_timestamp": null,
    "paused": false
  },
  "timeline": 12,
  "replication_state": "streaming",
  "dcs_last_seen": 1688642089,
  "database_system_identifier": "7252327498286490579",
  "patroni": {
    "version": "3.0.3",
    "scope": "batman"
  }
}
```
2023-07-13 09:24:20 +02:00
Alexander Kukushkin
6e96db173f Start postgres not in recovery in some cases (#2726)
If we know for sure that a few moments ago postgres was still running as a primary, and we still have the leader lock and can successfully update it, we can safely start postgres back not in recovery. That allows us to avoid bumping the timeline without a reason and hopefully improves reliability, because it addresses issues similar to #2720.

In addition to that remove `if self.state_handler.is_starting()` check from the `recover()` method. This branch could never be reached because the `starting` state is handled earlier in the `_run_cycle()`. Besides that remove redundant `self._crash_recovery_executed`.

P.S. now we do not cover cases when Patroni was killed along with Postgres.
Let's consider that we just started Patroni, there is no leader, and `pg_controldata` reports `Database cluster state` as `shut down`. It feels logical to use `Latest checkpoint location` and `Latest checkpoint's TimeLineID` to do a usual leader race and start directly as a primary, but it could be totally wrong. The thing is that we run `postgres --single` if the standby wasn't shut down cleanly before executing `pg_rewind`. As a result `Database cluster state` transitions from `in archive recovery` to `shut down`, but if such a node becomes a leader the timeline must be increased.
2023-07-12 09:42:34 +02:00
Alexander Kukushkin
b8cff3515a Reduce flakiness of citus behave tests, take 2 (#2742)
Reorder some checks and verify that the old primary is already in the `running` state before checking replication. This check eliminates the race condition where replication has started to work but the node name is removed from `synchronous_standby_names` because the state isn't `running`.
2023-07-11 15:04:10 +02:00
Mark Pekala
412c51ddf1 Prevent splitbrain from duplicate names in configuration (#2724)
When starting, check whether a node with the same name is registered in DCS and try to query its REST API.
If the REST API is accessible, exit with an error.

Close #1804
2023-07-11 07:43:57 +02:00
Feike Steenbergen
4725f12f9a Allow integer gucs without units in validation (#2734)
Previously, integer GUCs that have no unit, for example `max_connections`, would not pass the validation if they were specified as a string.

This causes problems if the `max_connections` is configured in `patroni.yaml` as a string, for example, the following configuration would not result in the right `max_connections` settings, as `max_connections` is configured as a string:

    bootstrap:
      dcs:
        postgresql:
          parameters:
            log_checkpoints: "on"
            log_connections: "off"
            max_connections: "57"

Allowing a user to specify *all* parameters as a string was accepted before in Patroni and also seems very useful, as many of us will be using Ansible/Helm/Golang to build a Patroni configuration, in which creating a `map[string]string` is easier than having to deal with data types.
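
A minimal sketch of the intended behavior, assuming a hypothetical helper `parse_int_guc`; it is not the validator code touched by this commit. The point is that `57` and `"57"` should be treated the same way for a unit-less integer setting.

```python
import re
from typing import Optional, Union

_INT_RE = re.compile(r"[+-]?\d+", re.ASCII)


def parse_int_guc(value: Union[int, str]) -> Optional[int]:
    """Parse a unit-less integer setting given as an int or as a string.

    Returns None when the value is not a plain integer (for example "57MB").
    """
    if isinstance(value, bool):
        # bool is a subclass of int in Python; it is not a valid integer GUC value.
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, str) and _INT_RE.fullmatch(value.strip()):
        return int(value.strip())
    return None
```

With this, the `max_connections: "57"` entry from the configuration above parses to the integer `57`.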

Attempts to address issue #2735

Regression was introduced in 76b3b99de2
2023-07-10 13:44:54 +02:00
Alexander Kukushkin
1c36112b44 Reduce flakiness of citus behave tests (#2728)
* Reduce flakiness of citus behave tests

- make a few attempts with timeout when checking registered nodes
- get rid of artificial sleep
- allow check_registration() function to check secondaries

These changes are useful for Quorum based failover (#2668) and future PR
that enhances Citus support by registering secondaries in `pg_dist_node`.
2023-07-07 15:23:04 +03:00
Alexander Kukushkin
af318b2473 Fix kubernetes behave tests (#2707)
Starting from 1.27 there is a containerd process, which also uses the k3s binary and is detected by pidof. Therefore we will search for the "k3s server" string in the process list instead of just "k3s".
2023-06-01 13:28:29 +02:00
Matt Baker
2158f4a87b Add base image build arg for alt postgres (#2695)
Allows for running behave tests with an alternative base image
than the official postgres image.

Also provides a PG_USER/PG_GROUP should that be different to the
default `postgres`.
2023-05-26 09:35:12 +02:00
Matt Baker
73797e8572 Add tox configuration for running multiple test envs (#2603) 2023-05-24 10:58:04 +02:00
Polina Bungina
ab9fea7d6b Fix openssl certificate generation in behave tests (#2672)
--addext -> -addext (doesn't work on macOS)
set keyfile permissions to 600 (to avoid "private key file has group or world access")
2023-05-12 10:42:53 +02:00