2505 Commits

Author SHA1 Message Date
Polina Bungina
c6943dc415 Implement --print option for --validate-config (#3296) 2025-02-28 14:49:11 +01:00
Alexander Kukushkin
36011e936a Update config files on SIGHUP (#3299)
Currently Patroni replaces config files only if it detects a change in global configuration + patroni.yaml. However, config files on the filesystem could have been edited by hand, and in that case we want to "restore" them.
2025-02-28 11:42:45 +01:00
Alexander Kukushkin
a316105412 Fix bug with priority failover (#3297)
We should ignore the former leader with higher priority when it reports the same LSN as the current node.

This bug could be a contributing factor to issues described in #3295


In addition to that, mock the socket.getaddrinfo() call in test_api.py to avoid hitting DNS servers.
2025-02-28 09:48:16 +01:00
Garaz08
92c4f9fbb5 Solve a couple of Flaky unit tests (#3294) 2025-02-25 15:39:46 +01:00
Sophia Ruan
1c5d9f5653 fix typo: update recovery_target_timeline to recovery_target_action (#3292)
In the Bootstrap doc description, there is a typo recovery_target_timeline in recovery_conf block, which should be recovery_target_action.
2025-02-24 13:52:32 +01:00
Polina Bungina
66cf21767d Release v4.0.5 (#3286)
- Increase version
- Add RNs
- Update year in the copyright
2025-02-20 16:29:23 +01:00
Alexander Kukushkin
b573bd4c9d Compatibility with python 3.6 (#3287)
time.time_ns() is not available
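
A minimal sketch of one way to cope with the missing function, assuming a fallback to `time.time()` is acceptable. The helper name `time_ns` is hypothetical and is not necessarily how the commit solves it:

```python
import time


def time_ns() -> int:
    """Return the current time in nanoseconds, also on Python 3.6."""
    if hasattr(time, "time_ns"):
        # Available starting from Python 3.7.
        return time.time_ns()
    # Fallback for older interpreters: lower precision, because the value
    # passes through a float before being scaled to nanoseconds.
    return int(time.time() * 1_000_000_000)
```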
2025-02-20 15:18:52 +01:00
Polina Bungina
33600976b1 Re-apply "Enable behave tests with Citus 13 and PostgreSQL 17" (#3285)
This reverts commit 3d932e1e73.
2025-02-20 11:58:58 +01:00
Alexander Kukushkin
e9ba775959 Fix a couple of bugs in quorum state machine (#3278)
1. when evaluating whether there are healthy nodes for a leader race before demoting we need to take into account quorum requirements. Without it the former leader may end up in recovery surrounded by asynchronous nodes.
2. QuorumStateResolver wasn't correctly handling the case when the replica node quickly joined and disconnected, which resulted in the following errors:
```
  File "/home/akukushkin/git/patroni/patroni/quorum.py", line 427, in _generate_transitions
    yield from self.__remove_gone_nodes()
  File "/home/akukushkin/git/patroni/patroni/quorum.py", line 327, in __remove_gone_nodes
    yield from self.sync_update(numsync, sync)
  File "/home/akukushkin/git/patroni/patroni/quorum.py", line 227, in sync_update
    raise QuorumError(f'Sync {numsync} > N of ({sync})')
patroni.quorum.QuorumError: Sync 2 > N of ({'postgresql2'})
2025-02-14 10:18:07,058 INFO: Unexpected exception raised, please report it as a BUG

  File "/home/akukushkin/git/patroni/patroni/quorum.py", line 246, in __iter__
    transitions = list(self._generate_transitions())
  File "/home/akukushkin/git/patroni/patroni/quorum.py", line 423, in _generate_transitions
    yield from self.__handle_non_steady_cases()
  File "/home/akukushkin/git/patroni/patroni/quorum.py", line 281, in __handle_non_steady_cases
    yield from self.quorum_update(len(voters) - self.numsync, voters)
  File "/home/akukushkin/git/patroni/patroni/quorum.py", line 184, in quorum_update
    raise QuorumError(f'Quorum {quorum} < 0 of ({voters})')
patroni.quorum.QuorumError: Quorum -1 < 0 of ({'postgresql1'})
2025-02-18 15:50:48,243 INFO: Unexpected exception raised, please report it as a BUG
```
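
The two errors above correspond to invariants that can be sketched as follows. This is an illustration inferred from the error messages only: `QuorumError` here is a stand-in class, `check_sync` and `check_quorum` are hypothetical names, and the real checks in patroni/quorum.py may cover more cases.

```python
from typing import Set


class QuorumError(Exception):
    """Stand-in for patroni.quorum.QuorumError (assumption)."""


def check_sync(numsync: int, sync: Set[str]) -> None:
    # We cannot require more synchronous nodes than there are candidates.
    if numsync > len(sync):
        raise QuorumError(f'Sync {numsync} > N of ({sync})')


def check_quorum(quorum: int, voters: Set[str]) -> None:
    # A negative quorum means numsync exceeded the number of voters.
    if quorum < 0:
        raise QuorumError(f'Quorum {quorum} < 0 of ({voters})')
```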
2025-02-20 11:00:22 +01:00
Alexander Kukushkin
cf427e8b0b Bump pyright to 1.1.394 (#3283) 2025-02-19 17:04:19 +01:00
Polina Bungina
7531d41587 Pin sphinx to <8.2.0 (#3284) 2025-02-19 16:34:00 +01:00
Polina Bungina
5dbfc9401b Implement kubernetes.bootstrap_labels (#3257)
Allow defining labels that will be assigned to a Postgres instance pod when it is in the 'initializing new cluster', 'running custom bootstrap script', 'starting after custom bootstrap', or 'creating replica' state.
2025-02-18 09:37:22 +01:00
Alexander Kukushkin
ce79152088 Take advantage of written_lsn and latest_end_lsn from pg_stat_wal_receiver (#3268)
The first one is available starting from PostgreSQL v13 and contains the
real write LSN. We will prefer it over value returned by
pg_last_wal_receive_lsn(), which is in fact flush LSN.

The second one is available starting from PostgreSQL v9.6 and points to
WAL flush on the source host. In case of a primary, it allows us to better
calculate the replay lag, because values stored in DCS are updated only
every loop_wait seconds.
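
The preference for written_lsn can be sketched roughly like this. The function names are hypothetical, and the inputs are assumed to be textual LSNs as returned by PostgreSQL, or None when a value is unavailable:

```python
from typing import Optional


def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '0/3000060' into a byte position."""
    high, low = lsn.split('/')
    return (int(high, 16) << 32) + int(low, 16)


def receive_lsn(written_lsn: Optional[str], flushed_lsn: Optional[str]) -> Optional[int]:
    """Prefer written_lsn (PostgreSQL v13+), otherwise fall back to the flush LSN."""
    chosen = written_lsn if written_lsn is not None else flushed_lsn
    return parse_lsn(chosen) if chosen is not None else None
```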
2025-02-17 15:06:36 +01:00
Alexander Kukushkin
6920b3af0e Cleanup after unit tests (#3277)
Close https://github.com/patroni/patroni/issues/3276
2025-02-14 13:29:34 +01:00
Alexander Kukushkin
0d87270897 Don't touch logical failover slots (#3245)
If a logical replication slot is created with the failover => true option, we
get the respective field set to true in the `pg_replication_slots` view.

By avoiding interacting with such slots we make logical failover slots
feature fully functional in PG17.
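
As an illustration only, skipping such slots could look like the sketch below. The rows are assumed to be dicts modelled after `pg_replication_slots`, and `slots_to_manage` is a hypothetical name, not a Patroni function:

```python
from typing import Any, Dict, List


def slots_to_manage(slots: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return only the slots that do not have the failover flag set."""
    # Older PostgreSQL versions have no failover column, so a missing key
    # is treated the same as False.
    return [slot for slot in slots if not slot.get('failover', False)]
```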
2025-02-14 08:35:37 +01:00
Alexander Kukushkin
1a31ea6e20 Compatibility with latest changes in urlparse (#3275)
It doesn't accept multiple hosts with [] characters in the URL anymore.
To mitigate the problem we switch to native wrappers of the
PQconninfoParse() function from libpq when possible and use our own
implementation only when psycopg2 is too old.
2025-02-13 16:07:51 +01:00
Alexander Kukushkin
8de904e556 Improve replication_state=streaming check in behave (#3269)
it was somewhat flaky
2025-02-10 11:04:58 +01:00
Michael Morris
c97ad83396 Add configuration option to suppress duplicate heartbeat logs (#3252)
Close #3251
2025-02-04 16:25:08 +01:00
Alexander Kukushkin
0bb12473fb Fix bug with slot for former leader not retained on failover (#3261)
The problem existed because the _build_retain_slots() method wrongly relied on members being present in DCS, while on failover the member key for the former leader expires at exactly the same time.
2025-02-04 13:39:19 +01:00
Alexander Kukushkin
302757b71a Handle all exceptions raised by subprocess in controldata() method (#3267)
Close #3264
2025-02-04 13:38:59 +01:00
Polina Bungina
3d932e1e73 Temp revert of "Enable behave tests with Citus 13 and PostgreSQL 17" (#3265)
but keep timeout increase
2025-02-03 08:44:02 +01:00
Alexander Kukushkin
38aef484e8 Fix a few little issues with 9.5 support (#3260)
1. pg_rewind error log format wasn't verbose
2. it doesn't support specifying num in synchronous_standby_names
2025-01-31 16:46:07 +01:00
Alexander Kukushkin
34b2a77294 Fix race condition in priority sync behave tests (#3263)
don't try patching /config key before leader managed to create it.
2025-01-31 16:45:26 +01:00
Alexander Kukushkin
6caa2fa99c Enable behave tests with Citus 13 and PostgreSQL 17 (#3262)
Also increase timeout from 15m to 20m
2025-01-31 16:44:32 +01:00
Joe Jensen
b4eab48971 Fall through to default behavior when pyinstall toc is not found (#3256)
Close #3255
2025-01-31 10:14:27 +01:00
Alexander Kukushkin
2bc25a32e4 Avoid dropping physical slots too early (#3244)
Consider a situation: there is a permanent logical slot, and the primary and replica are temporarily down.
When Patroni is started on the former primary, it starts Postgres in standby mode, which leads to removal of the physical replication slot for the replica because it has xmin.

We should postpone removal of such physical slots:
- on replica until there is a leader in the cluster
- on primary until Postgres is promoted
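
The two rules above can be read as the following decision sketch. The function and parameter names are hypothetical and only illustrate the described behavior, not the actual implementation:

```python
def may_drop_physical_slot(is_primary: bool, promoted: bool, cluster_has_leader: bool) -> bool:
    """Decide whether removal of a physical slot may proceed now."""
    if is_primary:
        # On the primary, wait until Postgres has actually been promoted.
        return promoted
    # On a replica, wait until there is a leader in the cluster.
    return cluster_has_leader
```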
2025-01-30 13:08:30 +01:00
Alexander Kukushkin
7db7dfd3c5 Compatibility with python 3.13 (#3246)
- fix unit tests (logging now uses time.time_ns() instead of time.time())
- update setup.py
- update tox.ini
- enable unix and behave tests with 3.13

Close https://github.com/patroni/patroni/issues/3243
2025-01-20 08:58:12 +01:00
Antoni Mur
3938bb9a16 Replace forward slash in cluster_name (#3247) 2025-01-20 08:57:48 +01:00
Julian
26ae38960a Improve error on empty or non dict config file (#3238)
Test if config (file) parsed with yaml_load() contains a valid Mapping
object, otherwise Patroni throws an explicit exception. It also makes
the Patroni output more explicit when using that kind of "invalid"
configuration.

``` console
$ touch /tmp/patroni.yaml
$ patroni --validate-config /tmp/patroni.yaml
/tmp/patroni.yaml does not contain a dict
invalid config file /tmp/patroni.yaml
```
reportUnnecessaryIsInstance is explicitly ignored since we can't
determine what yaml_safeload can bring from a YAML config (list,
dict,...).
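
A minimal sketch of such a check, assuming the parsed YAML is passed in as a plain Python object (an empty file parses to None). `ensure_mapping` is a hypothetical helper, and ValueError stands in for whatever exception Patroni actually raises:

```python
from collections.abc import Mapping
from typing import Any


def ensure_mapping(config: Any, path: str) -> Mapping:
    """Return config unchanged if it is a Mapping, otherwise raise."""
    if not isinstance(config, Mapping):
        # Covers None (empty file), lists, scalars, and so on.
        raise ValueError(f'{path} does not contain a dict')
    return config
```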
2025-01-17 14:44:47 +01:00
Alexander Kukushkin
836e527e6d Fix deps compatibility, increase tests coverage (#3233)
* Compatibility with python-json-logger>=3.1

After refactoring the old API is still working, but producing warnings
and pyright also fails.

Besides that improve coverage of watchdog/base.py and ctl.py

* Stick to ubuntu 22.04

* Please pyright
2024-12-24 09:11:17 +01:00
Alexander Kukushkin
e73f2044c8 Cancel long-running jobs on Patroni stop (#3232)
Patroni could be doing replica bootstrap and we don't want pg_basebackup/wal-g/pgBackRest/barman or similar to keep running.

Besides that, remove data directory on replica bootstrap failure if configuration allows.

Close #3224
2024-12-12 09:52:03 +01:00
Polina Bungina
39f5de2e77 Implement sync_priority tag (#3223) 2024-12-10 14:57:47 +01:00
avandras
46e20edbc2 Show only the members to be restarted upon restart confirmation (#3226)
When doing `patronictl restart <clustername> --pending`, the confirmation lists all members, regardless of whether their restart is really pending:

```
> patronictl restart pgcluster --pending
+ Cluster: pgcluster (7436691039717365672) ----+----+-----------+-----------------+---------------------------------+
| Member | Host     | Role         | State     | TL | Lag in MB | Pending restart | Pending restart reason          |
+--------+----------+--------------+-----------+----+-----------+-----------------+---------------------------------+
| win1   | 10.0.0.2 | Sync Standby | streaming |  8 |         0 | *               | hba_file: [hidden - too long]   |
|        |          |              |           |    |           |                 | ident_file: [hidden - too long] |
|        |          |              |           |    |           |                 | max_connections: 201->202       |
+--------+----------+--------------+-----------+----+-----------+-----------------+---------------------------------+
| win2   | 10.0.0.3 | Leader       | running   |  8 |           | *               | hba_file: [hidden - too long]   |
|        |          |              |           |    |           |                 | ident_file: [hidden - too long] |
|        |          |              |           |    |           |                 | max_connections: 201->202       |
+--------+----------+--------------+-----------+----+-----------+-----------------+---------------------------------+
| win3   | 10.0.0.4 | Replica      | streaming |  8 |         0 |                 |                                 |
+--------+----------+--------------+-----------+----+-----------+-----------------+---------------------------------+
When should the restart take place (e.g. 2024-11-27T08:27)  [now]:
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2)  []:
Are you sure you want to restart members win1, win2, win3? [y/N]:
```

If we proceed with the restart despite the alarming confirmation message, which mentions all members rather than only those needing a restart, an error message reports that the node that did not need a restart was indeed not restarted:

```
Are you sure you want to restart members win1, win2, win3? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2)  []:
Success: restart on member win1
Success: restart on member win2
Failed: restart for member win3, status code=503, (restart conditions are not satisfied)
```

The misleading confirmation message can also be seen when using the `--any` flag.

The current PR is fixing this.

However, we do not apply filtering in case of scheduled pending restart, because the condition must be evaluated at the scheduled time.
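
The filtering described above could be sketched as follows. The member dicts, their keys, and the function name are assumptions made for illustration and do not reflect the actual patronictl code:

```python
from typing import Any, Dict, List


def members_to_confirm(members: List[Dict[str, Any]], pending_only: bool,
                       scheduled: bool) -> List[Dict[str, Any]]:
    """Pick the members to list in the restart confirmation prompt."""
    if pending_only and not scheduled:
        # An immediate restart can be filtered by the current pending state.
        return [m for m in members if m.get('pending_restart')]
    # For a scheduled restart the condition is evaluated at the scheduled
    # time, so no filtering is applied here.
    return list(members)
```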
2024-12-10 12:04:47 +01:00
Michael Banck
578dc39291 Add optional 'cluster_type' attribute to permanent replication slots. (#3229)
This allows setting whether a particular permanent replication slot should always be created ('cluster_type=any', the default), or just on a primary ('cluster_type=primary') or standby ('cluster_type=standby') cluster, respectively.
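
A rough sketch of how the three values could be interpreted. The function name, the dict-based slot configuration, and the ValueError for unknown values are assumptions for illustration:

```python
from typing import Any, Dict


def should_create_slot(slot_config: Dict[str, Any], is_standby_cluster: bool) -> bool:
    """Decide whether a permanent slot applies to the current cluster type."""
    cluster_type = slot_config.get('cluster_type', 'any')  # 'any' is the default
    if cluster_type == 'any':
        return True
    if cluster_type == 'primary':
        return not is_standby_cluster
    if cluster_type == 'standby':
        return is_standby_cluster
    # Rejecting unknown values is an assumption of this sketch.
    raise ValueError(f'unexpected cluster_type: {cluster_type}')
```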
2024-12-10 11:55:59 +01:00
Ants Aasma
9d1609e0eb Reduce log level of watchdog configuration failure (#3231)
When in automatic mode we probably don't need to warn the user about a failure to set up the watchdog. This is the common case and makes many users think that this feature is somehow necessary to run Patroni safely. For most users it is completely fine to run without it, and it makes sense to reduce their log spam.
2024-12-10 11:54:27 +01:00
Polina Bungina
fb0fcc859a Release v4.0.4 (#3221)
* Release v4.0.4

- Increase version
- Use latest pyright
- Add RNs
2024-11-22 14:29:59 +01:00
Alexander Kukushkin
a903438a5a Compatibility with ydiff==1.4.2 (#3216)
1. Implemented compatibility.
2. Constrained the upper version in requirements.txt to avoid future failures.
3. Set up an additional pipeline to check with the latest ydiff.

Close #3209
Close #3212
Close #3218
2024-11-19 09:27:49 +01:00
Alexander Kukushkin
19f75b407e Compatibility with prettytable>=3.12.0 (#3217)
They started showing a deprecation warning when importing the ALL and FRAME constants.
2024-11-19 09:09:09 +01:00
Alexander Kukushkin
3f00b7a6c7 Restore compatibility with python-consul2 (#3215)
It was broken in #3191
2024-11-19 09:08:50 +01:00
Kian-Meng Ang
4ce0f99cfb Fix typos (#3204)
Found via `codespell -H` and `typos --hidden --format brief`
2024-11-12 10:06:53 +01:00
Alexander Kukushkin
efba02f52e Make sure only supported parameters are written to connection string (#3207)
Close #3206
2024-11-12 09:24:30 +01:00
Alexander Kukushkin
e1faa38e90 Cache DCS instances to avoid thread leak in patronictl list -W (#3205)
Close #3202
2024-11-11 13:59:27 +01:00
bocytko
177101a1cc Fixes outdated link to Zalando's tech blog on Patroni (#3201) 2024-11-05 09:44:27 +01:00
Polina Bungina
7dcb9b9840 Run on_role_change cb after a failed primary recovery (#3198)
Additionally run on_role_change callback in post_recover() for a primary
that failed to start after a crash to increase chances the callback is executed,
even if the further start as a replica fails

---------

Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
2024-10-31 09:22:51 +01:00
Alexander Kukushkin
e8a8bfe42f Switch to py-consul (#3191)
python-consul has been unmaintained for a long time and py-consul is an official replacement.
However, we still keep backward compatibility with python-consul.

Close: #3189
2024-10-28 09:58:57 +01:00
Denis Laxalde
72be036c99 Fix defaults 'max_wal_senders' and 'max_replication_slots' in docs (#3192)
From the actual code, in patroni/postgresql/config.py::ConfigHandler.CMDLINE_OPTIONS,
the previous defaults were wrong.
2024-10-25 11:18:45 +02:00
Polina Bungina
969d7ec4ab Increase version, add RNs (#3188) 2024-10-18 13:42:42 +02:00
Polina Bungina
75ff8b3256 Add documentation for sslnegotiation option (#3185) 2024-10-18 09:27:19 +02:00
Alexander Kukushkin
4853b3b430 Pyright 1.1.385 (#3182)
Declaring variables with `Union` and using the `isinstance()` hack doesn't work anymore. Therefore the code is updated to use `Any` for the variable and the `cast` function after figuring out the correct type, in order to avoid getting errors about `Unknown` types.
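
The pattern can be illustrated with a small sketch. The function `listen_ports` is a made-up example and is not taken from the Patroni code base:

```python
from typing import Any, List, cast


def listen_ports(raw: Any) -> List[int]:
    """Accept either a single port or a list of ports."""
    if isinstance(raw, list):
        # After the runtime check, cast() tells the type checker the exact
        # type so that the elements are not reported as Unknown.
        ports = cast(List[Any], raw)
        return [int(port) for port in ports]
    return [int(raw)]
```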
2024-10-18 09:24:51 +02:00
Polina Bungina
ba970d8c63 Temporary pin psycopg2-binary version for macOS (#3186) 2024-10-18 08:44:28 +02:00