285 Commits

Author SHA1 Message Date
Kian-Meng Ang
4ce0f99cfb Fix typos (#3204)
Found via `codespell -H` and `typos --hidden --format brief`
2024-11-12 10:06:53 +01:00
Alexander Kukushkin
efba02f52e Make sure only supported parameters are written to connection string (#3207)
Close #3206
2024-11-12 09:24:30 +01:00
Polina Bungina
ff278705d6 Partially revert patroni@8c5ab4c (#3180)
Still check against `postgres --describe-config` if a GUC does not have
a validator but is a valid postgres GUC
2024-10-16 11:13:25 +02:00
Alexander Kukushkin
e91e6b5484 Add support of sslnegotiation client-side connection option (#3173)
It is available in PostgreSQL 17

Besides that, enable PG17 in behave tests and include PG17 to supported versions in docs.
2024-09-27 11:27:09 +02:00
Alexander Kukushkin
bfa9b0ca4b Fix flake8 for tests directory (#3168)
Followup on #3123
2024-09-16 17:20:00 +02:00
Alexander Kukushkin
2f800173a5 Handle exception from iterdir while discovering static files (#3152)
Close https://github.com/patroni/patroni/issues/3151
2024-09-09 15:03:20 +02:00
Alexander Kukushkin
835d93951d Add line with localhost to pgpass when unix sockets are detected (#3139)
There are two cases when libpq may search for "localhost":
1. When host in the connection string is not specified and it is using default socket directory path.
2. When specified host matches default socket directory path.

Since we don't know the value of the default socket directory path and effectively can't detect case 2, the best strategy to mitigate the problem is to add "localhost" whenever we detect that "host" is a unix socket directory (it starts with the '/' character).

Close #3134
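
A minimal sketch of the mitigation described in this commit message. The helper name `pgpass_lines()` and its signature are hypothetical and are not taken from Patroni's code; only the rule "host starts with '/' means add a localhost line" comes from the message itself.

```python
def pgpass_lines(host, port, user, password):
    """Build pgpass entries (hypothetical helper, not Patroni's actual code)."""
    def escape(value):
        # Inside a pgpass field, '\' and ':' must be backslash-escaped.
        return str(value).replace('\\', '\\\\').replace(':', '\\:')

    hosts = [host]
    # A host starting with '/' is a unix socket directory; libpq may search
    # for "localhost" in that case, so an extra line is added for it.
    if host.startswith('/'):
        hosts.append('localhost')
    return [':'.join(escape(v) for v in (h, port, '*', user, password))
            for h in hosts]
```

For a TCP host the function returns a single line; for a socket directory it returns two lines that differ only in the host field.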
2024-08-27 13:39:03 +02:00
Polina Bungina
8c5ab4c07d Improve GUCs validation (#3130)
Due to postgres --describe-config not showing GUCs defined as GUC_NO_SHOW_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE, Patroni was always ignoring some GUCs that a user might want to have configured with non-default values.

- remove postgres --describe-config validation.
- define minor versions for availability bounds of some back-patched GUCs
2024-08-23 14:20:16 +02:00
Alexander Kukushkin
93eb4edbe6 Reformat imports with isort (#3123)
Besides that:
1. Introduce `setup.py isort` for quick check
2. Introduce GH actions to check imports
2024-08-13 17:53:59 +02:00
Alexandre Detiste
dc7ba3fe15 drop dependency on ancient mock (#3074)
2024-06-12 10:47:18 +02:00
Alexander Kukushkin
1ed207cbf0 Compatibility with 17-beta1 (#3076)
- updated list of GUCs
- updated regex for filtering backend processes by name
- `primary_conninfo` will contain `dbname` parameter

The last one is required for synchronizing logical replication slots by slotsync worker and doesn't create problems on older versions.
2024-06-12 10:29:52 +02:00
Alexander Kukushkin
d7454f7bcd Use target_session_attrs only when multiple hosts in standby_cluster (#3040)
The comment in the code already said that, but in practice it didn't happen.

It should help with #3039
2024-04-02 11:59:57 +02:00
Waynerv
ceb2965ab8 Use importlib_resources to read validators file (#3018)
When packaged into pyz (zip file), resources are not directly available on filesystem and therefore we can't always rely on os.listdir() and open() to enumerate and read them.

We are going to use `importlib.resources` to solve this problem, except on python 3.8 and older, where the function needed to enumerate resources (`files()`) is not available. For these legacy python versions (3.8 becomes EOL in October 2024) we are going to use os.listdir() as a fallback.

Close #3017
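
A rough sketch of the approach described above, assuming a hypothetical helper `list_yaml_resources()`; the function name, its parameters, and the YAML suffix filter are illustrative and not taken from the commit.

```python
import os
import sys


def list_yaml_resources(package, subdir):
    """Return sorted names of YAML files in ``subdir`` of ``package``.

    Hypothetical helper, shown only to illustrate the fallback logic.
    """
    if sys.version_info >= (3, 9):
        # files() also works when the package lives inside a zip (pyz).
        from importlib.resources import files
        entries = files(package).joinpath(subdir).iterdir()
        names = [entry.name for entry in entries]
    else:
        # Legacy fallback: only works if the package is on the filesystem.
        import importlib
        base = os.path.dirname(importlib.import_module(package).__file__)
        names = os.listdir(os.path.join(base, subdir))
    return sorted(n for n in names if n.endswith(('.yml', '.yaml')))
```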
2024-04-02 09:00:37 +02:00
Polina Bungina
bdd02324b4 Add pending restart reason information (#2978)
Provide info about the PG parameters that caused "pending restart"
flag to be set. Both `patronictl list` and `/patroni` REST API endpoint
now show the parameter names and the diff as the "pending restart
reason".
2024-02-14 08:54:20 +01:00
Polina Bungina
f6943a859d Improve logging for Pg param change (#3008)
* Convert old value to a human-readable format
* Add log line about pg_controldata/global config mismatch that causes
  pending restart flag to be set
2024-01-29 10:44:25 +01:00
Polina Bungina
266cdc4810 Fixes around pending_restart flag (#3003)
* Do not set pending_restart flag if hot_standby is set to 'off' during a custom bootstrap (even though we will have this flag actually set in PG, this configuration parameter is irrelevant on primary and there is no actual need for restart)
* Skip hot_standby and wal_log_hints when querying parameters pending restart on config reload. They actually can be changed manually (e.g. via ALTER SYSTEM) and it will cause the pending_restart state in PG but Patroni anyway always passes those params to postmaster as command line options. And there they only can have one value - 'on' (except on primary when performing custom bootstrap)
2024-01-16 10:32:28 +01:00
Alexander Kukushkin
5d8c2fb559 Restore recovery GUCs when joining running standby (#2998)
Close https://github.com/zalando/patroni/issues/2993
2024-01-08 08:35:53 +01:00
Polina Bungina
efdedc7049 Reload postgres config if a server param was reset (#2975)
Fix the case when a parameter value was changed and then reset back to
the initial value without restart - before this fix, the second change
was not reflected in the Postgres config.
This commit also includes the related unit test refactoring.
2023-12-06 15:57:05 +01:00
Alexander Kukushkin
193c73f6b8 Make GlobalConfig really global (#2935)
1. extract `GlobalConfig` class to its own module
2. make the module instantiate the `GlobalConfig` object on load and replace its own entry in sys.modules with this instance
3. don't pass `GlobalConfig` object around, but use `patroni.global_config` module everywhere.
4. move `ignore_slots_matchers`, `max_timelines_history`,  and `permanent_slots` from `ClusterConfig` to `GlobalConfig`.
5. add `use_slots` property to global_config and remove duplicated code from `Cluster` and `Postgresql.ConfigHandler`.

Besides that, improve readability of a couple of checks in ha.py and formatting of the `/config` key when saved from patronictl.
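
The sys.modules technique from point 2 can be sketched as follows. This is a standalone illustration under stated assumptions: the class body, the `update()` method, the `is_paused` property, and the module name `demo_global_config` are all made up for the example and do not reproduce Patroni's implementation.

```python
import sys
import types


class GlobalConfig(types.ModuleType):
    """Illustrative stand-in for a config object that doubles as a module."""

    def __init__(self, name):
        super().__init__(name)
        self._config = {}

    def update(self, config):
        self._config = dict(config)

    @property
    def is_paused(self):
        return bool(self._config.get('pause'))


# In a real module this assignment would sit at the bottom of the module file
# and use __name__; here a made-up name is registered so the sketch runs alone.
sys.modules['demo_global_config'] = GlobalConfig('demo_global_config')

import demo_global_config  # resolves to the instance stored in sys.modules

demo_global_config.update({'pause': True})
print(demo_global_config.is_paused)  # True
```

Because `import` consults sys.modules first, every importer receives the same instance, so the object no longer needs to be passed around explicitly.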
2023-11-24 09:26:05 +01:00
Alexander Kukushkin
552e8643d9 Verify that replica nodes received checkpoint LSN on shutdown (#2939)
If archiving is enabled, the `Postgresql.latest_checkpoint_location()` method returns the LSN of the previous (SWITCH) record, which points to the beginning of the WAL file. This makes it possible to safely promote a replica which recovers WAL files from the archive and wasn't streaming when the primary was stopped (the primary doesn't archive this WAL file).

But, in certain cases, using the LSN pointing to the SWITCH record was causing an unnecessary pg_rewind if the replica didn't manage to replay the shutdown checkpoint record before it was promoted.

In order to mitigate the problem we need to check that replica received/replayed exactly the shutdown checkpoint LSN. But, at the same time we will still write LSN of the SWITCH record to the `/status` key when releasing the leader lock.
2023-11-07 11:05:54 +01:00
Alexander Kukushkin
4c1c804cfd Read GUC's values when joining running Postgres (#2876)
If restarted in pause, Patroni was discarding `synchronous_standby_names` from `postgresql.conf` because in the internal cache this value was set to `None`. As a result, synchronous replication transitioned to a broken state, with no synchronous replicas according to `synchronous_standby_names` and Patroni not selecting/setting the new synchronous replicas (another bug).

To solve the problem of broken initial state and to avoid similar issues with other GUC's we will read GUC's value if Patroni is joining running Postgres.
2023-09-26 10:40:51 +02:00
Polina Bungina
71863cedcb Always store CMDLINE_OPTIONS config values as int (#2861)
2023-09-14 18:34:45 +02:00
Alexander Kukushkin
30f0f132e8 Don't start stopped postgres in pause (#2848)
Due to a race condition Patroni was falsely assuming that the standby should be restarted because some recovery parameters (primary_conninfo or similar) were changed.

Close https://github.com/zalando/patroni/issues/2834
2023-09-06 08:57:56 +02:00
Alexander Kukushkin
89d794facc Introduce connection pool (#2829)
Make it hold connection kwargs for local connections and all `NamedConnection` objects use them automatically.

Also get rid of redundant `ConfigHandler.local_connect_kwargs`.

On top of that we will introduce a dedicated connection for the REST API thread.
2023-08-24 16:13:22 +02:00
Alexander Kukushkin
366829e379 Refactor Connection class (#2815)
1. stop using the same cursor all the time, it creates problems when not carefully used from different threads.
2. introduce query() method in the Connection class and make it return a result set when it is possible.
3. refactor most of the code that is relying (directly or indirectly) on the Connection object to use the query() method as much as possible.

This refactoring reduces code complexity and will help with the future introduction of a separate database connection for the REST API thread. That connection will improve reliability when the system is under significant stress, simple monitoring queries take seconds to execute, and the REST API starts blocking the main thread.
2023-08-17 15:42:11 +02:00
Alexander Kukushkin
6a75b1591b Use pg_current_wal_flush_lsn() starting from 9.6 (#2813)
Due to historical reasons (not available before 9.6) we used `pg_current_wal_lsn()`/`pg_current_xlog_location()` functions to get current WAL LSN on the primary. But, this LSN is not necessarily synced to disk, and could be lost if the primary node crashed.
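
A small sketch of version-dependent function selection, assuming a hypothetical helper that receives `server_version_num` as an integer. The helper is not Patroni's code; the function names are the PostgreSQL ones (the `xlog` spellings were renamed to `wal` in PostgreSQL 10).

```python
def current_wal_lsn_function(server_version_num):
    """Pick the SQL function that reports the current WAL position on a primary.

    Hypothetical helper for illustration only.
    """
    if server_version_num >= 100000:
        return 'pg_current_wal_flush_lsn'        # flushed to disk, v10+
    if server_version_num >= 90600:
        return 'pg_current_xlog_flush_location'  # flushed to disk, 9.6 naming
    return 'pg_current_xlog_location'            # written, not necessarily flushed
```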
2023-08-15 09:01:37 +02:00
Alexander Kukushkin
efaba9f183 Rename Postgresql.is_leader() to is_primary() (#2809)
It'll help to avoid confusion with the Ha.is_leader() method.
2023-08-09 14:47:53 +02:00
Alexander Kukushkin
84aac437c1 Release v3.1.0 (#2801)
- bump pyright and resolve reported issues
- bump Patroni version
- update release notes
2023-08-03 13:02:29 +02:00
Alexander Kukushkin
01d07f86cd Set permissions for files and directories created in PGDATA (#2781)
Postgres supports two types of permissions:
1. owner only
2. group readable

By default the first one is used because it provides better security. But, sometimes people want to run a backup tool with the user that is different from postgres. In this case the second option becomes very useful. Unfortunately it didn't work correctly because Patroni was creating files with owner access only permissions.

This PR changes the behavior: permissions on files and directories created by Patroni will be calculated based on the permissions of PGDATA. I.e., they will get group readable access when it is necessary.

Close #1899
Close #1901
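
A minimal sketch of deriving permissions from PGDATA, assuming POSIX permissions and two hypothetical helpers, `modes_for_pgdata()` and `create_file()`. The concrete modes (0o750/0o640 for group access, 0o700/0o600 otherwise) follow PostgreSQL's data directory conventions and are an assumption about the result, not a copy of Patroni's code.

```python
import os
import stat


def modes_for_pgdata(data_dir):
    """Return (dir_mode, file_mode) derived from the data directory's mode."""
    pgdata_mode = stat.S_IMODE(os.stat(data_dir).st_mode)
    if pgdata_mode & stat.S_IRGRP:
        return 0o750, 0o640  # group readable
    return 0o700, 0o600      # owner only


def create_file(data_dir, name, content=''):
    """Create a file inside data_dir with the mode matching PGDATA."""
    _, file_mode = modes_for_pgdata(data_dir)
    path = os.path.join(data_dir, name)
    with open(path, 'w') as f:
        f.write(content)
    # Explicit chmod, so the result does not depend on the process umask.
    os.chmod(path, file_mode)
    return path
```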
2023-08-02 13:15:43 +02:00
Alexander Kukushkin
7e89583ec7 Please new flake8 (#2789)
it stopped liking lack of space character between `,` and `\`
```python
foo,\
    bar
```
2023-07-31 09:08:46 +02:00
Israel
df18885f20 Extend Postgres GUCs validator (#2671)
* Use YAML files to validate Postgres GUCs through Patroni.

Patroni used to have a static list of Postgres GUCs validators in
`patroni.postgresql.validator`.

One problem with that approach, for example, is that it would not
allow GUCs from custom Postgres builds to be validated/accepted.

The idea that we had to work around that issue was to move the
validators from the source code to an external and extendable source.
With that Patroni will start reading the current validators from that
external source plus whatever custom validators are found.

From this commit onwards Patroni will read and parse all YAML files
that are found under the `patroni/postgresql/available_parameters`
directory to build its Postgres GUCs validation rules.

All the details about how this works can be found in the docstring
of the introduced function `_load_postgres_gucs_validators`.
2023-05-31 13:54:54 +02:00
Alexander Kukushkin
66a0e44371 Enable pyright job for every commit (#2675)
And fix remaining issues so that the job doesn't fail.
2023-05-15 11:38:40 +02:00
Alexander Kukushkin
76b3b99de2 Enable pyright strict mode (#2652)
- added pyrightconfig.json with typeCheckingMode=strict
- added type hints to all files except api.py
- added type stubs for dns, etcd, consul, kazoo, pysyncobj and other modules
- added type stubs for psycopg2 and urllib3 with some little fixes
- fixed most of the issues reported by pyright
- remaining issues will be addressed later, along with enabling CI linting task
2023-05-09 09:38:00 +02:00
Le Duane
bebe6754fc Add before stop hook (#2642)
The cases we have in mind are:
* In spite of following all best practices client-side, logical replication connections can sometimes hang the Postgres shutdown sequence. We'd like to sigterm any misbehaving logical replication connections which remain after x seconds. These will inevitably get killed anyway on master stop timeout.
* remove "role=master" label on current primary when not using k8s as DCS. Waiting until after Postgres fully stops can sometimes be too long for this.
* Pause pgbouncer connections before switchover

Close #2596
2023-04-27 13:07:32 +02:00
Alexander Kukushkin
2c7b547a29 Introduce patroni.collections (#2629)
For now it implements:
- CaseInsensitiveDict()
- CaseInsensitiveSet()

Update `patroni.postgresql.sync.parse_sync_standby_names()` to use `CaseInsensitiveSet()` instead of `CaseInsensitiveDict()`
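
To illustrate what a case-insensitive set provides, here is a simplified sketch built on `collections.abc.MutableSet`. It shares only the class name with the one mentioned above; the internals are an assumption and may differ from `patroni.collections`.

```python
from collections.abc import MutableSet


class CaseInsensitiveSet(MutableSet):
    """Set of strings compared case-insensitively, keeping the last-added spelling."""

    def __init__(self, values=()):
        self._values = {}  # lowercased value -> value as last added
        for value in values:
            self.add(value)

    def __contains__(self, value):
        return isinstance(value, str) and value.lower() in self._values

    def __iter__(self):
        return iter(self._values.values())

    def __len__(self):
        return len(self._values)

    def add(self, value):
        self._values[value.lower()] = value

    def discard(self, value):
        self._values.pop(value.lower(), None)
```

A set fits standby names better than a dict here because only membership matters: `'NODE1' in CaseInsensitiveSet(['node1'])` evaluates to `True`.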
2023-04-03 11:19:08 +02:00
Polina Bungina
3fe2a7868a Ignore D401 in flake8-docstrings (#2627)
* Ignore D401 in flake8-docstrings
* Fix newly reported flake8 issues, ignore the old W503 rule
* rely on concatenation of adjacent strings
* Format behave scripts
* Reformat ha.py according to new rules

Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
2023-04-03 09:52:22 +02:00
Alexander Kukushkin
6f357a4e17 Factor out global configuration into a dedicated class (#2628)
It will help to avoid code duplications.
2023-04-03 08:09:29 +02:00
Alexander Kukushkin
c1bfb0e6d6 Remove python 2.7 support (#2571)
- get rid of 2.7 specific modules: `six`, `ipaddress`
- use Python3 unpacking operator
- use `shutil.which()` instead of `find_executable()`
2023-03-13 17:00:04 +01:00
Alexander Kukushkin
2afcaa9d83 Don't write to PGDATA if major version is not known (#2583)
It could happen that Patroni is started up before PGDATA was mounted. In this case Patroni can't determine the major Postgres version from the PG_VERSION file. Later, when PGDATA is mounted, Patroni was trying to create the recovery.conf even if the actual Postgres major version is newer than 12.

To mitigate the problem we double check that the `Postgresql._major_version` is set before writing recovery configuration or starting postgres up.

Close https://github.com/zalando/patroni/issues/2434
2023-03-06 16:33:32 +01:00
Alexander Kukushkin
09d0d78b74 Don't allow on_reload callback kill other callbacks (#2578)
For a long time Patroni has enforced that only one callback script runs at a time. If a new callback is executed while the old one is still running, the old one is killed (including all child processes).

Such behavior is fine for all callbacks but on_reload, because the latter may accidentally cancel important ones, for example those updating DNS or assigning/removing a Virtual IP.

To mitigate the problem we introduce a dedicated executor for on_reload callbacks, so that on_reload may only cancel another on_reload.

Ref: https://github.com/zalando/patroni/issues/2445
2023-03-06 16:33:03 +01:00
Alexander Kukushkin
c985974ece Set hot_standby=off only if recovery_target_action=promote (#2570)
During custom bootstrap the `hot_standby` is set to off to protect postgres from panicking and shutting down when some parameters like `max_connections` are increased on the primary.

According to the [documentation](https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-RECOVERY-TARGET-ACTION), `hot_standby` set to `off` affects behavior of the `recovery_target_action`, and `pause` starts acting as the `shutdown`:
> If [hot_standby](https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-HOT-STANDBY) is not enabled, a setting of pause will act the same as shutdown

This is not what users expect/need, because normally they resolve pause state on their own.

To solve the problem we will set `hot_standby` to `off` during custom bootstrap only if `recovery_target_action` is set to 'promote'.

Close https://github.com/zalando/patroni/issues/2569
2023-02-28 10:08:42 +01:00
Alexander Kukushkin
4c3af2d1a0 Change master->primary/leader/member (#2541)
keep as much backward compatibility as possible.

Following changes were made:
1. All internal checks are performed as `role in ('master', 'primary')`
2. All internal variables/functions/methods are renamed
3. `GET /metrics` endpoint returns `patroni_primary` in addition to `patroni_master`.
4. Logs are changed to use leader/primary/member/remote depending on the context
5. Unit-tests are using only role = 'primary' instead of 'master' to verify that 1 works.
6. patronictl still supports old syntax, but also accepts `--leader` and `--primary`.
7. `master_(start|stop)_timeout` is automatically translated to `primary_(start|stop)_timeout` if the last one is not set.
8. updated the documentation and some examples

Future plan: in the next major release switch role name from `master` to `primary` and maybe drop `master` altogether.
The Kubernetes implementation will require more work and keep two labels in parallel. Label values should probably be configurable as described in https://github.com/zalando/patroni/issues/2495.
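
Point 7 above can be illustrated with a short sketch. The function `translate_timeouts()` is hypothetical; whether the legacy key is kept or dropped afterwards is not stated in the message, so this version simply keeps it.

```python
def translate_timeouts(config):
    """Copy legacy master_*_timeout values into missing primary_*_timeout keys."""
    result = dict(config)
    for name in ('start_timeout', 'stop_timeout'):
        legacy, current = 'master_' + name, 'primary_' + name
        # Only translate when the new-style key is not set explicitly.
        if legacy in result and current not in result:
            result[current] = result[legacy]
    return result
```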
2023-01-27 07:40:24 +01:00
Alexander Kukushkin
3161f31088 Enhanced sync connections check (#2524)
When the `synchronous_standby_names` GUC is changed, PostgreSQL nearly immediately starts reporting the corresponding walsenders as synchronous, while in fact they may not have reached this state yet. To mitigate this problem we memorize the current flush LSN on the primary right after the change of `synchronous_standby_names` becomes visible and use it as an additional check for walsenders.
The walsender will be counted as truly "sync" only when write/flush/replay_lsn on it reached memorized LSN and the `application_name` is known to be a part of `synchronous_standby_names`.

The size of the PR is mostly due to refactoring and moving the code responsible for working with `synchronous_standby_names` and `pg_stat_replication` to a dedicated file.
And `parse_sync_standby_names()` function was mostly copied from #672.
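
The enhanced check can be sketched roughly as below. The helpers `parse_lsn()` and `truly_sync()` and the row layout (dicts with `pg_stat_replication` column names) are assumptions made for illustration; which LSN column is compared is left as a parameter because the message mentions write/flush/replay_lsn.

```python
def parse_lsn(lsn):
    """Convert a textual LSN such as '0/3000060' into an integer."""
    high, low = lsn.split('/')
    return (int(high, 16) << 32) + int(low, 16)


def truly_sync(walsenders, sync_names, memorized_lsn, lsn_column='flush_lsn'):
    """Return application names that can be trusted as synchronous.

    walsenders: rows resembling pg_stat_replication, given as dicts.
    sync_names: names that are part of synchronous_standby_names.
    memorized_lsn: integer LSN remembered right after the GUC change.
    """
    names = {name.lower() for name in sync_names}
    return [
        row['application_name'] for row in walsenders
        if row['sync_state'] == 'sync'
        and row['application_name'].lower() in names
        and parse_lsn(row[lsn_column]) >= memorized_lsn
    ]
```

A walsender that PostgreSQL already reports as `sync` but whose LSN is still behind the memorized position is left out of the result until it catches up.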
2023-01-24 15:05:54 +01:00
William Albertus Dembo
f06d432dab Keep only latest failed data directory (#2471)
Use constant postfix when moving data directory due to failure so it only keeps data from the latest failure.
2023-01-19 21:47:41 +01:00
Alexander Kukushkin
c12fe4146d Run only one query per HA loop (#2516)
If the cluster is stable (no nodes are joining/leaving/lagging) we want to run at most one monitor query per HA loop. So far it worked perfectly except when synchronous_mode is enabled, where we run two additional queries:
1. SHOW synchronous_standby_names
2. SELECT ... FROM pg_stat_replication

In order to solve it, we will include these "queries" in the common monitoring query if synchronous_mode is enabled.

In addition to that make sure that `synchronous_standby_names` is reset on replicas that used to be a primary and avoid using replicas which are not in the 'running' state.

P.S.: in the monitoring query we also extract the current value of synchronous_standby_names, because it will be useful for the quorum commit feature.

Close https://github.com/zalando/patroni/issues/2469
2023-01-10 10:44:17 +01:00
Alexander Kukushkin
92d3e1c167 Introduce the failsafe key in DCS (#2485)
Extracted from #2379
2022-12-13 11:35:06 +01:00
Alexander Kukushkin
580530b30f Behave tests on Windows (#2432)
Windows doesn't support `SIGTERM`, but our behave tests in the majority of cases rely on Patroni graceful shutdown.
In order to emulate the behaviour we introduced the new REST API endpoint `POST /sigterm`. The endpoint works only on Windows and when `BEHAVE_DEBUG` environment variable is set.
Besides that some minor adjustments in behave tests were done. Mainly related to backslash-slash handling.

In addition to that, improve test coverage on Windows by properly mocking access to the filesystem and avoiding calling `subprocess.call()`. Specifically, symlink creation on Windows requires Admin privileges and there is no `true.exe`.
2022-10-21 12:24:24 +02:00
Alexander Kukushkin
5b1fd23776 Always return checkpoint location as integer (#2349)
before it was also returning a str in some cases
2022-06-30 10:52:28 +02:00
Alexander Kukushkin
96b75fa7cb Special handling of check_recovery_conf for v12+ (#2292)
When starting as a replica it may take some time before Postgres starts accepting new connections, but meanwhile, it could happen that the leader transitioned to a different member and the `primary_conninfo` must be updated.

On pre v12 Patroni regularly checks `recovery.conf` in order to check that recovery parameters match the expectation. Starting from v12 recovery parameters were converted to GUC's and Patroni gets current values from the `pg_settings` view. The last one creates a problem when it takes more than a minute for Postgres to start accepting new connections.

Since Patroni attempts to execute at least `pg_is_in_recovery()` every HA loop, and it raises an exception, the `check_recovery_conf()` effectively wasn't reachable until recovery is finished, but it changed when #2082 was introduced.

As a result of #2082 we got the following behavior:
1. Up to v12 (not including) everything was working as expected
2. v12 and v13 - Patroni restarts Postgres after 1m of recovery
3. v14+ - the `check_recovery_conf()` is not executed because the `replay_paused()` method raises an exception.

In order to properly handle changes of recovery parameters or leader transitioned to a different node on v12+, we will rely on the cached values of recovery parameters until Postgres becomes ready to execute queries.

Close https://github.com/zalando/patroni/issues/2289
2022-05-12 07:45:49 +02:00
Michael Banck
2d15e0dae6 Add target_session_attrs=read-write to standby_leader primary_conninfo (#2193)
This allows having multiple hosts in a standby_cluster and ensures that the standby leader follows the main cluster's new leader after a switchover.

Partially addresses #2189
2022-02-10 15:50:14 +01:00