Commit Graph

942 Commits

Author SHA1 Message Date
Alexander Kukushkin
bda07fa526 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2024-04-02 12:10:17 +02:00
Alexander Kukushkin
d7454f7bcd Use target_session_attrs only when multiple hosts in standby_cluster (#3040)
Actually, the comment in the code already said that, but in practice it didn't happen.

It should help #3039
2024-04-02 11:59:57 +02:00
Waynerv
ceb2965ab8 Use importlib_resources to read validators file (#3018)
When packaged into pyz (zip file), resources are not directly available on filesystem and therefore we can't always rely on os.listdir() and open() to enumerate and read them.

We are going to use importlib.resources() to solve this problem, except python 3.8 and older, where there is no function (files()) available to enumerate resources. For legacy (3.8 actually becomes EOL in October 2024) python versions we are going to use os.listdir() as a fallback.
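
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper name `list_resource_names` is hypothetical, and it uses the standard-library `importlib.resources.files()` (available since Python 3.9) rather than the `importlib_resources` backport mentioned in the title.

```python
import importlib
import os
import sys


def list_resource_names(package):
    """Return the sorted names of the entries bundled with ``package``.

    Hypothetical helper: on Python 3.9+ it enumerates resources through
    importlib.resources.files(), which also works for zipped packages;
    on older versions it falls back to os.listdir() on the package directory.
    """
    if sys.version_info >= (3, 9):
        from importlib.resources import files
        return sorted(entry.name for entry in files(package).iterdir())
    # Legacy fallback: only works when the package lives on the filesystem.
    module = importlib.import_module(package)
    return sorted(os.listdir(os.path.dirname(module.__file__)))
```

For example, `list_resource_names('json')` lists the files of the standard-library `json` package.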

Close #3017
2024-04-02 09:00:37 +02:00
Polina Bungina
9b237b332e Set global_config from dynamic_config if DCS data is empty (#3038)
Fix the oversight of 193c73f
We need to set global config from the local cache if cluster.config is not initialized.
If there is nothing written into the DCS (yet), we need the setup info for the decision making (e.g., if it is a standby cluster)
2024-03-28 08:15:58 +01:00
Grigory Smolkin
b09af642e6 Disable WAL streaming on standby node via new boolean tag "nostream" (#2842)
Add support for ``nostream`` tag. If set to ``true`` the node will not use replication protocol to stream WAL. It will rely instead on archive recovery (if ``restore_command`` is configured) and ``pg_wal``/``pg_xlog`` polling. It also disables copying and synchronization of permanent logical replication slots on the node itself and all its cascading replicas. Setting this tag on primary node has no effect.
2024-03-20 10:10:53 +01:00
Israel
014777b20a Refactor Barman scripts and add a sub-command to switch Barman config (#3016)
We currently have a script named `patroni_barman_recover` in Patroni, which is intended to be used as a custom bootstrap method, or as a custom replica creation method.

Now there is a need for one more Barman-related script in Patroni to handle switching of config models in Barman upon `on_role_change` events.

However, instead of creating another Patroni script, let's say `patroni_barman_config_switch`, and duplicating a lot of logic in the code, we decided to refactor the code so:

* Instead of two separate scripts (`patroni_barman_recover` and `patroni_barman_config_switch`), we have a single script (`patroni_barman`) with 2 sub-commands (`recover` and `config-switch`)

This is the overview of changes that have been performed:

* File `patroni.scripts.barman_recover` has been removed, and its logic has been split into a few files:
  * `patroni.scripts.barman.cli`: handles the entrypoint of the new `patroni_barman` command, exposing the argument parser and calling the appropriate functions depending on the sub-command
  * `patroni.scripts.barman.utils`: implements utility enums, functions and classes which can be used by `cli` and by the sub-command implementations:
    * retry mechanism
    * logging set up
    * communication with pg-backup-api
  * `patroni.scripts.barman.recover`: implements the `recover` sub-command only
* File `patroni.tests.test_barman_recover` has been renamed as `patroni.tests.test_barman`
* File `patroni.scripts.barman.config_switch` was created to implement the `config-switch` sub-command only
* `setup.py` has been changed so it generates a `patroni_barman` application instead of `patroni_barman_recover`
* Docs and unit tests were updated accordingly

References: PAT-154.
2024-03-20 09:04:55 +01:00
Alexander Kukushkin
a8cfd46801 Retry one time on Etcd3 auth error (#3026)
But do it only if we didn't authenticate right before executing a request. Previously retries only happened when the caller was executed with `Retry.__call__()`, which is not the case for methods like `set_failover_value()` or `set_config_value()`. Also, it seems that existing watchers aren't affected, therefore we will not restart them after reauthentication.

In addition to that fix issues with `Retry.ensure_deadline(0)`:
1. the return value was ignored
2. we don't have to set `Retry.deadline` attr, it is not used anywhere

Close https://github.com/zalando/patroni/issues/3023
2024-03-07 12:01:35 +01:00
zhjwpku
e131065d74 rename citus_handler to mpp_handler (#2991)
Obey the following 5 meanings of the term _cluster_ in Patroni.

1. PostgreSQL cluster: a cluster of PostgreSQL instances which have the same system identifier.
2. MPP cluster: a cluster of PostgreSQL clusters in which one acts as the Coordinator and the others act as workers.
3. Coordinator cluster: a PostgreSQL cluster which acts in the role of 'coordinator' within an MPP cluster.
4. Worker cluster: a PostgreSQL cluster which acts in the role of 'worker' within an MPP cluster.
5. Patroni cluster: any cluster managed by Patroni can be called a Patroni cluster, but we usually use this term to refer to a single PostgreSQL cluster or an MPP cluster.
2024-02-28 06:16:20 +01:00
Polina Bungina
bdd02324b4 Add pending restart reason information (#2978)
Provide info about the PG parameters that caused "pending restart"
flag to be set. Both `patronictl list` and `/patroni` REST API endpoint
now show the parameters names and the diff as the "pending restart
reason".
2024-02-14 08:54:20 +01:00
Israel
7adfc0dbe7 Patroni doesn't filter out some not allowed options from pg_basebackup (#3015)
When running `pg_basebackup` to bootstrap a replica, Patroni sanitizes
the custom user options that come from `postgresql.basebackup` configuration
section using the `process_user_options` method.

However, there is a bug in that method: it filters out not allowed options
that are in the format `- setting`, but not the ones in the format
`- setting: value` from `postgresql.basebackup`.

An example of that issue is the `dbname` setting. If you specify something
like this in the configuration file:

```yaml
postgresql:
  basebackup:
    - dbname: "host=RANDOM"
```

You end up with `--dbname` being specified twice for `pg_basebackup`, with
`--dbname='host=RANDOM'` taking precedence as it comes up later in the
command.

This commit fixes that issue by adding a `continue` statement when
the setting in format `- setting: value` is not allowed, thus skipping
it.
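
The shape of the fix can be illustrated with a small stand-alone sketch. It is not Patroni's `process_user_options`: the function name `sanitize_options`, its signature and the generated flag format are assumptions made for illustration only.

```python
def sanitize_options(options, not_allowed):
    """Turn YAML-style options into command-line flags, skipping forbidden ones.

    Hypothetical simplification: ``options`` mixes plain strings (``- setting``)
    and single-key dicts (``- setting: value``).
    """
    result = []
    for opt in options:
        if isinstance(opt, str):
            if opt in not_allowed:
                continue  # forbidden "- setting" entry
            result.append('--{0}'.format(opt))
        elif isinstance(opt, dict) and len(opt) == 1:
            key, value = next(iter(opt.items()))
            if key in not_allowed:
                continue  # the missing step: also skip "- setting: value" entries
            result.append('--{0}={1}'.format(key, value))
    return result
```

With `not_allowed=('dbname', 'pgdata')`, an entry such as `{'dbname': 'host=RANDOM'}` is dropped instead of producing a second `--dbname` argument.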

---------

Signed-off-by: Israel Barth Rubio <israel.barth@enterprisedb.com>
2024-02-06 08:36:11 +01:00
Polina Bungina
f6943a859d Improve logging for Pg param change (#3008)
* Convert old value to a human-readable format
* Add log line about pg_controldata/global config mismatch that causes
  pending restart flag to be set
2024-01-29 10:44:25 +01:00
Alexander Kukushkin
e532f9dc38 Fix bugs introduced in the jsonlog implementation (#3006)
1. RotatingFileHandler is a child of StreamHandler, therefore we can't rely on `not isinstance(handler, logging.StreamHandler)`.
2. If the legacy version of `python-json-logger` is installed (that doesn't support rename_fields or static_fields), we want to do what is possible rather than fail with an exception.

Besides that:
1. improve code coverage
2. make unit tests pass without python-json-logger installed or if only some old version is installed.
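
The first point is easy to verify with the standard library alone. The helper `is_console_handler` below is a hypothetical name used only to show one way of telling the two handler kinds apart; it is not taken from the commit.

```python
import logging
from logging.handlers import RotatingFileHandler

# RotatingFileHandler -> BaseRotatingHandler -> FileHandler -> StreamHandler
assert issubclass(RotatingFileHandler, logging.StreamHandler)


def is_console_handler(handler):
    """Return True only for stream handlers that are not file-based.

    Hypothetical helper: checking StreamHandler alone is not enough,
    because every FileHandler subclass is also a StreamHandler.
    """
    return (isinstance(handler, logging.StreamHandler)
            and not isinstance(handler, logging.FileHandler))


# delay=True postpones opening the file until the first record is emitted.
file_handler = RotatingFileHandler('patroni-demo.log', delay=True)
console_handler = logging.StreamHandler()
```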
2024-01-29 10:37:15 +01:00
Alexander Kukushkin
688c85389c Release v3.2.2 (#3007)
- update release notes
- bump Patroni version
- bump pyright version and fix reported issues
- improve compatibility with legacy psycopg2

Co-authored-by: Polina Bungina <bungina@gmail.com>
2024-01-17 08:31:08 +01:00
علی سالمی
5c4ee30dae Add JSON log format to logging configuration (#2982)
Now Patroni can be configured as below to log in JSON format.

```yaml
log:
  type: json
  format:
    - asctime: '@timestamp'
    - levelname: level
    - message
    - module
    - name: logger_name
  static_fields:
    app: patroni
```

This config produces this log:

```json
{
  "@timestamp": "2023-12-14 19:51:24,872",
  "level": "INFO",
  "message": "Lock owner: None; I am postgresql1",
  "module": "ha",
  "app": "patroni",
  "logger_name": "patroni.ha"
}
```
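
To show how renamed and static fields end up in the output, here is a minimal sketch built on the standard `logging` module only. It does not use `python-json-logger` and is not Patroni's implementation; the class name `JsonFormatter` and its constructor arguments are assumptions.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render selected record attributes as one JSON object per log line."""

    def __init__(self, fields, static_fields=None):
        super().__init__()
        # fields: list of (record attribute, output key) pairs
        self._fields = fields
        self._static_fields = static_fields or {}

    def format(self, record):
        # Populate the derived attributes that Formatter normally fills in.
        record.message = record.getMessage()
        record.asctime = self.formatTime(record)
        output = {key: getattr(record, attr) for attr, key in self._fields}
        output.update(self._static_fields)
        return json.dumps(output)


formatter = JsonFormatter(
    fields=[('asctime', '@timestamp'), ('levelname', 'level'), ('message', 'message'),
            ('module', 'module'), ('name', 'logger_name')],
    static_fields={'app': 'patroni'})
```

Attaching `formatter` to any handler via `handler.setFormatter(formatter)` yields lines with the same keys as the sample output above.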
2024-01-16 10:42:48 +01:00
Polina Bungina
266cdc4810 Fixes around pending_restart flag (#3003)
* Do not set pending_restart flag if hot_standby is set to 'off' during a custom bootstrap (even though we will have this flag actually set in PG, this configuration parameter is irrelevant on primary and there is no actual need for restart)
* Skip hot_standby and wal_log_hints when querying parameters pending restart on config reload. They actually can be changed manually (e.g. via ALTER SYSTEM) and it will cause the pending_restart state in PG but Patroni anyway always passes those params to postmaster as command line options. And there they only can have one value - 'on' (except on primary when performing custom bootstrap)
2024-01-16 10:32:28 +01:00
Alexander Kukushkin
5d8c2fb559 Restore recovery GUCs when joining running standby (#2998)
Close https://github.com/zalando/patroni/issues/2993
2024-01-08 08:35:53 +01:00
Alexander Kukushkin
59ecfb1799 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2024-01-05 10:22:35 +01:00
Polina Bungina
71ccf91e36 Don't filter out contradictory nofailover tag (#2992)
* Ensure that nofailover will always be used if both nofailover and
failover_priority tags are provided
* Call _validate_failover_tags from reload_local_configuration() as well
* Properly check values in `_validate_failover_tags()`: the nofailover value should be cast to boolean, as is done when it is accessed in other places
2024-01-02 09:30:18 +01:00
zhjwpku
8acefefc42 Fix Citus bootstrap - CREATE DATABASE cannot be executed from a function (#2994)
This was introduced by #2990: the pod cannot be started and shows the
following logs:

```
2023-12-26 03:29:25.569 UTC [47] CONTEXT:  SQL statement "CREATE DATABASE "citus""
        PL/pgSQL function inline_code_block line 5 at SQL statement
2023-12-26 03:29:25.569 UTC [47] STATEMENT:  DO $$
        BEGIN
            PERFORM * FROM pg_catalog.pg_database WHERE datname = 'citus';
            IF NOT FOUND THEN
                CREATE DATABASE "citus";
            END IF;
        END;$$
2023-12-26 03:29:25,570 ERROR: post_bootstrap
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/patroni/postgresql/bootstrap.py", line 474, in post_bootstrap
    self._postgresql.citus_handler.bootstrap()
  File "/usr/local/lib/python3.11/dist-packages/patroni/postgresql/mpp/citus.py", line 401, in bootstrap
    cur.execute(sql.encode('utf-8'))
psycopg2.errors.ActiveSqlTransaction: CREATE DATABASE cannot be executed from a function
CONTEXT:  SQL statement "CREATE DATABASE "citus""
PL/pgSQL function inline_code_block line 5 at SQL statement
```
---------

Signed-off-by: Zhao Junwang <zhjwpku@gmail.com>
2023-12-29 09:01:46 +01:00
Alexander Kukushkin
bcfd8438a5 Abstract CitusHandler and decouple it from configuration (#2950)
The main issue was that the configuration for Citus handler and for DCS existed in two places, while ideally AbstractDCS should not know many details about what kind of MPP is in use.

To solve the problem we first dynamically create an object implementing AbstractMPP interfaces, which is a configuration for DCS. Later this object is used to instantiate the class implementing AbstractMPPHandler interface.

This is just a starting point, which does some heavy lifting. As next steps, all kinds of variables named after Citus in files other than patroni/postgresql/mpp/citus.py should be renamed.

In other words this commit takes over the most complex part of #2940, which was never implemented.

Co-authored-by: zhjwpku <zhjwpku@gmail.com>
2023-12-21 08:58:26 +01:00
Alexander Kukushkin
5c3e1a693e Implement validation of the log section (#2989)
Somehow it was always forgotten.
2023-12-20 10:49:33 +01:00
Polina Bungina
206ee91b07 Exclude leader from failover candidates in ctl (#2983)
Exclude actual leader (not the passed leader argument) from the
candidates list in the `patronictl failover` prompt.
Abort `patronictl failover` execution if candidate specified is
the same as the current cluster leader
2023-12-20 09:54:04 +01:00
Polina Bungina
f0719d148c Actually allow failover to an async candidate in sync mode (#2980) 2023-12-13 08:40:47 +01:00
Polina Bungina
efdedc7049 Reload postgres config if a server param was reset (#2975)
Fix the case when a parameter value was changed and then reset back to
the initial value without restart - before this fix, the second change
was not reflected in the Postgres config.
This commit also includes the related unit test refactoring.
2023-12-06 15:57:05 +01:00
Alexander Kukushkin
0e6a2ff3a9 Don't let replica restore initialize key when DCS was wiped (#2970)
It was happening in the branch where Patroni was supposed to complain about converting a standalone PG cluster to be governed by Patroni and exit.
2023-12-05 08:30:20 +01:00
Alexander Kukushkin
91a6059055 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-11-29 14:34:56 +01:00
Alexander Kukushkin
92f4aa2ef9 Simplify methods related to replication slots in the Cluster class (#2958)
Instead of passing around names, specific tags, and Postgres version just pass Postgresql object and objects implementing Tags interface.

It should simplify implementation of #2842
2023-11-29 14:22:49 +01:00
Alexander Kukushkin
9afaf6eb51 Don't pass around is_paused to sync_replication_slots (#2963)
Oversight of #2935
2023-11-28 08:37:22 +01:00
Alexander Kukushkin
13cc86f851 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-11-24 14:42:15 +01:00
Alexander Kukushkin
193c73f6b8 Make GlobalConfig really global (#2935)
1. extract `GlobalConfig` class to its own module
2. make the module instantiate the `GlobalConfig` object on load and replace the `sys.modules` entry with this instance
3. don't pass `GlobalConfig` object around, but use `patroni.global_config` module everywhere.
4. move `ignore_slots_matchers`, `max_timelines_history`,  and `permanent_slots` from `ClusterConfig` to `GlobalConfig`.
5. add `use_slots` property to global_config and remove duplicated code from `Cluster` and `Postgresql.ConfigHandler`.

Besides that improve readability of couple of checks in ha.py and formatting of `/config` key when saved from patronictl.
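
The `sys.modules` trick from point 2 can be demonstrated in a self-contained way. The sketch below is illustrative only: the module name `demo_global_config`, the `update()` method and the `is_paused` property are assumptions, and a real module would assign `sys.modules[__name__]` at the end of its own file.

```python
import sys
import types


class GlobalConfig(types.ModuleType):
    """A module-like object, so properties can be used on the "module" itself."""

    def __init__(self):
        super().__init__('demo_global_config')
        self._config = {}

    def update(self, config):
        self._config = dict(config)

    @property
    def is_paused(self):
        return bool(self._config.get('pause'))


# Register the instance under the module name; later imports return it as is.
sys.modules['demo_global_config'] = GlobalConfig()

import demo_global_config  # noqa: E402

demo_global_config.update({'pause': True})
```

Every importer now shares the same object, so nothing has to be passed around explicitly.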
2023-11-24 09:26:05 +01:00
Alexander Kukushkin
91327f943c Factor out dynamic class finder/loader to a dedicated file (#2954)
It could be reused to do the same for MPP modules/classes.
Ref: #2940 and #2950
2023-11-23 17:04:23 +01:00
Alexander Kukushkin
5dab735534 Compatibility with ancient mock (#2951)
Just in case someone still uses Ubuntu 18.04
2023-11-15 11:25:46 +01:00
Alexander Kukushkin
ecf158bce3 Get rid of pass_obj() in most of patronictl commands (#2945)
The `obj` could be easily obtained with the help of `click.get_current_context().obj`.

Introduced function `is_citus_cluster()` will simplify future refactoring to add support of other MPP databases.

In addition to that, refactor the ctl.py unit tests by moving most of the mocks to the global scope.
2023-11-14 13:44:54 +01:00
Alexander Kukushkin
1870dcd8f9 Fix bug with custom bootstrap (#2948)
Patroni was falsely applying `--command` argument.

Close https://github.com/zalando/patroni/issues/2947
2023-11-13 15:01:57 +01:00
Alexander Kukushkin
7370f70f13 Fix pg_rewind behavior with Postgres v16+ (#2944)
The error message format was changed in
4ac30ba4f2, which caused `pg_rewind` to be called by Patroni even when it was not necessary.
2023-11-10 09:23:45 +01:00
Alexander Kukushkin
1b96ae9c0a Fix Etcd v2 with Citus (#2943)
When deploying a new Citus cluster with Etcd v2 Patroni was failing to start with the following exception:
```python
2023-11-09 10:51:41,246 INFO: Selected new etcd server http://localhost:2379
Traceback (most recent call last):
  File "/home/akukushkin/git/patroni/./patroni.py", line 6, in <module>
    main()
  File "/home/akukushkin/git/patroni/patroni/__main__.py", line 343, in main
    return patroni_main(args.configfile)
  File "/home/akukushkin/git/patroni/patroni/__main__.py", line 237, in patroni_main
    abstract_main(Patroni, configfile)
  File "/home/akukushkin/git/patroni/patroni/daemon.py", line 172, in abstract_main
    controller = cls(config)
  File "/home/akukushkin/git/patroni/patroni/__main__.py", line 66, in __init__
    self.ensure_unique_name()
  File "/home/akukushkin/git/patroni/patroni/__main__.py", line 112, in ensure_unique_name
    cluster = self.dcs.get_cluster()
  File "/home/akukushkin/git/patroni/patroni/dcs/__init__.py", line 1654, in get_cluster
    cluster = self._get_citus_cluster() if self.is_citus_coordinator() else self.__get_patroni_cluster()
  File "/home/akukushkin/git/patroni/patroni/dcs/__init__.py", line 1638, in _get_citus_cluster
    cluster = groups.pop(CITUS_COORDINATOR_GROUP_ID, Cluster.empty())
AttributeError: 'Cluster' object has no attribute 'pop'
```

It has been broken since #2909.

In addition to that fix `_citus_cluster_loader()` interface by allowing it to return only dict obj.
2023-11-09 11:09:38 +01:00
Alexander Kukushkin
3ffd598a1c Do a real http request when performing name uniqueness check (#2942)
When running in containers it is possible that the traffic is routed using `docker-proxy`, which listens on the port and accepts incoming connections.

This commit effectively sticks to the original solution from #2878
2023-11-08 14:08:02 +01:00
Alexander Kukushkin
552e8643d9 Verify that replica nodes received checkpoint LSN on shutdown (#2939)
If archiving is enabled, the `Postgresql.latest_checkpoint_location()` method returns the LSN of the prev (SWITCH) record, which points to the beginning of the WAL file. It is done in order to make it possible to safely promote a replica which recovers WAL files from the archive and wasn't streaming when the primary was stopped (the primary doesn't archive this WAL file).

But in certain cases using the LSN pointing to the SWITCH record was causing an unnecessary pg_rewind, if the replica didn't manage to replay the shutdown checkpoint record before it was promoted.

In order to mitigate the problem we need to check that replica received/replayed exactly the shutdown checkpoint LSN. But, at the same time we will still write LSN of the SWITCH record to the `/status` key when releasing the leader lock.
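
The comparison behind this check can be sketched with plain LSN arithmetic. The function names below are hypothetical and the logic is a simplification of what the commit describes: a replica counts as caught up only once it has reached the shutdown checkpoint LSN itself, not merely the earlier SWITCH record.

```python
def parse_lsn(lsn):
    """Convert a textual LSN such as '0/3000028' into an integer."""
    high, low = lsn.split('/')
    return (int(high, 16) << 32) + int(low, 16)


def replica_reached_checkpoint(replica_lsn, shutdown_checkpoint_lsn):
    """Hypothetical check: has the replica received/replayed the shutdown checkpoint?"""
    return parse_lsn(replica_lsn) >= parse_lsn(shutdown_checkpoint_lsn)
```

A replica sitting at the beginning of the WAL file (`0/3000000`) would not pass the check against a shutdown checkpoint at `0/3000028`.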
2023-11-07 11:05:54 +01:00
Israel
269b04be5d Add a contrib script for remote Barman recovery (#2931)
A contrib script, which can be used as a custom bootstrap method, or as a custom create replica method.

The script communicates with the pg-backup-api on the Barman node so Patroni is able to restore a Barman backup remotely.

The `--help` option of the script, along with the script docstring, should provide some context on how to fill in its parameters.

Patroni docs were updated accordingly to share examples about how to configure the script as a custom bootstrap method, or as a custom create replica method.

References: PAT-216.
2023-11-06 16:25:27 +01:00
Alexander Kukushkin
3d527f5728 Improve formatting of generated config and validation of ints (#2928)
- order sections similar to sample configs
- add warnings and comments to `bootstrap.dcs` section.
- add `tags` and `log` sections.
- use discovered IPs in `postgresql.connect_address` and `postgresql.listen`
- set `wal_level` to `replica` for PostgreSQL 9.6+
- make unit tests pass with python 3.6
- improve config validator so it doesn't complain when some ints are strings in YAML file.
2023-10-25 14:23:57 +02:00
Alexander Kukushkin
94e128c51a Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-10-25 10:43:21 +02:00
Mark Pekala
f5ee67fa1c Feature: failover priority (#2780)
The priority is configured with the `failover_priority` tag. Possible values range from `0` to infinity, where `0` means that the node will never become the leader, which is the same as the `nofailover` tag set to `true`. As a result, in the configuration file one should set only one of the `failover_priority` or `nofailover` tags.

The failover priority kicks in only when more than one node has the same receive/replay LSN and these nodes are ahead of the other nodes in the cluster. In this case the node with the higher value of `failover_priority` is preferred. If there is a node with a higher receive/replay LSN, it will become the new leader even if it has a lower value of `failover_priority` (except when the priority is set to 0).
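
The ordering rule described above (LSN first, priority as tie-breaker, priority `0` never eligible) can be expressed as a small sketch. It is a simplified model and not Patroni's leader race: the `Candidate` type, the `pick_new_leader` function and the default priority of `1` are assumptions.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    lsn: int                    # receive/replay LSN as an integer
    failover_priority: int = 1  # assumed default for this sketch
    nofailover: bool = False


def pick_new_leader(candidates):
    """Return the name of the preferred candidate, or None if nobody is eligible."""
    eligible = [c for c in candidates
                if not c.nofailover and c.failover_priority > 0]
    if not eligible:
        return None
    # The most advanced LSN wins; the priority only breaks ties between equal LSNs.
    return max(eligible, key=lambda c: (c.lsn, c.failover_priority)).name
```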

Close https://github.com/zalando/patroni/issues/2759
2023-10-24 12:22:48 +02:00
Alexander Kukushkin
f32989124c Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-10-23 15:28:54 +02:00
Alexander Kukushkin
d471f1156d Handle AuthOldRevision error (#2913)
The error is raised if Etcd is configured to use JWT auth tokens and when the user database in Etcd is updated, because the update invalidates all tokens.

If retries are requested, try to get a new token and repeat the request. Repeat it in a loop until the request is successfully executed or until `retry_timeout` is exhausted. This is the only way of solving a race condition, because between authentication and executing the request yet another modification of the user database in Etcd might happen.

If the request doesn't have to be retried immediately, set a flag that the next API request should perform the authentication first and let Patroni naturally repeat the request on the next heartbeat loop.
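
The immediate-retry branch can be sketched as a loop bounded by a deadline. This is an illustration under assumptions, not the Etcd3 client code from Patroni: the exception class is redefined locally, and `call_with_reauth`, `request` and `authenticate` are hypothetical names.

```python
import time


class AuthOldRevision(Exception):
    """Stand-in for the error raised when the auth token was invalidated."""


def call_with_reauth(request, authenticate, retry_timeout, pause=0.01):
    """Run ``request``; on AuthOldRevision get a new token and try again.

    The loop ends when the request succeeds or ``retry_timeout`` seconds pass,
    because the user database may change again between authentication and the call.
    """
    deadline = time.monotonic() + retry_timeout
    while True:
        try:
            return request()
        except AuthOldRevision:
            if time.monotonic() >= deadline:
                raise
            authenticate()
            time.sleep(pause)
```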

Co-authored-by: Kenny Do <kedo@render.com>
Ref: https://github.com/zalando/patroni/pull/2911
2023-10-23 14:00:37 +02:00
Alexander Kukushkin
c5fffb3c97 Further work on permanent physical slots (#2891)
- Fixed issues with the has_permanent_slots() method. It didn't take into account the case of permanent physical slots for members, falsely concluding that there are no permanent slots.
- Write to the status key only LSNs for permanent slots (not just for slots that exist on the primary).
  - Include pg_current_wal_flush_lsn() to slots feedback, so that slots on standby nodes could be advanced
- Improved behave tests:
  - Verify that permanent slots are properly created on standby nodes
  - Verify that permanent slots are properly advanced, including DCS failsafe mode
  - Verify that only permanent slots are written to the `/status`
2023-10-23 08:24:28 +02:00
zhjwpku
260ab36f2e mock getaddrinfo in case test failure (#2918)
Close #2915
2023-10-17 19:53:19 +02:00
Alexander Kukushkin
fc67ba73f0 Allow to specify psycopg* in extras and switch to build (#2907)
* remove check_psycopg() call from the setup.py, when installing from wheel it doesn't work anyway.
* call check_psycopg() function before process_arguments(), because the last one is trying to import psycopg and fails with the stacktrace, while the first one shows a nice human-readable error message.
* add psycopg2, psycopg2-binary, and psycopg3 extras, that will install psycopg2>=2.5.4, psycopg2-binary, or psycopg[binary]>=3.0.0 modules respectively.
* move check_psycopg() function to the __main__.py.
* introduce the new extra called `all`, it will allow to install all dependencies at once (except psycopg related).
* use the `build` module in order to create sdist bdist_wheel packages.
* update the documentation regarding psycopg and extras (dependencies).
2023-10-17 14:46:15 +02:00
Alexander Kukushkin
aa3ebe0af8 Don't cache anything in Zookeeper implementation (#2909)
Cache creates a lot of problems and prevents implementing a feature of automatic retention of physical replication slots for members with configurable retention policy.

Just read the entire cluster from Zookeeper instead and use watchers only for the `/leader` and `/config` keys.
2023-10-17 08:56:31 +02:00
Alexander Kukushkin
d93db20baa Set citus.local_hostname (#2903)
There are cases when Citus wants to have a connection to the local postgres. By default it uses `localhost` for that, which is not always available. To solve it we will set the `citus.local_hostname` GUC to a custom value, which is the same as the one Patroni uses to connect to Postgres.
2023-10-16 10:21:50 +02:00
Alexander Kukushkin
5c6b34a757 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-10-10 09:57:04 +02:00