Provide info about the PG parameters that caused "pending restart"
flag to be set. Both `patronictl list` and `/patroni` REST API endpoint
now show the parameters names and the diff as the "pending restart
reason".
the main issue was that the configuration for Citus handler and for DCS existed in two places, while ideally AbstractDCS should not know many details about what kind of MPP is in use.
To solve the problem we first dynamically create an object implementing AbstractMPP interfaces, which is a configuration for DCS. Later this object is used to instantiate the class implementing AbstractMPPHandler interface.
This is just a starting point, which does some heavy lifting. As a next steps all kind of variables named after Citus in files different from patroni/postgres/mpp/citus.py should be renamed.
In other words this commit takes over the most complex part of #2940, which was never implemented.
Co-authored-by: zhjwpku <zhjwpku@gmail.com>
Exclude actual leader (not the passed leader argument) from the
candidates list in the `patronictl failover` prompt.
Abort `patronictl failover` execution if candidate specified is
the same as the current cluster leader
1. extract `GlobalConfig` class to its own module
2. make the module instantiate the `GlobalConfig` object on load and replace sys.modules with the this instance
3. don't pass `GlobalConfig` object around, but use `patroni.global_config` module everywhere.
4. move `ignore_slots_matchers`, `max_timelines_history`, and `permanent_slots` from `ClusterConfig` to `GlobalConfig`.
5. add `use_slots` property to global_config and remove duplicated code from `Cluster` and `Postgresql.ConfigHandler`.
Besides that improve readability of couple of checks in ha.py and formatting of `/config` key when saved from patronictl.
The `obj` could be easily obtained with the help of `click.get_current_context().obj`.
Introduced function `is_citus_cluster()` will simplify future refactoring to add support of other MPP databases.
In addition to that refactor ctl.py unit tests by moving most of mocks to the global scope.,
Previous to this commit `patronictl query` was working only if `-r` argument was provided to the command. Otherwise it would face issues:
* If neither `-r` nor `-m` were provided:
```
PGPASSWORD=zalando patronictl -c postgres0.yml query -U postgres -c "SHOW PORT"
2023-09-12 17:45:38 No connection to role=None is available
```
* If only `-m` was provided:
```
$ PGPASSWORD=zalando patronictl -c postgres0.yml query -U postgres -c "SHOW PORT" -m postgresql0
2023-09-12 17:46:15 No connection to member postgresql0 is available
```
This issue was a regression introduced by `4c3e0b9382820524239d2aa4d6b95379ef1291db` through PR #2687.
Through that PR we decided to move the common logic used to check mutually exclusiveness of `--role` and `--member` arguments to `get_any_member` function.
However, previous to that change `role` variable would assume the default value of `any` in `query` method, before `get_any_member` was called, which was not the case after the change.
This commit fixes that issue by adding a handler in `get_cursor` function to `role=None`. As `role` defaulting to `any` is handled in a sub-call to `get_any_member`, we are apparently safe in `get_cursor` to return the cursor if `role=None`.
Unit tests were updated accordingly.
References: PAT-204.
- Don't set leader in failover key from patronictl failover
- Show warning and execute switchover if leader option is provided for patronictl failover command
- Be more precise in the log messages
- Allow to failover to an async candidate in sync mode
- Check if candidate is the same as the leader specified in api
- Fix and extend some tests
- Add documentation
1. Take client certificates only from the `ctl` section. Motivation: sometimes there are server-only certificates that can't be used as client certificates. As a result neither Patroni not patronictl work correctly even if `--insecure` option is used.
2. Document that if `restapi.verify_client` is set to `required` then client certificates **must** be provided in the `ctl` section.
3. Add support for `ctl.authentication` and prefer to use it over `restapi.authentication`.
4. Silence annoying InsecureRequestWarning when `patronictl -k` is used, so that behavior becomes is similar to `curl -k`.
- added pyrightconfig.json with typeCheckingMode=strict
- added type hints to all files except api.py
- added type stubs for dns, etcd, consul, kazoo, pysyncobj and other modules
- added type stubs for psycopg2 and urllib3 with some little fixes
- fixes most of the issues reported by pyright
- remaining issues will be addressed later, along with enabling CI linting task
Previous to this commit the `parse_dcs` function would fail with a `PatroniCtlException` if the user ever passed an `etcd3` URL through `--dcs-url` command-line option.
As a consequence, the only way of using `etcd3` in `patronictl` was by using a configuration file passed through `-c` command-line option.
This commit fixes that issue and allows `etcd3` to be used in `--dcs-url`.
References: PAT-91
Close#2638
* Ignore D401 in flake8-docstrings
* Fix newly reported flake8 issues, ignore the old W503 rule
* rely on concatenation of adjecent strings
* Format behave scripts
* Reformat ha.py according to new rules
Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
despite being EOL the number of downloads from pypi for 3.6 is on the first place, hence we need to support it.
Fixed a couple of problems in tests:
1. pytest complains about `call` imported from mock, mock.call() is fine
2. one test of Citus intergration was broken
`patronictl edit-config` requires a pager to show the diff output back to the user. It used to be hard-coded to use either `less` or `more`.
When these tools were not available in the host that would cause `patronictl` to face an exception in `ydiff` module and to show the stack trace in the console.
This PR changes `patronictl edit-config` command to behave like this:
- If `PAGER` environment variable is set, attempt to find the corresponding executable.
- If `PAGER` is not set or is set with an invalid executable, then attempt to use either `less` or `more` as it used to do.
- If no executable is find at all then throw a `PatroniCtlException` to show an user friendly message
Unit tests in `tests/test_ctl.py` were modified accordingly.
References: PAT-21
Close#2604
If communication with etcd nodes failed it is logical to start from scratch, from nodes that are listed in the config. But, it could happen that config is in fact outdated and all nodes in the real cluster were replaced.
Previously we used to track whether config file was changed, which turned out not to work in all possible cases.
The new strategy is a bit more different - if communication with all nodes failed we will continue keeping the last know topology and at the same time will try to figure out the new one by merging two lists together, the cached list and the list from the config file.
keep as much backward compatibility as possible.
Following changes were made:
1. All internal checks are performed as `role in ('master', 'primary')`
2. All internal variables/functions/methods are renamed
3. `GET /metrics` endpoint returns `patroni_primary` in addition to `patroni_master`.
4. Logs are changed to use leader/primary/member/remote depending on the context
5. Unit-tests are using only role = 'primary' instead of 'master' to verify that 1 works.
6. patronictl still supports old syntax, but also accepts `--leader` and `--primary`.
7. `master_(start|stop)_timeout` is automatically translated to `primary_(start|stop)_timeout` if the last one is not set.
8. updated the documentation and some examples
Future plan: in the next major release switch role name from `master` to `primary` and maybe drop `master` altogether.
The Kubernetes implementation will require more work and keep two labels in parallel. Label values should probably be configurable as described in https://github.com/zalando/patroni/issues/2495.
Citus cluster (coordinator and workers) will be stored in DCS as a fleet of Patroni logically grouped together:
```
/service/batman/
/service/batman/0/
/service/batman/0/initialize
/service/batman/0/leader
/service/batman/0/members/
/service/batman/0/members/m1
/service/batman/0/members/m2
/service/batman/
/service/batman/1/
/service/batman/1/initialize
/service/batman/1/leader
/service/batman/1/members/
/service/batman/1/members/m1
/service/batman/1/members/m2
...
```
Where 0 is a Citus group for coordinator and 1, 2, etc are worker groups.
Such hierarchy allows reading the entire Citus cluster with a single call to DCS (except Zookeeper).
The get_cluster() method will be reading the entire Citus cluster on the coordinator because it needs to discover workers. For the worker cluster it will be reading the subtree of its own group.
Besides that we introduce a new method get_citus_coordinator(). It will be used only by worker clusters.
Since there is no hierarchical structures on K8s we will use the citus group suffix on all objects that Patroni creates.
E.g.
```
batman-0-leader # the leader config map for the coordinator
batman-0-config # the config map holding initialize, config, and history "keys"
...
batman-1-leader # the leader config map for worker group 1
batman-1-config
...
```
Citus integration is enabled from patroni.yaml:
```yaml
citus:
database: citus
group: 0 # 0 is for coordinator, 1, 2, etc are for workers
```
If enabled, Patroni will create the database, citus extension in it, and INSERTs INTO `pg_dist_authinfo` information required for Citus nodes to communicate between each other, i.e. 'password', 'sslcert', 'sslkey' for superuser if they are defined in the Patroni configuration file.
When the new Citus coordinator/worker is bootstrapped, Patroni adds `synchronous_mode: on` to the `bootstrap.dcs` section.
Besides that, Patroni takes over management of some Postgres GUCs:
- `shared_preload_libraries` - Patroni ensures that the "citus" is added to the first place
- `max_prepared_transactions` - if not set or set to 0, Patroni changes the value to `max_connections*2`
- wal_level - automatically set to logical. It is used by Citus to move/split shards. Under the hood Citus is creating/removing replication slots and they are automatically added by Patroni to the `ignore_slots` configuration to avoid accidental removal.
The coordinator primary actively discovers worker primary nodes and registers/updates them in the `pg_dist_node` table using
citus_add_node() and citus_update_node() functions.
Patroni running on the coordinator provides the new REST API endpoint: `POST /citus`. It is used by workers to facilitate controlled switchovers and restarts of worker primaries.
When the worker primary needs to shut down Postgres because of restart or switchover, it calls the `POST /citus` endpoint on the coordinator and the Patroni on the coordinator starts a transaction and calls `citus_update_node(nodeid, 'host-demoted', port)` in order to pause client connections that work with the given worker.
Once the new leader is elected or postgres started back, they perform another call to the `POST/citus` endpoint, that does another `citus_update_node()` call with actual hostname and port and commits a transaction. After transaction is committed, coordinator reestablishes connections to the worker node and client connections are unblocked.
If clients don't run long transaction the operation finishes without client visible errors, but only a short latency spike.
All operations on the `pg_dist_node` are serialized by Patroni on the coordinator. It allows to have more control and ROLLBACK transaction in progress if its lifetime exceeding a certain threshold and there are other worker nodes should be updated.
* Remove patronictl configure command
* Change name of the "secret" ENV variable (DCS->DCS_URL) and the corresponding patronictl option (to avoid mixing it up with the one from tests)
Windows doesn't support `SIGTERM`, but our behave tests in majority of cases relying on Patroni graceful shutdown.
In order to emulate the behaviour we introduced the new REST API endpoint `POST /sigterm`. The endpoint works only on Windows and when `BEHAVE_DEBUG` environment variable is set.
Besides that some minor adjustments in behave tests were done. Mainly related to backslash-slash handling.
In addition to that improve test coverage on Windows by properly mocking access to filesystem and avoiding calling
`subprocess.call()`. Specifically, symlink creation on Windows requires Admin privileges and there is no `true.exe`.
* update release notes
* bump version
* change the default alignment in patronictl table output to `left`
* add missing tests
* add missing pieces to the documentation
The only python-etcd3 client working directly via gRPC still supports only a single endpoint, which is not very nice for high-availability.
Since Patroni is already using a heavily hacked version of python-etcd with smart retries and auto-discovery out-of-the-box, I decided to enhance the existing code with limited support of v3 protocol via gRPC-gateway.
Unfortunately, watches via gRPC-gateway requires us to open and keep the second connection to the etcd.
Known limitations:
* The very minimal supported version is 3.0.4. On earlier versions transactions don't work due to bugs in grpc-gateway. Without transactions we can't do atomic operations, i.e. leader locks.
* Watches work only starting from 3.1.0
* Authentication works only starting from 3.3.0
* gRPC-gateway does not support authentication using TLS Common Name. This is because gRPC-proxy terminates TLS from its client so all the clients share a cert of the proxy: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/authentication.md#using-tls-common-name
$ python3 patronictl.py -c postgresql0.yml list
Error: Provided config file postgresql0.yml not existing or no read rights. Check the -c/--config-file parameter
That required a refactoring of `Config` and `Patroni` classes. Now one has to explicitely create the instance of `Config` before creating `Patroni`.
The Config file can optionally call the validate function.
We will try to import only the module which has a configuration section.
I.e. if there is only zookeeper section in the config, Patroni will try to import only `patroni.dcs.zookeeper` and skip `etcd`, `consul`, and `kubernetes`.
This approach has two benefits:
1. When there are no dependencies installed Patroni was showing INFO messages `Failed to import smth`, which looks scary.
2. It reduces memory usage, because sometimes dependencies are heavy.
The /history endpoint shows the content of the `history` key in DCS
The /cluster endpoint show all cluster members and some service info like pending and scheduled restarts or switchovers.
In addition to that implement `patronictl history`
Close#586Close#675Close#1133
* make it possible to use client certificates with REST API
* define a separate PatroniRequest class which handles all communication
* refactor patronictl to use the new class
* make Ha to use the new class instead of calling requests.get. The old call wasn't taking into account certificates and basic-auth
Close#898