We should ignore the former leader with higher priority when it reports the same LSN as the current node.
This bug could be a contributing factor to the issues described in #3295.
In addition to that, mock the socket.getaddrinfo() call in test_api.py to avoid hitting DNS servers.
The first one is available starting from PostgreSQL v13 and contains the
real write LSN. We will prefer it over the value returned by
pg_last_wal_receive_lsn(), which is in fact the flush LSN.
The second one is available starting from PostgreSQL v9.6 and points to
the WAL flush position on the source host. In case of the primary it allows
a better calculation of the replay lag, because the values stored in DCS
are updated only every loop_wait seconds.
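A rough illustration of that preference (a sketch, not Patroni's actual code; it assumes PostgreSQL 13+, where `pg_stat_wal_receiver` exposes a `written_lsn` column, and a plain DB-API connection):
```python
# Sketch: prefer the write LSN reported by the WAL receiver and fall back to
# pg_last_wal_receive_lsn(), which in fact reports the flush LSN.
def wal_received_lsn(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT written_lsn FROM pg_catalog.pg_stat_wal_receiver")
        row = cur.fetchone()
        if row and row[0] is not None:
            return row[0]
        cur.execute("SELECT pg_catalog.pg_last_wal_receive_lsn()")
        return cur.fetchone()[0]
```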
To enable quorum commit:
```diff
$ patronictl.py edit-config
---
+++
@@ -5,3 +5,4 @@
use_pg_rewind: true
retry_timeout: 10
ttl: 30
+synchronous_mode: quorum
Apply these changes? [y/N]: y
Configuration changed
```
By default Patroni will use `ANY 1(list,of,standbys)` in `synchronous_standby_names`. That is, only one node out of the listed replicas will be used for quorum.
If you want to increase the number of quorum nodes, you can do so with:
```diff
$ patronictl edit-config
---
+++
@@ -6,3 +6,4 @@
retry_timeout: 10
synchronous_mode: quorum
ttl: 30
+synchronous_node_count: 2
Apply these changes? [y/N]: y
Configuration changed
```
Good old `synchronous_mode: on` is still supported.
Close https://github.com/patroni/patroni/issues/664
Close https://github.com/zalando/patroni/pull/672
The `Status` class was introduced in #2853, but we kept old properties in the `Cluster` object in order to have fewer changes in the rest of the code.
This PR finishes the refactoring.
The following adjustments were made:
- Introduced the `Status.is_empty()` method, which is used in `Cluster.is_empty()` instead of checking actual values, to simplify the introduction of further fields to the `Status` object (see the sketch below).
- Removed the `Cluster.last_lsn` property
- Changed the `Cluster.slots` property to always return a dict and perform sanity checks on values.
Besides that, this PR addresses a couple of problems:
- the `AbstractDCS.get_cluster()` method was accessing some properties without holding a lock on `_cluster_thread_lock`.
- the `Cluster.__permanent_slots` property was setting 'lsn' from all cluster members, while it should be doing that only for members with the `replicatefrom` tag.
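A rough sketch of the resulting shape (illustrative only; fields other than those named above are not implied by this PR):
```python
# Sketch: Cluster.is_empty() delegates to Status.is_empty(), so adding new fields
# to Status later only requires touching this one method.
from typing import Dict, NamedTuple, Optional


class Status(NamedTuple):
    last_lsn: int
    slots: Optional[Dict[str, int]]

    def is_empty(self) -> bool:
        return self.last_lsn == 0 and self.slots is None
```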
Adhere to the following five meanings of the term _cluster_ in Patroni:
1. PostgreSQL cluster: a cluster of PostgreSQL instances which have the same system identifier.
2. MPP cluster: a cluster of PostgreSQL clusters, one of which acts as the coordinator while the others act as workers.
3. Coordinator cluster: a PostgreSQL cluster which acts in the 'coordinator' role within an MPP cluster.
4. Worker cluster: a PostgreSQL cluster which acts in the 'worker' role within an MPP cluster.
5. Patroni cluster: any cluster managed by Patroni can be called a Patroni cluster, but we usually use this term to refer to a single PostgreSQL cluster or an MPP cluster.
Provide info about the PG parameters that caused the "pending restart"
flag to be set. Both `patronictl list` and the `/patroni` REST API endpoint
now show the parameter names and the diff as the "pending restart
reason".
1. extract the `GlobalConfig` class to its own module
2. make the module instantiate the `GlobalConfig` object on load and replace its entry in `sys.modules` with this instance (see the sketch below)
3. don't pass the `GlobalConfig` object around, but use the `patroni.global_config` module everywhere
4. move `ignore_slots_matchers`, `max_timelines_history`, and `permanent_slots` from `ClusterConfig` to `GlobalConfig`
5. add a `use_slots` property to global_config and remove duplicated code from `Cluster` and `Postgresql.ConfigHandler`
Besides that, improve the readability of a couple of checks in ha.py and the formatting of the `/config` key when saved from patronictl.
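Step 2 relies on a well-known Python trick: a module can replace its own entry in `sys.modules` with an object. A minimal sketch of the pattern (illustrative names, not Patroni's actual implementation):
```python
# global_config.py -- on import, the module is swapped for a GlobalConfig instance,
# so callers simply do `import patroni.global_config as global_config` and use it.
import sys
import types


class GlobalConfig(types.ModuleType):

    def __init__(self):
        super().__init__(__name__)
        self._config = {}

    def update(self, config):
        self._config = config or {}

    @property
    def use_slots(self):
        # hypothetical property mirroring item 5 above
        return bool(self._config.get('postgresql', {}).get('use_slots', True))


sys.modules[__name__] = GlobalConfig()
```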
- Don't set the leader in the failover key from patronictl failover
- Show a warning and execute a switchover if the leader option is provided to the patronictl failover command
- Be more precise in the log messages
- Allow failing over to an async candidate in sync mode
- Check whether the candidate is the same as the leader specified in the API
- Fix and extend some tests
- Add documentation
Sharing a single connection between the REST API and the main thread (doing heartbeats) was working mostly fine, except when Postgres became so slow that REST API queries started blocking the main loop.
If the dedicated REST API connection isn't available, we use the heartbeat connection as a fallback.
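In spirit the fallback looks like this (a sketch with hypothetical helper names, not the actual code):
```python
# Sketch: use the dedicated REST API connection when it can be established,
# otherwise reuse the heartbeat connection so monitoring endpoints keep working.
def connection_for_restapi(get_dedicated_connection, heartbeat_connection):
    try:
        return get_dedicated_connection()
    except Exception:
        return heartbeat_connection
```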
The same (almost) logic was used in three different places:
1. `Patroni` class
2. `Member` class
3. `_MemberStatus` class
Now they all inherit from the newly introduced `Tags` class.
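A rough sketch of the shared abstraction (the tag names shown are the usual Patroni tags; the actual class definition may differ):
```python
# Sketch: one abstract Tags class; Patroni, Member and _MemberStatus only need to
# say where their raw tags dictionary comes from.
import abc
from typing import Any, Dict, Optional


class Tags(abc.ABC):

    @property
    @abc.abstractmethod
    def tags(self) -> Dict[str, Any]:
        """Raw tags dictionary of this object."""

    @property
    def nofailover(self) -> bool:
        return bool(self.tags.get('nofailover', False))

    @property
    def replicatefrom(self) -> Optional[str]:
        return self.tags.get('replicatefrom')
```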
1. stop using the same cursor all the time; it creates problems when not carefully used from different threads
2. introduce a query() method in the Connection class and make it return a result set when possible (see the sketch below)
3. refactor most of the code that relies (directly or indirectly) on the Connection object to use the query() method as much as possible
This refactoring helps to reduce code complexity and will help with the future introduction of a separate database connection for the REST API thread. The latter will improve reliability when the system is under significant stress and simple monitoring queries take seconds to execute, causing the REST API to block the main thread.
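A minimal sketch of what such a query() helper can look like (illustrative, assuming a DB-API connection; not the actual Patroni class):
```python
# Sketch: one short-lived cursor per call instead of a long-lived shared cursor,
# returning a result set only when the statement produced one.
class Connection:

    def __init__(self, conn):
        self._conn = conn

    def query(self, sql, *params):
        with self._conn.cursor() as cur:
            cur.execute(sql, params or None)
            if cur.description is not None:
                return cur.fetchall()
```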
Due to historical reasons (not available before 9.6) we used the `pg_current_wal_lsn()`/`pg_current_xlog_location()` functions to get the current WAL LSN on the primary. But this LSN is not necessarily synced to disk and could be lost if the primary node crashed.
To determine the replication state we use the `pg_stat_get_wal_receiver()` function, which is available since 9.6. For older versions the `patronictl list` output and REST API responses remain as before.
If there is no WAL receiver process, we check whether `restore_command` is set and show the state as `in archive recovery`.
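A sketch of that decision (illustrative, not Patroni's code; `query` is assumed to return rows from the local PostgreSQL connection):
```python
# Sketch: derive the replication state shown below from the WAL receiver status,
# falling back to restore_command when no WAL receiver is running.
def replication_state(query, restore_command):
    row = query("SELECT status FROM pg_catalog.pg_stat_get_wal_receiver()")
    if row and row[0][0] == 'streaming':
        return 'streaming'
    if restore_command:
        return 'in archive recovery'
    return ''
```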
Example of `patronictl list` output:
```bash
$ patronictl list
+ Cluster: batman -------------+---------+---------------------+----+-----------+
| Member      | Host           | Role    | State               | TL | Lag in MB |
+-------------+----------------+---------+---------------------+----+-----------+
| postgresql0 | 127.0.0.1:5432 | Leader  | running             | 12 |           |
| postgresql1 | 127.0.0.1:5433 | Replica | in archive recovery | 12 |         0 |
+-------------+----------------+---------+---------------------+----+-----------+
$ patronictl list
+ Cluster: batman -------------+---------+-----------+----+-----------+
| Member      | Host           | Role    | State     | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgresql0 | 127.0.0.1:5432 | Leader  | running   | 12 |           |
| postgresql1 | 127.0.0.1:5433 | Replica | streaming | 12 |         0 |
+-------------+----------------+---------+-----------+----+-----------+
```
Example of REST API response:
```bash
$ curl -s localhost:8009 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-07-06 13:12:00.595118+02:00",
  "role": "replica",
  "server_version": 150003,
  "xlog": {
    "received_location": 335544480,
    "replayed_location": 335544480,
    "replayed_timestamp": null,
    "paused": false
  },
  "timeline": 12,
  "replication_state": "in archive recovery",
  "dcs_last_seen": 1688642069,
  "database_system_identifier": "7252327498286490579",
  "patroni": {
    "version": "3.0.3",
    "scope": "batman"
  }
}
$ curl -s localhost:8009 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-07-06 13:12:00.595118+02:00",
  "role": "replica",
  "server_version": 150003,
  "xlog": {
    "received_location": 335544816,
    "replayed_location": 335544816,
    "replayed_timestamp": null,
    "paused": false
  },
  "timeline": 12,
  "replication_state": "streaming",
  "dcs_last_seen": 1688642089,
  "database_system_identifier": "7252327498286490579",
  "patroni": {
    "version": "3.0.3",
    "scope": "batman"
  }
}
```
Revert to using `ssl._ssl._test_decode_cert`
A change was included as part of the Patroni 3.0.3 release: use
public functions instead of `ssl._ssl._test_decode_cert` to get the
serial number of certificates.
There was a slight bug in that implementation: it was only loading
the certificates through `load_verify_locations`, but was missing the
call to fetch them back through `get_ca_certs`. As a consequence
Patroni was no longer able to reload the REST API cert on SIGHUP.
An attempt to fix that issue was made through commit
`20f578f09f3aa604e5288710d4fd4e611152ed5f`. However, even with the
correct call of `get_ca_certs`, a corner case was detected where
`load_verify_locations` would skip loading a certificate: if it was
issued with `CA:FALSE`. That essentially means the implementation is
still buggy in that situation. See [CPython](c283a0cff5/Modules/_ssl.c (L4618C14-L4619))
for the underlying problem.
In order to get back to a functional implementation we are reverting
the code to use the private function `ssl._ssl._test_decode_cert`.
We can later study a possibly more elegant alternative for solving
this, if any.
---------
Signed-off-by: Israel Barth Rubio <israel.barth@enterprisedb.com>
* Ignore D401 in flake8-docstrings
* Fix newly reported flake8 issues, ignore the old W503 rule
* Rely on concatenation of adjacent strings
* Format behave scripts
* Reformat ha.py according to new rules
Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
Previously it used to compare against both the leader and sync_standbys, while in some cases (actually most of them) the leader should be excluded.
This commit makes the `matches()` method flexible:
1. The leader is included in the comparison only if requested
2. Checks are performed case-insensitively (like PG does)
Besides that, everywhere in the code we start using `cluster.sync.matches()` instead of `name in cluster.sync.members`.
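A hedged sketch of what the flexible check could look like (names are illustrative, not the actual implementation):
```python
# Sketch: case-insensitive membership check with an optional leader comparison.
def matches(self, name, check_leader=False):
    if not name:
        return False
    candidates = list(self.members) + ([self.leader] if check_leader else [])
    return name.lower() in (m.lower() for m in candidates if m)
```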
Keep as much backward compatibility as possible.
The following changes were made:
1. All internal checks are performed as `role in ('master', 'primary')`
2. All internal variables/functions/methods are renamed
3. `GET /metrics` endpoint returns `patroni_primary` in addition to `patroni_master`.
4. Logs are changed to use leader/primary/member/remote depending on the context
5. Unit tests use only role = 'primary' instead of 'master' to verify that (1) works.
6. patronictl still supports the old syntax, but also accepts `--leader` and `--primary`.
7. `master_(start|stop)_timeout` is automatically translated to `primary_(start|stop)_timeout` if the latter is not set.
8. updated the documentation and some examples
Future plan: in the next major release switch role name from `master` to `primary` and maybe drop `master` altogether.
The Kubernetes implementation will require more work and will keep two labels in parallel. Label values should probably be configurable as described in https://github.com/zalando/patroni/issues/2495.
A Citus cluster (coordinator and workers) is stored in DCS as a fleet of Patroni clusters logically grouped together:
```
/service/batman/
/service/batman/0/
/service/batman/0/initialize
/service/batman/0/leader
/service/batman/0/members/
/service/batman/0/members/m1
/service/batman/0/members/m2
/service/batman/1/
/service/batman/1/initialize
/service/batman/1/leader
/service/batman/1/members/
/service/batman/1/members/m1
/service/batman/1/members/m2
...
```
where 0 is the Citus group of the coordinator and 1, 2, etc. are the worker groups.
Such a hierarchy allows reading the entire Citus cluster with a single call to DCS (except for Zookeeper).
The get_cluster() method reads the entire Citus cluster when running on the coordinator, because it needs to discover the workers. For a worker cluster it reads only the subtree of its own group.
Besides that, we introduce a new method, get_citus_coordinator(). It will be used only by worker clusters.
Since there are no hierarchical structures on K8s, we will use the Citus group suffix on all objects that Patroni creates.
E.g.
```
batman-0-leader # the leader config map for the coordinator
batman-0-config # the config map holding initialize, config, and history "keys"
...
batman-1-leader # the leader config map for worker group 1
batman-1-config
...
```
Citus integration is enabled from patroni.yaml:
```yaml
citus:
  database: citus
  group: 0  # 0 is for the coordinator; 1, 2, etc. are for workers
```
If enabled, Patroni will create the database and the citus extension in it, and will insert into `pg_dist_authinfo` the information required for Citus nodes to communicate with each other, i.e. 'password', 'sslcert', and 'sslkey' for the superuser, if they are defined in the Patroni configuration file.
When a new Citus coordinator/worker is bootstrapped, Patroni adds `synchronous_mode: on` to the `bootstrap.dcs` section.
Besides that, Patroni takes over management of some Postgres GUCs (see the sketch after this list):
- `shared_preload_libraries` - Patroni ensures that "citus" is added in the first position
- `max_prepared_transactions` - if not set or set to 0, Patroni changes the value to `max_connections*2`
- `wal_level` - automatically set to `logical`. It is used by Citus to move/split shards. Under the hood Citus creates/removes replication slots, and they are automatically added by Patroni to the `ignore_slots` configuration to avoid accidental removal.
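A sketch of those adjustments, assuming plain dictionary manipulation (Patroni's real logic lives in its configuration handling, so treat this as illustrative only):
```python
# Sketch: the three GUC adjustments described above.
def adjust_citus_gucs(gucs):
    libs = [v.strip() for v in gucs.get('shared_preload_libraries', '').split(',') if v.strip()]
    if 'citus' in libs:
        libs.remove('citus')
    gucs['shared_preload_libraries'] = ','.join(['citus'] + libs)  # citus goes first

    if not int(gucs.get('max_prepared_transactions') or 0):
        gucs['max_prepared_transactions'] = int(gucs['max_connections']) * 2

    gucs['wal_level'] = 'logical'
    return gucs
```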
The coordinator primary actively discovers worker primary nodes and registers/updates them in the `pg_dist_node` table using the citus_add_node() and citus_update_node() functions.
Patroni running on the coordinator provides a new REST API endpoint: `POST /citus`. It is used by workers to facilitate controlled switchovers and restarts of worker primaries.
When a worker primary needs to shut down Postgres because of a restart or switchover, it calls the `POST /citus` endpoint on the coordinator, and Patroni on the coordinator starts a transaction and calls `citus_update_node(nodeid, 'host-demoted', port)` in order to pause client connections that work with the given worker.
Once the new leader is elected or Postgres is started back up, another call to the `POST /citus` endpoint is performed, which does another `citus_update_node()` call with the actual hostname and port and commits the transaction. After the transaction is committed, the coordinator reestablishes connections to the worker node and client connections are unblocked.
If clients don't run long transactions, the operation finishes without client-visible errors, only with a short latency spike.
All operations on `pg_dist_node` are serialized by Patroni on the coordinator. This allows more control and makes it possible to ROLLBACK an in-progress transaction if its lifetime exceeds a certain threshold while there are other worker nodes that should be updated.
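In SQL terms the pause/resume sequence could look roughly like this (a sketch over a DB-API connection; the exact arguments Patroni passes are not spelled out here):
```python
# Sketch: the open transaction on the coordinator pauses clients of the worker
# until the new location is committed via the second citus_update_node() call.
def pause_and_update_worker(conn, nodeid, host, port, wait_for_new_leader):
    # conn is a non-autocommit connection to the coordinator: everything below
    # runs in a single transaction.
    with conn.cursor() as cur:
        # point the node at a non-existing host to pause client connections
        cur.execute("SELECT citus_update_node(%s, %s, %s)",
                    (nodeid, host + '-demoted', port))
        new_host, new_port = wait_for_new_leader()  # the second POST /citus call
        cur.execute("SELECT citus_update_node(%s, %s, %s)",
                    (nodeid, new_host, new_port))
    conn.commit()  # client connections to the worker resume after this
```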
If enabled, it will allow Patroni to cope with DCS outages.
In case of a DCS outage the leader tries to contact all remaining members of the cluster via the REST API, and if all of them respond with success, the leader will not be demoted.
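A rough sketch of that check (hypothetical helper names, not the actual implementation):
```python
# Sketch: during a DCS outage the leader keeps its role only if every other
# member acknowledges it over the REST API.
def leader_can_keep_role(leader, members, post_failsafe):
    others = [m for m in members if m.name != leader.name]
    return all(post_failsafe(member, leader.name) for member in others)
```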
The failsafe_mode can be enabled by running
```sh
patronictl edit-config -s failsafe_mode=true
```
or by calling the `/config` REST API endpoint.
Co-authored-by: Polina Bungina <bungina@gmail.com>
HAProxy closes connections as soon as it gets the HTTP status code, leaving no time for Patroni to properly shut down the SSL connection.
Close https://github.com/zalando/patroni/issues/2466
Windows doesn't support `SIGTERM`, but in the majority of cases our behave tests rely on Patroni's graceful shutdown.
In order to emulate this behaviour we introduced a new REST API endpoint, `POST /sigterm`. The endpoint works only on Windows and only when the `BEHAVE_DEBUG` environment variable is set.
Besides that, some minor adjustments were made in the behave tests, mainly related to backslash/slash handling.
In addition, test coverage on Windows was improved by properly mocking access to the filesystem and avoiding calls to
`subprocess.call()`. Specifically, symlink creation on Windows requires Admin privileges and there is no `true.exe`.
When switching certificates there is a race condition with a concurrent API request. If one is active during the replacement period, the replacement will error out with a "port in use" error and Patroni gets stuck in a state without an active API server.
The fix is to call server_close after shutdown, which will wait for already running requests to complete before returning.
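The reload sequence implied by the fix, in sketch form (illustrative helper names):
```python
# Sketch: stop the serve_forever() loop, then close the listening socket; with a
# threading server this waits for in-flight requests before the port is reused.
def restart_api_server(old_server, create_server):
    old_server.shutdown()
    old_server.server_close()
    return create_server()  # binds the port again with the new certificate
```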
Close #2184
This field notes the last time (as unix epoch) a cluster member has successfully communicated with the DCS. This is useful to identify and/or analyze network partitions.
Also, expose dcs_last_seen in the MemberStatus class and its from_api_response() method.
If configured, only IPs that match the rules are allowed to call unsafe REST API endpoints.
In addition, it is possible to automatically include the IPs of cluster members in the list.
If neither of the above is configured, the old behavior is retained.
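A minimal sketch of the allow-list idea using the standard ipaddress module (option names and structure here are illustrative, not Patroni's):
```python
# Sketch: allow a client IP if it matches a configured rule or a cluster member IP;
# with nothing configured, everything is allowed (the old behavior).
import ipaddress

def is_allowed(client_ip, allowlist=(), member_ips=()):
    if not allowlist and not member_ips:
        return True
    networks = [ipaddress.ip_network(rule, strict=False) for rule in allowlist]
    networks += [ipaddress.ip_network(ip) for ip in member_ips]
    client = ipaddress.ip_address(client_ip)
    return any(client in network for network in networks)
```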
Partially address https://github.com/zalando/patroni/issues/1734
Promoting a standby cluster requires updating load-balancer health checks, which is not very convenient and is easy to forget.
In order to solve this, we change the behavior of the `/leader` health-check endpoint: it will return 200 regardless of whether PostgreSQL is running as the primary or as the standby_leader.
Effectively, this PR consists of a few changes:
1. The easy part:
If permanent logical slots are defined in the global configuration, Patroni on the primary will not only create them, but will also periodically update DCS with the current `confirmed_flush_lsn` values of all these slots.
In order to reduce the number of interactions with DCS, a new `/status` key was introduced. It contains a JSON object with `optime` and `slots` keys. For backward compatibility `/optime/leader` will still be updated if there are members running an old Patroni version in the cluster.
2. The tricky part:
On replicas that are eligible for a failover, Patroni creates the logical replication slot by copying the slot file from the primary and restarting the replica. In order to copy the slot file, Patroni opens a connection to the primary with `rewind` or `superuser` credentials and calls the `pg_read_binary_file()` function.
When the logical slot already exists on the replica, Patroni periodically calls the `pg_replication_slot_advance()` function, which moves the slot forward (see the sketch after this section).
3. Additional requirements:
In order to ensure that the primary doesn't clean up tuples from pg_catalog that are required for logical decoding, Patroni enables `hot_standby_feedback` on replicas with logical slots and on cascading replicas if they are used for streaming by replicas with logical slots.
4. When logical slots are copied to the replica, there is a timeframe during which it could be unsafe to use them after promotion. Right now there is no protection against promoting such a replica, but Patroni will show a warning with the names of the slots that might not be safe to use.
Compatibility.
The `pg_replication_slot_advance()` function is only available starting from PostgreSQL 11. For older Postgres versions Patroni will refuse to create logical slots on the primary.
The old "permanent slots" feature, which created logical slots right after promotion and before allowing connections, was removed.
Close: https://github.com/zalando/patroni/issues/1749
This patch fixes the error handling of cases where there are runtime errors in `socketserver`.
For example, when creating a new thread (to handle a request) fails.
`get_request` handles SSL connections by replacing the new client socket with a tuple containing `(server_socket, new_client_socket)`, in order to later deal with handshakes in `process_request_thread`.
During the processing of a request, the socketserver `BaseServer` calls `handle_request`, which calls `_handle_request_noblock`, which in turn calls the following functions (https://github.com/python/cpython/blob/3.8/Lib/socketserver.py#L303):
```python
request, client_address = self.get_request()
self.verify_request(request, client_address)
self.process_request(request, client_address)
self.handle_error(request, client_address)   # only when process_request raises
self.shutdown_request(request)
```
- `get_request` is overloaded in Patroni and returns `request` as a tuple in the SSL case
- `verify_request` defaults to `return True` and would have to be fixed if used, but is fine in this case
- `process_request` just calls `process_request_thread` (which is overloaded in Patroni and handles tuple-style requests)
- `handle_error` is overloaded in Patroni and handles tuple-style requests
- but `shutdown_request` is not overloaded and thus lacks support for tuple-style requests
This patch adds support for tuple-style requests to the Patroni API.
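A hedged sketch of the kind of override this implies (not the exact Patroni code): unwrap the tuple before the normal shutdown.
```python
# Sketch: handle the (server_socket, new_client_socket) tuple produced by
# get_request() before delegating to the standard shutdown logic.
from http.server import HTTPServer
from socketserver import ThreadingMixIn


class RestApiServer(ThreadingMixIn, HTTPServer):

    def shutdown_request(self, request):
        if isinstance(request, tuple):
            request = request[1]  # keep only the client socket
        super().shutdown_request(request)
```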
* update release notes
* bump version
* change the default alignment in patronictl table output to `left`
* add missing tests
* add missing pieces to the documentation