Commit Graph

19 Commits

Author SHA1 Message Date
Alexander Kukushkin
a8cfd46801 Retry one time on Etcd3 auth error (#3026)
But do it only if we didn't authenticate right before executing a request. Previously retries only happened when the caller was executed with `Retry.__call__()`, which is not the case for methods like `set_failover_value()` or `set_config_value()`. Also, it seems that existing watchers aren't affected, therefore we will not restart them after reauthentication.
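
A minimal sketch of that behaviour, assuming hypothetical names (`AuthError`, `_last_auth`, and `_reauthenticate` are illustrative, not the real Patroni attributes):
```python
import time


class AuthError(Exception):
    """Stand-in for an Etcd3 authentication error (illustrative)."""


class ClientSketch:
    def __init__(self):
        self._last_auth = 0.0  # hypothetical timestamp of the last authentication

    def _reauthenticate(self):
        # hypothetical: obtain a fresh auth token from Etcd
        self._last_auth = time.time()

    def call(self, func, *args):
        try:
            return func(*args)
        except AuthError:
            # Retry exactly once, and only if we did not authenticate
            # right before executing this request.
            if time.time() - self._last_auth < 1.0:
                raise
            self._reauthenticate()
            return func(*args)
```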

In addition to that, fix issues with `Retry.ensure_deadline(0)`:
1. the return value was ignored
2. we don't have to set the `Retry.deadline` attr; it is not used anywhere

Close https://github.com/zalando/patroni/issues/3023
2024-03-07 12:01:35 +01:00
Alexander Kukushkin
bcfd8438a5 Abstract CitusHandler and decouple it from configuration (#2950)
The main issue was that the configuration for the Citus handler and for DCS existed in two places, while ideally AbstractDCS should not know many details about what kind of MPP is in use.

To solve the problem we first dynamically create an object implementing the AbstractMPP interface, which serves as the configuration for DCS. Later this object is used to instantiate the class implementing the AbstractMPPHandler interface.

This is just a starting point, which does some heavy lifting. As a next step, all kinds of variables named after Citus in files other than patroni/postgres/mpp/citus.py should be renamed.

In other words this commit takes over the most complex part of #2940, which was never implemented.
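
Schematically the new layering could look like this (only `AbstractMPP`/`AbstractMPPHandler` come from the commit; the factory and attribute names are illustrative):
```python
class AbstractMPP:
    """Configuration-level view of an MPP solution: all that DCS needs to know."""


class AbstractMPPHandler(AbstractMPP):
    """Runtime handler instantiated from the configuration object."""


class CitusConfigSketch(AbstractMPP):
    def __init__(self, config):
        self.group = config.get('group')
        self.database = config.get('database')


class CitusHandlerSketch(AbstractMPPHandler, CitusConfigSketch):
    def __init__(self, postgresql, config):
        CitusConfigSketch.__init__(self, config)
        self._postgresql = postgresql


def get_mpp(config):
    """Hypothetical factory: build the AbstractMPP object that is handed to DCS."""
    if 'citus' in config:
        return CitusConfigSketch(config['citus'])
    return AbstractMPP()  # no-op placeholder when no MPP is configured
```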

Co-authored-by: zhjwpku <zhjwpku@gmail.com>
2023-12-21 08:58:26 +01:00
Alexander Kukushkin
d471f1156d Handle AuthOldRevision error (#2913)
The error is raised when Etcd is configured to use JWT auth tokens and the user database in Etcd is updated, because the update invalidates all tokens.

If retries are requested - try to get a new token and repeat the request. Repeat it in a loop until the request is successfully executed or until `retry_timeout` is exhausted. This is the only way of solving the race condition, because between authentication and executing the request yet another modification of the user database in Etcd might happen.

If the request doesn't have to be immediately retried - set a flag that the next API request should perform the authentication first and let Patroni naturally repeat the request on the next heartbeat loop.
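
A schematic of the retry loop described above (all names here are illustrative, not the real client code):
```python
import time


class AuthOldRevision(Exception):
    """Stand-in for the Etcd error raised when the auth token was invalidated."""


def call_with_reauth(request, authenticate, retry_timeout=10.0, retries=True):
    """Illustrative only: repeat the request with a fresh token until it
    succeeds or retry_timeout is exhausted."""
    deadline = time.time() + retry_timeout
    while True:
        try:
            return request()
        except AuthOldRevision:
            if not retries or time.time() >= deadline:
                # give up: the next heartbeat loop will authenticate first
                # and repeat the request naturally
                raise
            # the user database may change again before we issue the request,
            # hence authenticate and retry in a loop
            authenticate()
```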

Co-authored-by: Kenny Do <kedo@render.com>
Ref: https://github.com/zalando/patroni/pull/2911
2023-10-23 14:00:37 +02:00
Alexander Kukushkin
9209a5a133 Refactor delete_leader interface (#2810)
Similar to https://github.com/zalando/patroni/pull/2690, but it mostly helps the Consul implementation.
2023-08-11 10:19:29 +02:00
Alexander Kukushkin
8f3ed00886 Invalidate cache if txn failed due to revision mismatch (#2783)
It was reported in #2779 that the primary was constantly logging messages like `Synchronous replication key updated by someone else`.

It happened after Patroni was stuck due to resource starvation. Key updates are performed using the create_revision/mod_revision field, whose value is taken from the internal cache. Hence, it is a clear symptom of a stale cache.

Similar issues in the K8s implementation were addressed by invalidating the cache and restarting watcher connections every time an update failed due to a resource_version mismatch, so we do the same for Etcd3.
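
Conceptually the fix amounts to something like this (class and method names are illustrative):
```python
class Etcd3CacheSketch:
    """Illustrative: drop the cached cluster view when a transaction fails
    because the cached create_revision/mod_revision no longer matches."""

    def __init__(self):
        self._cluster = None  # cached cluster state used to build txn compares

    def _restart_watchers(self):
        pass  # hypothetical: force watcher connections to be re-established

    def update_sync_state(self, txn):
        if not txn():  # txn failed: the compare on the cached revision didn't match
            self._cluster = None      # invalidate the stale cache
            self._restart_watchers()  # so the next loop sees fresh revisions
            return False
        return True
```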
2023-07-31 10:16:19 +02:00
Alexander Kukushkin
7e89583ec7 Please new flake8 (#2789)
It stopped accepting the lack of a space character between `,` and `\`:
```python
foo,\
    bar
```
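
Presumably the accepted form simply adds the space before the backslash:
```python
foo, \
    bar
```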
2023-07-31 09:08:46 +02:00
Alexander Kukushkin
af8e5f0d0f Refactor update_leader interface (#2690)
Pass a reference to the last known leader object in order to avoid obtaining it from the `AbstractDCS.cluster` cache.

This change is useful for Consul, Etcd3 and Zookeeper implementations.
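
In terms of the interface, the change is roughly this (class and helper names are illustrative):
```python
class DCSSketch:
    """Illustrative only: the shape of the update_leader() interface change."""

    def update_leader_before(self, **kwargs):
        # previously the implementation obtained the leader itself,
        # typically from the AbstractDCS.cluster cache
        leader = self.cluster.leader
        return self._do_update(leader, **kwargs)

    def update_leader_after(self, leader, **kwargs):
        # now the caller passes a reference to the last known leader object
        return self._do_update(leader, **kwargs)
```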
2023-05-25 14:21:05 +02:00
Alexander Kukushkin
1c7bf2f59e Fix a problem with etcd3.update_leader() (#2693)
It didn't take into account the fact that we can get a new lease after changing a TTL. In this case we have to update the existing leader key with the new lease.
To solve it we introduce the transaction 'failure' callback. Schematically, the whole workflow looks like:
```python
txn(
    compare=(old_value == self._name),
    success=put(self.leader_path, self._name, self._lease),
    failure=txn(
        compare=(create_revision == '0'),
        success=put(self.leader_path, self._name, self._lease)
    )
)
```

The problem was introduced in d98d6d0b02c9cc67464a7b1b31b1c5570c26e12d
2023-05-25 10:01:43 +02:00
Alexander Kukushkin
b4afc6830b Little fixes in etcd3 and kubernetes (#2689)
- Always pass etcd3 key revision as a string
- Make sure the leader key isn't unconditionally overwritten. It may happen that the leader heartbeat loop didn't run properly and the session has expired. In this case the leader may create a new session and a new leader key. But there is a chance that another node has already created a leader key, and we don't want to overwrite it (see the sketch after this list).
- Try to sync HA loops between nodes by adding 0.5 seconds to the timeout on non-leader nodes
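
For the second point, the decision logic can be sketched as follows (purely illustrative, not the real Etcd3 transaction code):
```python
def should_write_leader_key(existing_value, create_revision, my_name):
    """Write the leader key only if we already own it, or if it does not exist
    yet (create_revision == '0'); never overwrite a key that another node may
    have just created with its new session."""
    return existing_value == my_name or create_revision == '0'
```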
2023-05-24 10:54:26 +02:00
Alexander Kukushkin
7941c86775 Refactor write_sync_state() (#2669)
Make it return the new `SyncState` object in order to avoid reading the new cluster state in `Ha.process_sync_replication()`.

Now it is a small optimization, but it will become very handy in the quorum commit feature.
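
The shape of the change, roughly (the `SyncState` fields mirror the sync state Patroni keeps in DCS; everything else is illustrative):
```python
from typing import NamedTuple, Optional


class SyncState(NamedTuple):
    version: Optional[int]
    leader: Optional[str]
    sync_standby: Optional[str]


def write_sync_state(leader, sync_standby, version=None) -> Optional[SyncState]:
    """Illustrative: return the new SyncState on success instead of a bare
    boolean, so Ha.process_sync_replication() doesn't have to re-read the
    cluster state from DCS."""
    new_state = SyncState(version, leader, sync_standby)
    # ... write new_state to DCS here; return None if the write failed (sketch)
    return new_state
```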
2023-05-11 09:58:15 +02:00
Alexander Kukushkin
76b3b99de2 Enable pyright strict mode (#2652)
- added pyrightconfig.json with typeCheckingMode=strict
- added type hints to all files except api.py
- added type stubs for dns, etcd, consul, kazoo, pysyncobj and other modules
- added type stubs for psycopg2 and urllib3 with some little fixes
- fixed most of the issues reported by pyright
- remaining issues will be addressed later, along with enabling the CI linting task
2023-05-09 09:38:00 +02:00
Polina Bungina
3fe2a7868a Ignore D401 in flake8-docstrings (#2627)
* Ignore D401 in flake8-docstrings
* Fix newly reported flake8 issues, ignore the old W503 rule
* Rely on concatenation of adjacent strings
* Format behave scripts
* Reformat ha.py according to new rules

Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
2023-04-03 09:52:22 +02:00
Alexander Kukushkin
4872ac51e0 Citus integration (#2504)
A Citus cluster (coordinator and workers) will be stored in DCS as a fleet of Patroni clusters logically grouped together:
```
/service/batman/
/service/batman/0/
/service/batman/0/initialize
/service/batman/0/leader
/service/batman/0/members/
/service/batman/0/members/m1
/service/batman/0/members/m2
/service/batman/1/
/service/batman/1/initialize
/service/batman/1/leader
/service/batman/1/members/
/service/batman/1/members/m1
/service/batman/1/members/m2
...
```

Where 0 is the Citus group for the coordinator and 1, 2, etc. are worker groups.

Such a hierarchy allows reading the entire Citus cluster with a single call to DCS (except for Zookeeper).

The get_cluster() method will read the entire Citus cluster on the coordinator because it needs to discover workers. For a worker cluster it will read only the subtree of its own group.

Besides that, we introduce a new method, get_citus_coordinator(). It will be used only by worker clusters.

Since there are no hierarchical structures on K8s, we will use the Citus group suffix on all objects that Patroni creates.
E.g.
```
batman-0-leader  # the leader config map for the coordinator
batman-0-config  # the config map holding initialize, config, and history "keys"
...
batman-1-leader  # the leader config map for worker group 1
batman-1-config
...
```

Citus integration is enabled from patroni.yaml:
```yaml
citus:
  database: citus
  group: 0  # 0 is for coordinator, 1, 2, etc are for workers
```

If enabled, Patroni will create the database and the citus extension in it, and INSERT INTO `pg_dist_authinfo` the information required for Citus nodes to communicate with each other, i.e. 'password', 'sslcert', and 'sslkey' for the superuser if they are defined in the Patroni configuration file.
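
The `pg_dist_authinfo` part boils down to an INSERT of this shape (a psycopg2 sketch; connection details, the nodeid value, and the authinfo contents are illustrative):
```python
import psycopg2

# Illustrative: register superuser connection parameters in pg_dist_authinfo
# so Citus nodes can authenticate to each other.  See the Citus documentation
# for the exact nodeid matching rules; 0 is used here purely for illustration.
authinfo = "password=secret sslcert=/path/server.crt sslkey=/path/server.key"
with psycopg2.connect(dbname='citus') as conn:
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO pg_dist_authinfo (nodeid, rolename, authinfo) "
            "VALUES (%s, %s, %s)",
            (0, 'postgres', authinfo),
        )
```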

When the new Citus coordinator/worker is bootstrapped, Patroni adds `synchronous_mode: on` to the `bootstrap.dcs` section.

Besides that, Patroni takes over management of some Postgres GUCs (a rough sketch follows this list):
- `shared_preload_libraries` - Patroni ensures that "citus" is added in the first position
- `max_prepared_transactions` - if not set or set to 0, Patroni changes the value to `max_connections*2`
- `wal_level` - automatically set to `logical`. It is used by Citus to move/split shards. Under the hood Citus creates/removes replication slots, and they are automatically added by Patroni to the `ignore_slots` configuration to avoid accidental removal.
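
The helper below is hypothetical, but shows what those adjustments amount to:
```python
def adjust_citus_gucs(parameters: dict) -> dict:
    """Hypothetical helper: the Citus-related GUC adjustments in one place."""
    # ensure "citus" is the first entry of shared_preload_libraries
    spl = [v.strip() for v in parameters.get('shared_preload_libraries', '').split(',') if v.strip()]
    if 'citus' in spl:
        spl.remove('citus')
    parameters['shared_preload_libraries'] = ','.join(['citus'] + spl)

    # max_prepared_transactions must not be 0, Citus relies on 2PC
    if int(parameters.get('max_prepared_transactions') or 0) == 0:
        parameters['max_prepared_transactions'] = int(parameters.get('max_connections') or 100) * 2

    # logical decoding is required to move/split shards
    parameters['wal_level'] = 'logical'
    return parameters
```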

The coordinator primary actively discovers worker primary nodes and registers/updates them in the `pg_dist_node` table using the citus_add_node() and citus_update_node() functions.

Patroni running on the coordinator provides a new REST API endpoint: `POST /citus`. It is used by workers to facilitate controlled switchovers and restarts of worker primaries.
When the worker primary needs to shut down Postgres because of a restart or switchover, it calls the `POST /citus` endpoint on the coordinator; Patroni on the coordinator starts a transaction and calls `citus_update_node(nodeid, 'host-demoted', port)` in order to pause client connections that work with the given worker.
Once the new leader is elected or Postgres is started back, another call to the `POST /citus` endpoint is made, which does another `citus_update_node()` call with the actual hostname and port and commits the transaction. After the transaction is committed, the coordinator reestablishes connections to the worker node and client connections are unblocked.
If clients don't run long transactions, the operation finishes without client-visible errors, only with a short latency spike.
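
On the coordinator side the pause/resume dance could be sketched like this (the handler shape, the `phase` argument, and the connection handling are illustrative; the connection is assumed to be in autocommit mode so the explicit transaction can span two REST calls):
```python
# Illustrative sketch of the coordinator-side handling of POST /citus requests.
# `conn` is a psycopg2 connection in autocommit mode that is kept open between
# the two requests, so BEGIN/COMMIT are issued manually.
def handle_citus_request(conn, nodeid, host, port, phase):
    cur = conn.cursor()
    if phase == 'before_demote':
        # Start a transaction and point the worker's entry to a non-existing
        # host: client connections to this worker are paused until COMMIT.
        cur.execute('BEGIN')
        cur.execute('SELECT citus_update_node(%s, %s, %s)',
                    (nodeid, host + '-demoted', port))
    elif phase == 'after_promote':
        # The new worker primary reports its real address; committing unblocks
        # client connections, which are re-established to the new primary.
        cur.execute('SELECT citus_update_node(%s, %s, %s)', (nodeid, host, port))
        cur.execute('COMMIT')
```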

All operations on `pg_dist_node` are serialized by Patroni on the coordinator. This gives more control and makes it possible to ROLLBACK a transaction in progress if its lifetime exceeds a certain threshold while other worker nodes still have to be updated.
2023-01-24 16:14:58 +01:00
Alexander Kukushkin
92d3e1c167 Introduce the failsafe key in DCS (#2485)
Extracted from #2379
2022-12-13 11:35:06 +01:00
Alexander Kukushkin
6ad5fee99d Raise DCSError when communication with DCS fails (#2484)
Previously such an exception was raised only from the `get_cluster()` method; now we do the same from the `update_leader()` and `attempt_to_acquire_leader()` methods.

These methods influence Postgres promotion and demotion, and we want to distinguish between different types of failures. Specifically, whether calls have failed because DCS isn't accessible or due to a timeout.
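
Schematically (the `DCSError` class exists in Patroni; the wrapper below is only a sketch):
```python
class DCSError(Exception):
    """Raised when communication with DCS fails."""


def attempt_to_acquire_leader_sketch(dcs_call):
    """Illustrative: convert low-level client failures into DCSError so the HA
    loop can tell 'DCS unreachable or timed out' apart from 'lock not acquired'."""
    try:
        return dcs_call()
    except Exception as exc:  # e.g. connection refused or request timeout
        raise DCSError(str(exc)) from exc
```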

This commit is extracted from the #2379
2022-12-13 11:06:55 +01:00
Alexander Kukushkin
448d703733 Explicitly request cluster version when connecting to etcd via proxy (#1974)
Close https://github.com/zalando/patroni/issues/1971
2021-06-24 08:51:07 +02:00
Alexander Kukushkin
c7173aadd7 Failover logical slots (#1820)
Effectively, this PR consists of a few changes:

1. The easy part:
  If permanent logical slots are defined in the global configuration, Patroni on the primary will not only create them, but also periodically update DCS with the current values of `confirmed_flush_lsn` for all these slots.
  In order to reduce the number of interactions with DCS, a new `/status` key was introduced. It contains a JSON object with `optime` and `slots` keys. For backward compatibility, `/optime/leader` will be updated if there are members running an old Patroni version in the cluster.

2. The tricky part:
  On replicas that are eligible for a failover, Patroni creates the logical replication slot by copying the slot file from the primary and restarting the replica. In order to copy the slot file, Patroni opens a connection to the primary with `rewind` or `superuser` credentials and calls the `pg_read_binary_file()` function.
  When the logical slot already exists on the replica, Patroni periodically calls the `pg_replication_slot_advance()` function, which allows moving the slot forward (a sketch of both operations follows this list).

3. Additional requirements:
  In order to ensure that the primary doesn't clean up tuples from pg_catalog that are required for logical decoding, Patroni enables `hot_standby_feedback` on replicas with logical slots and on cascading replicas if they are used for streaming by replicas with logical slots.

4. When logical slots are copied to the replica there is a timeframe when it might not be safe to use them after promotion. Right now there is no protection against promoting such a replica, but Patroni will show a warning with the names of the slots that might not be safe to use.
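
The two primitives from item 2 look roughly like this in SQL, wrapped in a small psycopg2 sketch (hosts, credentials, the slot name, and the target LSN are illustrative):
```python
import psycopg2

# 1. On the primary: read the physical slot file so it can be copied to a replica.
with psycopg2.connect(host='primary', dbname='postgres', user='rewind') as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT pg_read_binary_file('pg_replslot/my_slot/state')")
        slot_state = cur.fetchone()[0]

# 2. On the replica, once the slot exists: periodically move it forward.
with psycopg2.connect(host='replica', dbname='postgres', user='postgres') as conn:
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("SELECT pg_replication_slot_advance('my_slot', %s)",
                    ('0/5000060',))
```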

Compatibility:
The `pg_replication_slot_advance()` function is only available starting from PostgreSQL 11. For older Postgres versions Patroni will refuse to create logical slots on the primary.

The old "permanent slots" feature, which creates logical slots right after promotion and before allowing connections, was removed.

Close: https://github.com/zalando/patroni/issues/1749
2021-03-25 16:18:23 +01:00
Alexander Kukushkin
8b5cb85536 Exit only if authentication explicitly failed (#1806)
It could happen that one of the Etcd servers is not accessible when Patroni starts.
In this case Patroni was trying to perform authentication and exiting, while it should exit only if Etcd explicitly responded with the `AuthFailed` error.
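
In pseudo-form, the fix moves the exit behind an explicit check (exception and helper names are illustrative):
```python
import sys


class AuthFailedError(Exception):
    """Stand-in for Etcd explicitly rejecting the provided credentials."""


def authenticate_on_start(auth_attempts):
    """Illustrative: exit only on an explicit AuthFailed response; a server
    that is merely unreachable at startup should not be fatal."""
    for attempt in auth_attempts:
        try:
            return attempt()
        except AuthFailedError:
            sys.exit('Etcd authentication failed: invalid credentials')
        except Exception:
            continue  # server not accessible: try the next one / retry later
```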

Close https://github.com/zalando/patroni/issues/1805
2021-01-15 14:31:33 +01:00
Alexander Kukushkin
3341c898ff Add Etcd v3 protocol support via api gRPC-gateway (#1162)
The only python-etcd3 client working directly via gRPC still supports only a single endpoint, which is not very nice for high availability.

Since Patroni is already using a heavily hacked version of python-etcd with smart retries and auto-discovery out-of-the-box, I decided to enhance the existing code with limited support of v3 protocol via gRPC-gateway.
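
For reference, a v3 call through the gRPC-gateway is just an HTTP POST with a JSON body and base64-encoded keys; a minimal sketch (the URL prefix differs between etcd versions, e.g. /v3alpha, /v3beta or /v3):
```python
import base64
import json
import urllib.request


def range_request(endpoint, key):
    """Illustrative: read a key via the etcd gRPC-gateway JSON API."""
    body = json.dumps({'key': base64.b64encode(key.encode()).decode()}).encode()
    req = urllib.request.Request(endpoint + '/v3/kv/range', data=body,
                                 headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Example: range_request('http://127.0.0.1:2379', '/service/batman/leader')
```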

Unfortunately, watches via gRPC-gateway require us to open and keep a second connection to etcd.

Known limitations:
* The minimum supported version is 3.0.4. On earlier versions transactions don't work due to bugs in grpc-gateway, and without transactions we can't do atomic operations, i.e. leader locks.
* Watches work only starting from 3.1.0
* Authentication works only starting from 3.3.0
* gRPC-gateway does not support authentication using TLS Common Name. This is because gRPC-proxy terminates TLS from its client so all the clients share a cert of the proxy: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/authentication.md#using-tls-common-name
2020-07-31 14:33:40 +02:00