Commit Graph

177 Commits

Author SHA1 Message Date
Matt Baker
d2603402ea Debian docker image pip error (#2849)
* Use virtualenv to install tox in behave Dockerfile

An upstream change in the postgres docker image enforces Debian's restriction on
installing non-Debian Python packages system-wide. Debian doesn't
provide tox>=4, so we need to install it with pip.

* Exclude all output directories generated using `tox-wrapper.sh`

The `tox-wrapper.sh` script generated by `features/Dockerfile` creates
directories like features/output-tox-pg14-docker-behave-etcd-lin-973719674/

* Reduce footprint of tox behave docker image
2023-09-04 21:24:26 +02:00
Alexander Kukushkin
7e89583ec7 Please new flake8 (#2789)
it stopped liking the lack of a space character between `,` and `\`
```python
foo,\
    bar
```
2023-07-31 09:08:46 +02:00
Alexander Kukushkin
06db296612 Fixes in patroni.request (#2768)
1. Take client certificates only from the `ctl` section. Motivation: sometimes there are server-only certificates that can't be used as client certificates. As a result, neither Patroni nor patronictl works correctly, even if the `--insecure` option is used.
2. Document that if `restapi.verify_client` is set to `required`, then client certificates **must** be provided in the `ctl` section.
3. Add support for `ctl.authentication` and prefer it over `restapi.authentication`.
4. Silence the annoying InsecureRequestWarning when `patronictl -k` is used, so that the behavior becomes similar to `curl -k` (see the sketch below).
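
A minimal sketch of how such a warning could be silenced with urllib3 (shown for illustration only; where exactly Patroni does this is not part of this message):
```python
import urllib3
from urllib3.exceptions import InsecureRequestWarning

# Suppress only the warning about unverified HTTPS requests, e.g. when -k/--insecure is used.
urllib3.disable_warnings(InsecureRequestWarning)
```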
2023-07-25 08:48:18 +02:00
Alexander Kukushkin
0a8fb0860e Skip flaky scenario when running with Raft (#2771)
Sometimes Patroni doesn't see the latest Raft data on start.
2023-07-21 16:09:34 +02:00
Alexander Kukushkin
d46ca88e6b Make it visible replication state on standbys (#2733)
To do that we use the `pg_stat_get_wal_receiver()` function, which is available since 9.6. For older versions the `patronictl list` output and REST API responses remain as before.

If there is no WAL receiver process, we check whether `restore_command` is set and show the state as `in archive recovery`.
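
A rough sketch of how the state could be derived, assuming a psycopg2 connection to the standby (an illustration, not Patroni's actual code):
```python
def replication_state(conn, restore_command_set):
    """Return 'streaming', 'in archive recovery', or None for a standby."""
    with conn.cursor() as cur:
        # pg_stat_get_wal_receiver() exists since 9.6; status is NULL when no wal receiver runs
        cur.execute("SELECT status FROM pg_stat_get_wal_receiver()")
        row = cur.fetchone()
    if row and row[0]:
        return row[0]  # e.g. 'streaming'
    return 'in archive recovery' if restore_command_set else None
```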

Example of `patronictl list` output:
```bash
$ patronictl list
+ Cluster: batman -------------+---------+---------------------+----+-----------+
| Member      | Host           | Role    | State               | TL | Lag in MB |
+-------------+----------------+---------+---------------------+----+-----------+
| postgresql0 | 127.0.0.1:5432 | Leader  | running             | 12 |           |
| postgresql1 | 127.0.0.1:5433 | Replica | in archive recovery | 12 |         0 |
+-------------+----------------+---------+---------------------+----+-----------+

$ patronictl list
+ Cluster: batman -------------+---------+-----------+----+-----------+
| Member      | Host           | Role    | State     | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgresql0 | 127.0.0.1:5432 | Leader  | running   | 12 |           |
| postgresql1 | 127.0.0.1:5433 | Replica | streaming | 12 |         0 |
+-------------+----------------+---------+-----------+----+-----------+
```

Example of REST API response:
```bash
$ curl -s localhost:8009 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-07-06 13:12:00.595118+02:00",
  "role": "replica",
  "server_version": 150003,
  "xlog": {
    "received_location": 335544480,
    "replayed_location": 335544480,
    "replayed_timestamp": null,
    "paused": false
  },
  "timeline": 12,
  "replication_state": "in archive recovery",
  "dcs_last_seen": 1688642069,
  "database_system_identifier": "7252327498286490579",
  "patroni": {
    "version": "3.0.3",
    "scope": "batman"
  }
}

$ curl -s localhost:8009 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-07-06 13:12:00.595118+02:00",
  "role": "replica",
  "server_version": 150003,
  "xlog": {
    "received_location": 335544816,
    "replayed_location": 335544816,
    "replayed_timestamp": null,
    "paused": false
  },
  "timeline": 12,
  "replication_state": "streaming",
  "dcs_last_seen": 1688642089,
  "database_system_identifier": "7252327498286490579",
  "patroni": {
    "version": "3.0.3",
    "scope": "batman"
  }
}
```
2023-07-13 09:24:20 +02:00
Alexander Kukushkin
6e96db173f Start postgres not in recovery in some cases (#2726)
If we know for sure that a few moments ago postgres was still running as a primary, and we still hold the leader lock and can successfully update it, then we can safely start postgres back not in recovery. That avoids bumping the timeline without a reason and hopefully improves reliability, because it addresses issues similar to #2720.
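
A heavily simplified sketch of that decision (names are assumptions for illustration, not Patroni's actual API):
```python
def can_start_as_primary(was_primary_moments_ago, has_leader_lock, update_leader_lock):
    # Only skip recovery if postgres was a primary just now, we still hold the leader lock,
    # and the lock can still be successfully updated in DCS.
    return was_primary_moments_ago and has_leader_lock and update_leader_lock()
```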

In addition to that, remove the `if self.state_handler.is_starting()` check from the `recover()` method. This branch could never be reached because the `starting` state is handled earlier in `_run_cycle()`. Besides that, remove the redundant `self._crash_recovery_executed`.

P.S. For now we do not cover the case when Patroni was killed along with Postgres.
Let's consider that we just started Patroni, there is no leader, and `pg_controldata` reports `Database cluster state` as `shut down`. It feels logical to use `Latest checkpoint location` and `Latest checkpoint's TimeLineID` to do a usual leader race and start directly as a primary, but it could be totally wrong. The thing is that we run `postgres --single` before executing `pg_rewind` if the standby wasn't shut down cleanly. As a result the `Database cluster state` transitions from `in archive recovery` to `shut down`, but if such a node becomes a leader the timeline must be increased.
2023-07-12 09:42:34 +02:00
Alexander Kukushkin
b8cff3515a Reduce flakiness of citus behave tests, take 2 (#2742)
Reorder some checks and verify that the old primary is already in the `running` state before checking replication. This check eliminates the race condition where replication has started to work but the node name is removed from `synchronous_standby_names` because its state isn't `running` yet.
2023-07-11 15:04:10 +02:00
Mark Pekala
412c51ddf1 Prevent splitbrain from duplicate names in configuration (#2724)
When starting, check whether a node with the same name is registered in DCS and try to query its REST API.
If the REST API is accessible, exit with an error.
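
A schematic sketch of such a startup guard (helper names are assumptions, not the actual implementation):
```python
def ensure_unique_name(cluster, my_name, rest_api_get):
    """Exit if a member with the same name is registered in DCS and its REST API answers."""
    member = next((m for m in cluster.members if m.name == my_name), None)
    if member is None:
        return
    try:
        if rest_api_get(member.api_url).status == 200:
            raise SystemExit(f'Member {my_name} is already running; refusing to start')
    except OSError:
        pass  # the registered member is not reachable, so it is safe to proceed
```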

Close #1804
2023-07-11 07:43:57 +02:00
Feike Steenbergen
4725f12f9a Allow integer gucs without units in validation (#2734)
Previously, integer GUCs that have no unit, for example `max_connections`, would not pass validation if they were specified as a string.

This causes problems if `max_connections` is configured in `patroni.yaml` as a string; for example, the following configuration would not result in the right `max_connections` setting:

    bootstrap:
      dcs:
        postgresql:
          parameters:
            log_checkpoints: "on"
            log_connections: "off"
            max_connections: "57"

Allowing a user to specify *all* parameters as strings was accepted by Patroni before and also seems very useful, as many of us use Ansible/Helm/Golang to build the Patroni configuration, where creating a `map[string]string` is easier than having to deal with data types.
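
A minimal sketch of the kind of check involved, assuming a hypothetical validator function (not the project's exact code):
```python
def validate_integer_guc(value, min_val, max_val):
    """Accept an integer GUC without a unit whether it arrives as an int or a numeric string."""
    try:
        int_value = int(value)  # both 57 and "57" are accepted
    except (TypeError, ValueError):
        return False
    return min_val <= int_value <= max_val

assert validate_integer_guc("57", 1, 262143)          # max_connections given as a string
assert not validate_integer_guc("57min", 1, 262143)   # a unit is not allowed here
```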

Attempts to address issue #2735

Regression was introduced in 76b3b99de2
2023-07-10 13:44:54 +02:00
Alexander Kukushkin
1c36112b44 Reduce flakiness of citus behave tests (#2728)
* Reduce flakiness of citus behave tests

- make a few attempts with a timeout when checking registered nodes
- get rid of the artificial sleep
- allow the check_registration() function to check secondaries

These changes are useful for Quorum based failover (#2668) and future PR
that enhances Citus support by registering secondaries in `pg_dist_node`.
2023-07-07 15:23:04 +03:00
Alexander Kukushkin
af318b2473 Fix kubernetes behave tests (#2707)
Starting from 1.27 there is a containerd process that also uses the k3s binary and is detected by pidof. Therefore we will search for the "k3s server" string in the process list instead of just "k3s".
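
A small sketch of that kind of check (an illustration only; the behave environment code may differ):
```python
import subprocess

def k3s_server_running():
    # Look for the full "k3s server" command line instead of any process that uses the
    # k3s binary, since containerd also runs from the k3s binary starting with 1.27.
    out = subprocess.run(['ps', '-ax', '-o', 'command'],
                         capture_output=True, text=True).stdout
    return any('k3s server' in line for line in out.splitlines())
```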
2023-06-01 13:28:29 +02:00
Matt Baker
2158f4a87b Add base image build arg for alt postgres (#2695)
Allows running behave tests with an alternative base image
instead of the official postgres image.

Also provides PG_USER/PG_GROUP should they differ from the
default `postgres`.
2023-05-26 09:35:12 +02:00
Matt Baker
73797e8572 Add tox configuration for running multiple test envs (#2603) 2023-05-24 10:58:04 +02:00
Polina Bungina
ab9fea7d6b Fix openssl certificate generation in behave tests (#2672)
--addext -> -addext (the double-dash form doesn't work on macOS)
set key file permissions to 600 (to avoid "private key file has group or world access")
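
Roughly what the corrected generation looks like, sketched with Python's subprocess (file names and most flags here are illustrative assumptions):
```python
import os
import subprocess

# The single-dash -addext form is the one that works; --addext fails on macOS.
subprocess.check_call(['openssl', 'req', '-x509', '-newkey', 'rsa:2048', '-nodes',
                       '-keyout', 'key.pem', '-out', 'cert.pem',
                       '-subj', '/CN=test', '-addext', 'subjectAltName=IP:127.0.0.1'])
os.chmod('key.pem', 0o600)  # avoid "private key file has group or world access"
```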
2023-05-12 10:42:53 +02:00
Alexander Kukushkin
4d35f85b87 Fix behave tests (#2656)
1. specify `subjectAltName=IP:127.0.0.1` when generating certificate
2. run more behave tests with psycopg2
2023-04-27 12:18:44 +02:00
Alexander Kukushkin
24af774adb Attempt to reduce behave flakiness on MacOS (#2645)
Sometimes MacOS workers are so slow that a Postgres shutdown might take more than 30-40s, which breaks the test that reinits a replica in parallel with the primary restart, because basebackup() makes only two attempts and in a pause the replica remains running with an empty PGDATA.

In addition to that increase timeouts in ignore_slots test.
Close #2637
2023-04-13 12:21:08 +02:00
Polina Bungina
3fe2a7868a Ignore D401 in flake8-docstrings (#2627)
* Ignore D401 in flake8-docstrings
* Fix newly reported flake8 issues, ignore the old W503 rule
* rely on concatenation of adjacent strings
* Format behave scripts
* Reformat ha.py according to new rules

Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
2023-04-03 09:52:22 +02:00
Alexander Kukushkin
c1bfb0e6d6 Remove python 2.7 support (#2571)
- get rid of 2.7-specific modules: `six`, `ipaddress`
- use Python3 unpacking operator
- use `shutil.which()` instead of `find_executable()`
2023-03-13 17:00:04 +01:00
Alexander Kukushkin
95ba8b9e59 Fix bug with metadata after coordinator failover (#2597)
We made an incorrect assumption that `citus_set_coordinator_host()` would trigger a `pg_dist_node` sync. Instead we should also use `citus_update_node()` and call `citus_set_coordinator_host()` only during bootstrap.

Adjust behave tests to verify that coordinator failover is visible on workers.
2023-03-13 13:30:39 +01:00
Polina Bungina
b85f155dbe Pass 'master' role to a callback script instead of 'promoted' (#2554)
Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
2023-02-08 14:09:51 +01:00
Alexander Kukushkin
7869f5e211 Release 3.0.0 (#2545)
* bump version
* update release notes
* removed 2.7, 3.4, 3.5, and 3.6 from supported versions in setup.py
* switched GH actions back to ubuntu-latest, removed tests with 2.7 and 3.6, and added 3.11
* some little fixes in Citus documentation and behave tests
2023-01-30 10:29:08 +01:00
Alexander Kukushkin
4c3af2d1a0 Change master->primary/leader/member (#2541)
Keep as much backward compatibility as possible.

The following changes were made:
1. All internal checks are performed as `role in ('master', 'primary')`.
2. All internal variables/functions/methods are renamed.
3. The `GET /metrics` endpoint returns `patroni_primary` in addition to `patroni_master`.
4. Logs are changed to use leader/primary/member/remote depending on the context.
5. Unit tests use only role = 'primary' instead of 'master' to verify that item 1 works.
6. patronictl still supports the old syntax, but also accepts `--leader` and `--primary`.
7. `master_(start|stop)_timeout` is automatically translated to `primary_(start|stop)_timeout` if the latter is not set (see the sketch after this list).
8. Updated the documentation and some examples.
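
A tiny sketch of the translation in item 7 (illustrative, with assumed names):
```python
def effective_timeout(config, action):
    # action is 'start' or 'stop'; prefer the new name, fall back to the legacy one
    value = config.get(f'primary_{action}_timeout')
    return value if value is not None else config.get(f'master_{action}_timeout')
```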

Future plan: in the next major release switch the role name from `master` to `primary` and maybe drop `master` altogether.
The Kubernetes implementation will require more work and will keep two labels in parallel. Label values should probably be configurable as described in https://github.com/zalando/patroni/issues/2495.
2023-01-27 07:40:24 +01:00
Alexander Kukushkin
79458688d1 Check unexpected exceptions in Patroni logs after behave (#2538)
and make behave fail if anything unexpected is found.

In addition to that, fix the globbing rule when uploading artifacts with logs.
2023-01-25 11:02:52 +01:00
Alexander Kukushkin
4872ac51e0 Citus integration (#2504)
Citus cluster (coordinator and workers) will be stored in DCS as a fleet of Patroni logically grouped together:
```
/service/batman/
/service/batman/0/
/service/batman/0/initialize
/service/batman/0/leader
/service/batman/0/members/
/service/batman/0/members/m1
/service/batman/0/members/m2
/service/batman/
/service/batman/1/
/service/batman/1/initialize
/service/batman/1/leader
/service/batman/1/members/
/service/batman/1/members/m1
/service/batman/1/members/m2
...
```

Where 0 is the Citus group for the coordinator and 1, 2, etc. are worker groups.

Such a hierarchy allows reading the entire Citus cluster with a single call to DCS (except for Zookeeper).

The get_cluster() method will read the entire Citus cluster on the coordinator because it needs to discover workers. For a worker cluster it will read only the subtree of its own group.

Besides that we introduce a new method, get_citus_coordinator(). It will be used only by worker clusters.

Since there are no hierarchical structures on K8s, we will use the Citus group suffix on all objects that Patroni creates.
E.g.
```
batman-0-leader  # the leader config map for the coordinator
batman-0-config  # the config map holding initialize, config, and history "keys"
...
batman-1-leader  # the leader config map for worker group 1
batman-1-config
...
```

Citus integration is enabled from patroni.yaml:
```yaml
citus:
  database: citus
  group: 0  # 0 is for coordinator, 1, 2, etc are for workers
```

If enabled, Patroni will create the database and the citus extension in it, and INSERT INTO `pg_dist_authinfo` the information required for Citus nodes to communicate with each other, i.e. 'password', 'sslcert', 'sslkey' for the superuser if they are defined in the Patroni configuration file.

When the new Citus coordinator/worker is bootstrapped, Patroni adds `synchronous_mode: on` to the `bootstrap.dcs` section.

Besides that, Patroni takes over management of some Postgres GUCs (a rough sketch follows this list):
- `shared_preload_libraries` - Patroni ensures that "citus" is added in the first position
- `max_prepared_transactions` - if not set or set to 0, Patroni changes the value to `max_connections*2`
- `wal_level` - automatically set to `logical`. It is used by Citus to move/split shards. Under the hood Citus creates/removes replication slots, and they are automatically added by Patroni to the `ignore_slots` configuration to avoid accidental removal.
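
A rough sketch of those adjustments (an illustration under assumed parameter handling, not Patroni's code):
```python
def adjust_citus_gucs(params):
    # "citus" has to be the first entry in shared_preload_libraries
    libs = [v.strip() for v in params.get('shared_preload_libraries', '').split(',') if v.strip()]
    params['shared_preload_libraries'] = ','.join(['citus'] + [v for v in libs if v != 'citus'])

    # Citus needs prepared transactions; default to max_connections * 2 if unset or 0
    if not int(params.get('max_prepared_transactions') or 0):
        params['max_prepared_transactions'] = int(params['max_connections']) * 2

    params['wal_level'] = 'logical'  # required by Citus to move/split shards
    return params
```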

The coordinator primary actively discovers worker primary nodes and registers/updates them in the `pg_dist_node` table using the
citus_add_node() and citus_update_node() functions.

Patroni running on the coordinator provides a new REST API endpoint: `POST /citus`. It is used by workers to facilitate controlled switchovers and restarts of worker primaries.
When a worker primary needs to shut down Postgres because of a restart or switchover, it calls the `POST /citus` endpoint on the coordinator, and Patroni on the coordinator starts a transaction and calls `citus_update_node(nodeid, 'host-demoted', port)` in order to pause client connections that work with the given worker.
Once the new leader is elected or postgres is started back, another call to the `POST /citus` endpoint is made, which does another `citus_update_node()` call with the actual hostname and port and commits the transaction. After the transaction is committed, the coordinator reestablishes connections to the worker node and client connections are unblocked.
If clients don't run long transactions, the operation finishes without client-visible errors, only a short latency spike.
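
A schematic illustration of the coordinator side of that flow, assuming an autocommit superuser connection whose cursor is kept open between the two calls (not the actual code):
```python
def pause_worker(cur, nodeid, host, port):
    # Start a transaction and point the metadata at a non-existing host name:
    # client connections that use this worker are paused until the transaction commits.
    cur.execute('BEGIN')
    cur.execute('SELECT citus_update_node(%s, %s, %s)', (nodeid, host + '-demoted', port))

def resume_worker(cur, nodeid, host, port):
    # Called again after the switchover/restart with the actual host and port.
    cur.execute('SELECT citus_update_node(%s, %s, %s)', (nodeid, host, port))
    cur.execute('COMMIT')  # connections are re-established and clients are unblocked
```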

All operations on `pg_dist_node` are serialized by Patroni on the coordinator. This gives more control and makes it possible to ROLLBACK a transaction in progress if its lifetime exceeds a certain threshold and there are other worker nodes that should be updated.
2023-01-24 16:14:58 +01:00
Alexander Kukushkin
40d16443f9 Fixes and improvements in failsafe (#2532)
1. Fix a problem with logical slots not advancing when only the primary lost access to DCS.
2. Don't let Patroni join as a Raft voting member when running failsafe behave tests. This allows testing exactly the same conditions as for the other DCS.
3. Speed up dcs_failsafe_mode behave tests by getting rid of long sleeps, slightly reshuffling the places where we start/stop the outage, and by killing Patroni/Postgres to avoid a long shutdown due to leader key removal attempts.
2023-01-24 14:07:31 +01:00
Alexander Kukushkin
2ea0357854 DCS failsafe mode (#2379)
If enabled, it will allow Patroni to cope with DCS outages.
In case of a DCS outage the leader tries to call all remaining members in the cluster via the REST API, and if all of them respond with success the leader will not be demoted.
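
A schematic sketch of the check performed by the leader during an outage (helper names are assumptions):
```python
def keep_leader_during_dcs_outage(members, post_failsafe):
    # The leader keeps the lock only if every remaining member acknowledges it via the REST API.
    return all(post_failsafe(member) for member in members)
```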

The failsafe_mode could be enabled by running
```sh
patronictl edit-config -s failsafe_mode=true
```

or by calling the `/config` REST API endpoint.

Co-authored-by: Polina Bungina <bungina@gmail.com>
2023-01-13 13:35:05 +01:00
Alexander Kukushkin
5bbb5dceeb Improve /(a)sync checks in behave tests (#2521)
They are frequently failing because sometimes replicas are a bit slow to realize that they are synchronous. Instead of introducing more sleeps we will poll for the required HTTP status code with some timeout.
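
The polling pattern is roughly the following (a generic sketch, not the test suite's exact helper):
```python
import time

def poll_for_status(get_status, expected_code, timeout=10, interval=1):
    """Poll until the REST API returns the expected HTTP status code or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_status() == expected_code:
            return True
        time.sleep(interval)
    return False
```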
2023-01-12 08:23:59 +01:00
Michael Banck
e3e4ad0ada Start etcd with V2 API enabled for V2 etcd acceptance tests (#2509)
Otherwise, the etcd (not etcd3) behave tests fail to connect:
```
Jan 02 09:56:18 HOOK-ERROR in before_all: AssertionError: etcd instance is not available for queries after 5 seconds
```
2023-01-03 15:39:30 +01:00
Alexander Kukushkin
49f1ccf874 Enable SSL in REST API and Postgres if possible when running behave (#2498)
If the openssl binary is available, use it to generate a self-signed certificate and use it to protect the Patroni REST API
(`verify_client: required`).

If Postgres is compiled with SSL support, enable it in the configuration and configure pg_hba.conf to check client certificates (`verify-ca`) in addition to passwords. Also configure the superuser/replication/rewind users to use client certificates and to verify the server certificate (`verify-ca`).
2022-12-21 10:20:30 +01:00
Alexander Kukushkin
4d77b444dc Enforce search_path=pg_catalog for non-replication connections (#2496)
There is a known [vector of attack](https://pganalyze.com/blog/5mins-postgres-security-patch-releases-pgspot-pghostile): creating functions and/or operators in the public schema with the same name and signature as corresponding objects in `pg_catalog`.

Since Patroni relies heavily on superuser connections, we want to mitigate it by enforcing `search_path=pg_catalog` for all connections created by Patroni (except replication connections). This is achieved by introducing a new function that wraps psycopg.connect() and appends ` -c search_path=pg_catalog` to the `options` parameter.

In addition to that, we set connection.autocommit to True before returning it.
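
A simplified sketch of such a wrapper (an illustration of the idea, not the exact function added to Patroni):
```python
import psycopg2

def connect(**kwargs):
    # Force search_path=pg_catalog so unqualified names cannot resolve to malicious
    # objects in the public schema; replication connections bypass this wrapper.
    options = kwargs.pop('options', '') or ''
    kwargs['options'] = (options + ' -c search_path=pg_catalog').strip()
    conn = psycopg2.connect(**kwargs)
    conn.autocommit = True
    return conn
```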
2022-12-20 09:56:14 +01:00
Matt Baker
e5027c7a13 Ensure watchdog configuration matches bootstrap.dcs config and log changes (#2480)
Fix the issue of Patroni configuring the watchdog with defaults when bootstrapping a new cluster rather than taking the configuration used to bootstrap the DCS.
Also log changes to the watchdog configuration based on the calculated timeout value.

Close #2470
2022-12-13 16:59:23 +01:00
Alexander Kukushkin
c7a925a238 Switch from localkube to kind and/or k3d (#2465)
The only advantage of localkube was its light weight. Everything else only created problems:
1. It has not been properly maintained for many years.
2. It effectively worked only on Linux, and stopped working on modern versions due to changes in iptables.

Instead, we will use widely adopted tools like kind or k3s. The "kind-kind" is the default K8s context (see ~/.kube/config), but it can be overridden using the `PATRONI_KUBERNETES_CONTEXT` environment variable. When executed from GH actions the context is set to k3d-k3s-default, because K3s is much faster to start.
2022-12-06 13:15:56 +01:00
Alexander Kukushkin
580530b30f Behave tests on Windows (#2432)
Windows doesn't support `SIGTERM`, but our behave tests in the majority of cases rely on Patroni's graceful shutdown.
In order to emulate that behaviour we introduced a new REST API endpoint, `POST /sigterm`. The endpoint works only on Windows and only when the `BEHAVE_DEBUG` environment variable is set.
Besides that, some minor adjustments in behave tests were done, mainly related to backslash/slash handling.

In addition to that, improve test coverage on Windows by properly mocking access to the filesystem and avoiding calls to
`subprocess.call()`. Specifically, symlink creation on Windows requires Admin privileges and there is no `true.exe`.
2022-10-21 12:24:24 +02:00
Alexander Kukushkin
ead798d9ac Speed up behave tests by always using loop_wait=2 (#2361)
run time is reduced from ~5m30s to ~5m
2022-07-18 15:23:55 +02:00
Alexander Kukushkin
4c5cce5efd Automatically skip some behave tests on legacy Postgres (#2358)
previously behave had to be started with `--tags=-skip` argument.
2022-07-13 12:13:36 +02:00
Alexander Kukushkin
729f1dddc8 Compatibility with PostgreSQL 15 beta1 (#2299)
* update postgresql/validator.py
* pg_rewind doesn't like if there are unix sockets in PGDATA
* pg_rewind now supports --config-file option
2022-05-19 15:36:09 +02:00
Alexander Kukushkin
d3e3b4e16f Minor tuning of tests (#2201)
- Reduce verbosity for unit tests
- Refactor GH actions config and try again macos behave tests
2022-02-10 15:38:16 +01:00
Alexander Kukushkin
4215565cb4 Rearrange tests (#2146)
- remove codacy steps: they removed legacy organizations and there seems to be no easy way of installing the codacy app into the Zalando GH organization.
- Don't run behave on MacOS: recently the worker became way too slow.
- Disable behave for the combination of kubernetes and python 2.7.
- Remove python 3.5 (it will be removed by GH from workers in January) and add 3.10.
- Run behave with 3.6 and 3.9 instead of 3.5 and 3.8.
2021-12-21 09:36:22 +01:00
Alexander Kukushkin
d24051c31c Optimize case when we don't have permanent logical slots (#2121)
The unnecessary call to SlotsHandler.process_permanent_slots() results in one additional query to the `pg_replication_slots` view every HA loop.
2021-11-30 14:20:55 +01:00
Alexander Kukushkin
fce889cd04 Compatibility with psycopg 3.0 (#2088)
By default `psycopg2` is preferred. The `psycopg>=3.0` will be used only if `psycopg2` is not available or its version is too old.
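
The preference can be pictured as a simple import fallback (a sketch of the described behaviour, not the module's exact code):
```python
try:
    import psycopg2 as psycopg   # preferred, if installed and recent enough
except ImportError:
    import psycopg               # otherwise fall back to psycopg 3
```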
2021-11-19 14:32:54 +01:00
Christian Clauss
75e52226a8 Fix typos discovered by codespell (#1997) 2021-07-06 10:01:30 +02:00
Alexander Kukushkin
f3420e2db5 Compatibility with PostgreSQL 14 (#1926)
PostgreSQL 14 changed the behavior of replicas when certain parameters (for example `max_connections`) are increased: https://github.com/postgres/postgres/commit/15251c0a.
Instead of immediately exiting, Postgres 14 pauses replication and waits for actions from the operator.

Since `pg_is_wal_replay_paused()` returning `True` is the only indicator of such a change, Patroni on the replica will call `pg_wal_replay_resume()`, which causes it to either continue replication or shut down (like previously).

So far Patroni has never called `pg_wal_replay_resume()` on its own; therefore, to remain backward compatible it will call it only for PostgreSQL 14+.
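
Sketched out, the version-gated call looks roughly like this (assuming a cursor on the replica; an illustration only):
```python
def resume_wal_replay_if_needed(cur, server_version_num):
    if server_version_num < 140000:
        return  # keep the old behaviour for PostgreSQL 13 and below
    cur.execute('SELECT pg_is_wal_replay_paused()')
    if cur.fetchone()[0]:
        # On PostgreSQL 14+ this either resumes replication or makes postgres shut down,
        # matching the pre-14 behaviour after a parameter increase on the primary.
        cur.execute('SELECT pg_wal_replay_resume()')
```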
2021-06-25 13:41:45 +02:00
Alexander Kukushkin
99626a07f2 Fix issues with raft traffic encryption (#1919)
and run raft behave tests with encryption enabled.

Using the new `pysyncobj` release allowed us to get rid of a lot of hacks with accessing private properties and methods of the parent class and reduce the size of the `raft.py`.

Close https://github.com/zalando/patroni/issues/1746
2021-04-30 11:28:41 +02:00
melrifa
6d6b504cb8 Add support for patroni replication user socket connection (#1865)
Close #1866
2021-04-20 09:43:05 +02:00
Alexander Kukushkin
c7173aadd7 Failover logical slots (#1820)
Effectively, this PR consists of a few changes:

1. The easy part:
  If permanent logical slots are defined in the global configuration, Patroni on the primary will not only create them, but also periodically update DCS with the current values of `confirmed_flush_lsn` for all these slots.
  In order to reduce the number of interactions with DCS the new `/status` key was introduced. It will contain the json object with `optime` and `slots` keys. For backward compatibility the `/optime/leader` will be updated if there are members with old Patroni in the cluster.

2. The tricky part:
  On replicas that are eligible for a failover, Patroni creates the logical replication slot by copying the slot file from the primary and restarting the replica. In order to copy the slot file Patroni opens a connection to the primary with `rewind` or `superuser` credentials and calls `pg_read_binary_file()`  function.
  When the logical slot already exists on the replica Patroni periodically calls `pg_replication_slot_advance()` function, which allows moving the slot forward.

3. Additional requirements:
  In order to ensure that primary doesn't cleanup tuples from pg_catalog that are required for logical decoding, Patroni enables `hot_standby_feedback` on replicas with logical slots and on cascading replicas if they are used for streaming by replicas with logical slots.

4. When logical slots are copied to the replica there is a timeframe when it may not be safe to use them after promotion. Right now there is no protection from promoting such a replica, but Patroni will show a warning with the names of the slots that might not be safe to use.

Compatibility.
The `pg_replication_slot_advance()` function is only available starting from PostgreSQL 11. For older Postgres versions Patroni will refuse to create the logical slot on the primary.
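
The version gate can be sketched like this (assumed names; not the actual implementation):
```python
def advance_logical_slot(cur, server_version_num, slot_name, confirmed_flush_lsn):
    if server_version_num < 110000:
        return False  # pg_replication_slot_advance() only exists since PostgreSQL 11
    cur.execute('SELECT pg_replication_slot_advance(%s, %s)',
                (slot_name, confirmed_flush_lsn))
    return True
```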

The old "permanent slots" feature, which creates logical slots right after promotion and before allowing connections, was removed.

Close: https://github.com/zalando/patroni/issues/1749
2021-03-25 16:18:23 +01:00
krishna
b3dc765e6d Choose synchronous nodes based on replication lag (#1786)
This commit makes it possible to configure the maximum lag (`maximum_lag_on_syncnode`) after which Patroni will "demote" the node from synchronous and replace it with another node.

The previous implementation always tried to stick to the same synchronous nodes (even if they are not optimal ones).
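
The selection rule can be sketched as follows (a simplified illustration with assumed attribute names):
```python
def sync_candidates(replicas, leader_lsn, maximum_lag_on_syncnode):
    # Assumed convention: a negative value disables the check;
    # otherwise nodes lagging too far behind are skipped as sync candidates.
    if maximum_lag_on_syncnode < 0:
        return replicas
    return [r for r in replicas if leader_lsn - r.replayed_lsn <= maximum_lag_on_syncnode]
```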
2021-02-02 15:45:02 +01:00
Alexander Kukushkin
89a15a2df4 Fix small issues with ignore-slots feature (#1797)
When there is no config key in DCS Patroni shouldn't try accessing ignore_slots, otherwise an exception is raised.

In addition to that implement missing unit-tests and fix linting issues in behave tests.
2020-12-16 18:10:12 +01:00
Alexander Kukushkin
e3ef9ac306 Fix issues with zookeeper (#1792)
1. The `ttl` was incorrectly returned 1000 times higher than it should be.
2. The `watch()` method must return True if the parent method returned True. Not doing so resulted in an incorrect calculation of the sleep time.
3. Move the mock of the exhibitor API to features/environment.py. It simplifies testing with behave.
2020-12-14 15:12:57 +01:00
Alexander Kukushkin
1530ed0b9c Switch to GH actions (#1778)
it allows up to 20 parallel builds
2020-12-04 21:52:34 +01:00
James Coleman
d7f579ee61 Feature: ability to ignore externally managed replication slots (#1742)
There are sometimes good reasons to manage replication slots externally
to Patroni. For example, a consumer may wish to manage its own slots (so
that it can more easily track when a failover has occurred and whether
it is ahead of or behind the WAL position on the new primary).
Additionally, tooling like pglogical actually replicates slots to all
replicas so that the current position can be maintained on failover
targets (this also aids consumers by supplying primitives so that they
can verify data hasn't been lost or a split brain hasn't occurred relative
to the physical cluster).

To support these use cases this new feature allows configuring Patroni
to entirely ignore sets of slots specified by any subset of name,
database, slot type, and plugin.
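
A sketch of how such matching could work (attribute names here are illustrative assumptions):
```python
def is_ignored(slot, ignore_rules):
    # A slot is ignored if every attribute given in some rule matches the slot;
    # attributes omitted from the rule act as wildcards.
    return any(all(rule[key] == slot.get(key)
                   for key in ('name', 'database', 'type', 'plugin') if key in rule)
               for rule in ignore_rules)
```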
2020-11-24 11:45:14 +01:00