patroni/features/standby_cluster.feature
Alexander Kukushkin c7173aadd7 Failover logical slots (#1820)
Effectively, this PR consists of a few changes:

1. The easy part:
  If permanent logical slots are defined in the global configuration, Patroni on the primary will not only create them, but also periodically update DCS with the current values of `confirmed_flush_lsn` for all these slots.
  In order to reduce the number of interactions with DCS, the new `/status` key was introduced. It contains a JSON object with `optime` and `slots` keys. For backward compatibility, `/optime/leader` will still be updated if there are members running an old Patroni version in the cluster.

2. The tricky part:
  On replicas that are eligible for a failover, Patroni creates the logical replication slot by copying the slot file from the primary and restarting the replica. In order to copy the slot file, Patroni opens a connection to the primary with `rewind` or `superuser` credentials and calls the `pg_read_binary_file()` function.
  When the logical slot already exists on the replica, Patroni periodically calls the `pg_replication_slot_advance()` function, which allows moving the slot forward.

3. Additional requirements:
  In order to ensure that the primary doesn't clean up tuples from pg_catalog that are required for logical decoding, Patroni enables `hot_standby_feedback` on replicas with logical slots, and on cascading replicas if they are used for streaming by replicas with logical slots.

4. When logical slots are copied from the primary to the replica, there is a timeframe during which it might not be safe to use them after promotion. Right now there is no protection against promoting such a replica, but Patroni will show a warning with the names of the slots that might not be safe to use.
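
As a rough illustration of items 1 and 2, the sketch below builds a `/status`-style JSON payload and decides which slots a replica would need to move forward. It is not Patroni's code: the function names (`build_status_payload`, `parse_lsn`, `slots_to_advance`) are hypothetical, and storing positions as integers inside the payload is an assumption. Only the `optime` and `slots` keys come from the description above.

```python
import json


def parse_lsn(lsn):
    """Convert a PostgreSQL LSN in text form ('hi/lo', both hex) to an integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def build_status_payload(optime, slots):
    """Serialize the leader position and per-slot confirmed_flush_lsn values.

    Assumption: positions are stored as integers; the real layout may differ.
    """
    return json.dumps({"optime": optime, "slots": slots}, sort_keys=True)


def slots_to_advance(local_slots, leader_slots):
    """Return {slot_name: target} for slots that exist locally but lag behind
    the value published by the leader."""
    return {
        name: target
        for name, target in leader_slots.items()
        if name in local_slots and local_slots[name] < target
    }


# Example: the leader publishes its state, a replica compares it with its own.
payload = build_status_payload(
    optime=parse_lsn("0/3000060"),
    slots={"test_logical": parse_lsn("0/3000028")},
)
published = json.loads(payload)["slots"]
pending = slots_to_advance({"test_logical": parse_lsn("0/2000000")}, published)
```

In this sketch a slot that does not yet exist on the replica is ignored by `slots_to_advance`, since per item 2 it first has to be created by copying the slot file from the primary.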

Compatibility.
The `pg_replication_slot_advance()` function is only available starting from PostgreSQL 11. For older Postgres versions Patroni will refuse to create the logical slot on the primary.
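
The version requirement can be pictured as a simple gate in front of the advance call. The following is a minimal sketch under stated assumptions: `advance_statement` and `format_lsn` are hypothetical helpers, and the `%s` placeholders assume a psycopg-style driver. `pg_replication_slot_advance(slot_name, upto_lsn)` and the numeric `server_version_num` setting are standard PostgreSQL.

```python
# PostgreSQL 11.0 as reported by the server_version_num setting.
MIN_VERSION_FOR_SLOT_ADVANCE = 110000


def format_lsn(lsn):
    """Render an integer WAL position in PostgreSQL's 'hi/lo' hex notation."""
    return "{0:X}/{1:X}".format(lsn >> 32, lsn & 0xFFFFFFFF)


def advance_statement(server_version_num, slot_name, target_lsn):
    """Return (sql, params) for moving a slot forward, or None when the
    server is too old to provide pg_replication_slot_advance()."""
    if server_version_num < MIN_VERSION_FOR_SLOT_ADVANCE:
        return None
    sql = "SELECT pg_catalog.pg_replication_slot_advance(%s, %s)"
    return sql, (slot_name, format_lsn(target_lsn))
```

A caller would check for `None` and skip logical slot handling on such servers, which mirrors the refusal described above without claiming to reproduce how Patroni implements it.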

The old "permanent slots" feature, which creates logical slots right after promotion and before allowing connections, was removed.

Close: https://github.com/zalando/patroni/issues/1749
2021-03-25 16:18:23 +01:00

Feature: standby cluster
  Scenario: prepare the cluster with logical slots
    Given I start postgres1
    Then postgres1 is a leader after 10 seconds
    And there is a non empty initialize key in DCS after 15 seconds
    When I issue a PATCH request to http://127.0.0.1:8009/config with {"loop_wait": 2, "slots": {"pm_1": {"type": "physical"}}, "postgresql": {"parameters": {"wal_level": "logical"}}}
    Then I receive a response code 200
    And Response on GET http://127.0.0.1:8009/config contains slots after 10 seconds
    And I sleep for 3 seconds
    When I issue a PATCH request to http://127.0.0.1:8009/config with {"slots": {"test_logical": {"type": "logical", "database": "postgres", "plugin": "test_decoding"}}}
    Then I receive a response code 200
    And I do a backup of postgres1
    When I start postgres0
    Then "members/postgres0" key in DCS has state=running after 10 seconds
    And replication works from postgres1 to postgres0 after 15 seconds

  @skip
  Scenario: check permanent logical slots are synced to the replica
    Given I run patronictl.py restart batman postgres1 --force
    Then Logical slot test_logical is in sync between postgres0 and postgres1 after 10 seconds
    When I add the table replicate_me to postgres1
    And I get all changes from logical slot test_logical on postgres1
    Then Logical slot test_logical is in sync between postgres0 and postgres1 after 10 seconds

  Scenario: Detach exiting node from the cluster
    When I shut down postgres1
    Then postgres0 is a leader after 10 seconds
    And "members/postgres0" key in DCS has role=master after 3 seconds
    When I issue a GET request to http://127.0.0.1:8008/
    Then I receive a response code 200

  Scenario: check replication of a single table in a standby cluster
    Given I start postgres1 in a standby cluster batman1 as a clone of postgres0
    Then postgres1 is a leader of batman1 after 10 seconds
    When I add the table foo to postgres0
    Then table foo is present on postgres1 after 20 seconds
    And I sleep for 3 seconds
    When I issue a GET request to http://127.0.0.1:8009/master
    Then I receive a response code 503
    When I issue a GET request to http://127.0.0.1:8009/standby_leader
    Then I receive a response code 200
    And I receive a response role standby_leader
    And there is a postgres1_cb.log with "on_role_change standby_leader batman1" in postgres1 data directory
    When I start postgres2 in a cluster batman1
    Then postgres2 role is the replica after 24 seconds
    And table foo is present on postgres2 after 20 seconds
    And postgres1 does not have a logical replication slot named test_logical

  Scenario: check failover
    When I kill postgres1
    And I kill postmaster on postgres1
    Then postgres2 is replicating from postgres0 after 32 seconds
    When I issue a GET request to http://127.0.0.1:8010/master
    Then I receive a response code 503
    And I sleep for 3 seconds
    When I issue a GET request to http://127.0.0.1:8010/standby_leader
    Then I receive a response code 200
    And I receive a response role standby_leader
    And replication works from postgres0 to postgres2 after 15 seconds
    And there is a postgres2_cb.log with "on_start replica batman1\non_role_change standby_leader batman1" in postgres2 data directory