Effectively, this PR consists of a few changes:

1. The easy part: if permanent logical slots are defined in the global configuration, Patroni on the primary will not only create them, but also periodically update DCS with the current values of `confirmed_flush_lsn` for all these slots. To reduce the number of interactions with DCS, the new `/status` key was introduced. It contains a JSON object with `optime` and `slots` keys. For backward compatibility, `/optime/leader` will still be updated if there are members running an old Patroni version in the cluster.

2. The tricky part: on replicas that are eligible for failover, Patroni creates the logical replication slot by copying the slot file from the primary and restarting the replica. To copy the slot file, Patroni opens a connection to the primary with `rewind` or `superuser` credentials and calls the `pg_read_binary_file()` function. Once the logical slot exists on the replica, Patroni periodically calls the `pg_replication_slot_advance()` function, which moves the slot forward (a sketch of these two calls follows below).

3. Additional requirements: to ensure that the primary doesn't clean up tuples from `pg_catalog` that are required for logical decoding, Patroni enables `hot_standby_feedback` on replicas with logical slots, and on cascading replicas if they are used for streaming by replicas with logical slots.

4. When logical slots are copied to the replica, there is a timeframe during which it might not be safe to use them after promotion. Right now there is no protection against promoting such a replica, but Patroni will show a warning with the names of the slots that might not be safe to use.

Compatibility: the `pg_replication_slot_advance()` function is only available starting from PostgreSQL 11. For older Postgres versions Patroni will refuse to create logical slots on the primary. The old "permanent slots" feature, which created logical slots right after promotion and before allowing connections, was removed.

Close: https://github.com/zalando/patroni/issues/1749
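The slot-copy and slot-advance steps described in point 2 reduce to two SQL calls. Below is a minimal Python sketch of that idea, assuming `psycopg2` and made-up DSNs, slot name, and data-directory path; it illustrates the approach rather than Patroni's actual code, and the `/status` payload shown in the comment is likewise only an example of its shape.

```python
# Minimal sketch (not Patroni's actual code). Assumes psycopg2 is installed and
# that the DSNs, slot name and data directory below are placeholders.
import os
import psycopg2

SLOT_NAME = 'test_logical'                                 # hypothetical slot name
PRIMARY_DSN = 'host=primary dbname=postgres user=rewind'   # hypothetical credentials
REPLICA_DSN = 'host=localhost dbname=postgres'
REPLICA_DATA_DIR = '/home/postgres/pgdata'                 # hypothetical data directory

# Illustrative shape of the new /status key written to DCS (values are made up):
#   {"optime": 25165824, "slots": {"test_logical": 25165824}}


def copy_slot_file_from_primary():
    """Read the on-disk state of the logical slot from the primary with
    pg_read_binary_file() and write it into the replica's data directory.
    The replica then has to be restarted so Postgres picks the slot up."""
    path = 'pg_replslot/{0}/state'.format(SLOT_NAME)
    with psycopg2.connect(PRIMARY_DSN) as conn:
        with conn.cursor() as cur:
            cur.execute('SELECT pg_catalog.pg_read_binary_file(%s)', (path,))
            state = cur.fetchone()[0]

    slot_dir = os.path.join(REPLICA_DATA_DIR, 'pg_replslot', SLOT_NAME)
    os.makedirs(slot_dir, exist_ok=True)
    with open(os.path.join(slot_dir, 'state'), 'wb') as f:
        f.write(bytes(state))


def advance_slot(confirmed_flush_lsn):
    """Move an already existing logical slot on the replica forward to the
    confirmed_flush_lsn published by the primary (PostgreSQL 11+ only)."""
    conn = psycopg2.connect(REPLICA_DSN)
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            cur.execute('SELECT pg_catalog.pg_replication_slot_advance(%s, %s)',
                        (SLOT_NAME, confirmed_flush_lsn))
    finally:
        conn.close()
```

In Patroni itself the copy only happens on replicas eligible for failover, and the advance relies on the `confirmed_flush_lsn` values the primary publishes under the `/status` key.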
Feature: standby cluster
  Scenario: prepare the cluster with logical slots
    Given I start postgres1
    Then postgres1 is a leader after 10 seconds
    And there is a non empty initialize key in DCS after 15 seconds
    When I issue a PATCH request to http://127.0.0.1:8009/config with {"loop_wait": 2, "slots": {"pm_1": {"type": "physical"}}, "postgresql": {"parameters": {"wal_level": "logical"}}}
    Then I receive a response code 200
    And Response on GET http://127.0.0.1:8009/config contains slots after 10 seconds
    And I sleep for 3 seconds
    When I issue a PATCH request to http://127.0.0.1:8009/config with {"slots": {"test_logical": {"type": "logical", "database": "postgres", "plugin": "test_decoding"}}}
    Then I receive a response code 200
    And I do a backup of postgres1
    When I start postgres0
    Then "members/postgres0" key in DCS has state=running after 10 seconds
    And replication works from postgres1 to postgres0 after 15 seconds

  @skip
  Scenario: check permanent logical slots are synced to the replica
    Given I run patronictl.py restart batman postgres1 --force
    Then Logical slot test_logical is in sync between postgres0 and postgres1 after 10 seconds
    When I add the table replicate_me to postgres1
    And I get all changes from logical slot test_logical on postgres1
    Then Logical slot test_logical is in sync between postgres0 and postgres1 after 10 seconds

  Scenario: Detach exiting node from the cluster
    When I shut down postgres1
    Then postgres0 is a leader after 10 seconds
    And "members/postgres0" key in DCS has role=master after 3 seconds
    When I issue a GET request to http://127.0.0.1:8008/
    Then I receive a response code 200

  Scenario: check replication of a single table in a standby cluster
    Given I start postgres1 in a standby cluster batman1 as a clone of postgres0
    Then postgres1 is a leader of batman1 after 10 seconds
    When I add the table foo to postgres0
    Then table foo is present on postgres1 after 20 seconds
    And I sleep for 3 seconds
    When I issue a GET request to http://127.0.0.1:8009/master
    Then I receive a response code 503
    When I issue a GET request to http://127.0.0.1:8009/standby_leader
    Then I receive a response code 200
    And I receive a response role standby_leader
    And there is a postgres1_cb.log with "on_role_change standby_leader batman1" in postgres1 data directory
    When I start postgres2 in a cluster batman1
    Then postgres2 role is the replica after 24 seconds
    And table foo is present on postgres2 after 20 seconds
    And postgres1 does not have a logical replication slot named test_logical

  Scenario: check failover
    When I kill postgres1
    And I kill postmaster on postgres1
    Then postgres2 is replicating from postgres0 after 32 seconds
    When I issue a GET request to http://127.0.0.1:8010/master
    Then I receive a response code 503
    And I sleep for 3 seconds
    When I issue a GET request to http://127.0.0.1:8010/standby_leader
    Then I receive a response code 200
    And I receive a response role standby_leader
    And replication works from postgres0 to postgres2 after 15 seconds
    And there is a postgres2_cb.log with "on_start replica batman1\non_role_change standby_leader batman1" in postgres2 data directory