patroni/features/dcs_failsafe_mode.feature
Alexander Kukushkin 4c3af2d1a0 Change master->primary/leader/member (#2541)
Keep as much backward compatibility as possible.

The following changes were made:
1. All internal checks are performed as `role in ('master', 'primary')`
2. All internal variables/functions/methods are renamed
3. `GET /metrics` endpoint returns `patroni_primary` in addition to `patroni_master`.
4. Logs are changed to use leader/primary/member/remote depending on the context
5. Unit tests use only role = 'primary' instead of 'master' to verify that item 1 works.
6. patronictl still supports old syntax, but also accepts `--leader` and `--primary`.
7. `master_(start|stop)_timeout` is automatically translated to `primary_(start|stop)_timeout` if the latter is not set.
8. The documentation and some examples are updated.
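
Items 1 and 7 above can be illustrated with a minimal Python sketch. It is not the code from this commit: the names `PRIMARY_ROLES`, `is_primary` and `translate_timeouts` are hypothetical, and keeping the old key in place after translation is an assumption made only for this example.

```python
# Both spellings are accepted, mirroring the check described in item 1.
PRIMARY_ROLES = ('master', 'primary')


def is_primary(role):
    """Return True for either the legacy or the new role name."""
    return role in PRIMARY_ROLES


def translate_timeouts(config):
    """Copy master_*_timeout to primary_*_timeout when the new key is unset.

    Hypothetical helper: returns a new dict and leaves the input untouched.
    The old key is kept as-is (an assumption, not confirmed by the commit).
    """
    result = dict(config)
    for action in ('start', 'stop'):
        old_key = 'master_{0}_timeout'.format(action)
        new_key = 'primary_{0}_timeout'.format(action)
        if old_key in result and result.get(new_key) is None:
            result[new_key] = result[old_key]
    return result
```

With this sketch, `translate_timeouts({'master_start_timeout': 300})` yields a dict in which `primary_start_timeout` is also 300, while an explicitly set `primary_stop_timeout` is never overwritten by its legacy counterpart.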

Future plan: in the next major release switch role name from `master` to `primary` and maybe drop `master` altogether.
The Kubernetes implementation will require more work and keep two labels in parallel. Label values should probably be configurable as described in https://github.com/zalando/patroni/issues/2495.
2023-01-27 07:40:24 +01:00

Feature: dcs failsafe mode
  We should check the basic dcs failsafe mode functioning

  Scenario: check failsafe mode can be successfully enabled
    Given I start postgres0
    And postgres0 is a leader after 10 seconds
    And I sleep for 3 seconds
    When I issue a PATCH request to http://127.0.0.1:8008/config with {"loop_wait": 2, "ttl": 20, "retry_timeout": 5, "failsafe_mode": true}
    Then I receive a response code 200
    And Response on GET http://127.0.0.1:8008/failsafe contains postgres0 after 10 seconds
    When I issue a GET request to http://127.0.0.1:8008/failsafe
    Then I receive a response code 200
    And I receive a response postgres0 http://127.0.0.1:8008/patroni
    When I issue a PATCH request to http://127.0.0.1:8008/config with {"postgresql": {"parameters": {"wal_level": "logical"}}}
    Then I receive a response code 200
    When I issue a PATCH request to http://127.0.0.1:8008/config with {"slots": {"dcs_slot_0": {"type": "logical", "database": "postgres", "plugin": "test_decoding"}}}
    Then I receive a response code 200

  @dcs-failsafe
  Scenario: check one-node cluster is functioning while DCS is down
    Given DCS is down
    Then Response on GET http://127.0.0.1:8008/primary contains failsafe_mode_is_active after 12 seconds
    And postgres0 role is the primary after 10 seconds

  @dcs-failsafe
  Scenario: check new replica isn't promoted when leader is down and DCS is up
    Given DCS is up
    When I do a backup of postgres0
    And I shut down postgres0
    When I start postgres1 in a cluster batman from backup with no_leader
    And I sleep for 2 seconds
    Then postgres1 role is the replica after 12 seconds

  Scenario: check leader and replica are both in /failsafe key after leader is back
    Given I start postgres0
    And I start postgres1
    Then "members/postgres0" key in DCS has state=running after 10 seconds
    And "members/postgres1" key in DCS has state=running after 2 seconds
    And Response on GET http://127.0.0.1:8009/failsafe contains postgres1 after 10 seconds
    When I issue a GET request to http://127.0.0.1:8009/failsafe
    Then I receive a response code 200
    And I receive a response postgres0 http://127.0.0.1:8008/patroni
    And I receive a response postgres1 http://127.0.0.1:8009/patroni

  @dcs-failsafe
  @slot-advance
  Scenario: check leader and replica are functioning while DCS is down
    Given logical slot dcs_slot_0 is in sync between postgres0 and postgres1 after 10 seconds
    And DCS is down
    Then Response on GET http://127.0.0.1:8008/primary contains failsafe_mode_is_active after 12 seconds
    Then postgres0 role is the primary after 10 seconds
    And postgres1 role is the replica after 2 seconds
    And replication works from postgres0 to postgres1 after 10 seconds
    And I get all changes from logical slot dcs_slot_0 on postgres0
    And logical slot dcs_slot_0 is in sync between postgres0 and postgres1 after 20 seconds

  @dcs-failsafe
  Scenario: check primary is demoted when one replica is shut down and DCS is down
    Given DCS is down
    And I kill postgres1
    And I kill postmaster on postgres1
    And I sleep for 2 seconds
    Then postgres0 role is the replica after 12 seconds

  @dcs-failsafe
  Scenario: check known replica is promoted when leader is down and DCS is up
    Given I shut down postgres0
    And DCS is up
    When I start postgres1
    Then "members/postgres1" key in DCS has state=running after 10 seconds
    And postgres1 role is the primary after 25 seconds

  @dcs-failsafe
  Scenario: check three-node cluster is functioning while DCS is down
    Given I start postgres0
    And I start postgres2
    Then "members/postgres2" key in DCS has state=running after 10 seconds
    And "members/postgres0" key in DCS has state=running after 20 seconds
    And Response on GET http://127.0.0.1:8008/failsafe contains postgres2 after 10 seconds
    And replication works from postgres1 to postgres0 after 10 seconds
    Given DCS is down
    Then Response on GET http://127.0.0.1:8008/primary contains failsafe_mode_is_active after 12 seconds
    Then postgres1 role is the primary after 10 seconds
    And postgres0 role is the replica after 2 seconds
    And postgres2 role is the replica after 2 seconds
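
Taken together, the scenarios suggest a simple rule for the primary while DCS is down: it keeps its role as long as every other member listed in `/failsafe` still responds, and it is demoted as soon as one of them does not. The Python sketch below is one reading of that behaviour, not Patroni's actual implementation; `may_stay_primary`, `own_name` and the `reachable` callback are hypothetical names introduced for illustration.

```python
def may_stay_primary(own_name, failsafe_members, reachable):
    """Hypothetical rule: stay primary only if every other known member answers.

    own_name: name of the current primary, e.g. 'postgres0'.
    failsafe_members: mapping of member name -> REST API URL, shaped like the
                      /failsafe responses checked in the scenarios.
    reachable: callable taking a URL and returning True if that member responded.
    """
    others = [url for name, url in failsafe_members.items() if name != own_name]
    # all() over an empty list is True, which matches the one-node scenario.
    return all(reachable(url) for url in others)


# Example data taken from the URLs used in the feature file.
members = {
    'postgres0': 'http://127.0.0.1:8008/patroni',
    'postgres1': 'http://127.0.0.1:8009/patroni',
}

# postgres1 was killed: its REST API no longer answers.
postgres1_down = lambda url: not url.startswith('http://127.0.0.1:8009')
```

Here `may_stay_primary('postgres0', members, lambda url: True)` is true, while `may_stay_primary('postgres0', members, postgres1_down)` is false, corresponding to the scenario in which postgres0 ends up as a replica.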