Files
patroni/features/basic_replication.feature
krishna b3dc765e6d Choose synchronous nodes based on replication lag (#1786)
This commit makes it possible to configure the maximum lag (`maximum_lag_on_syncnode`) after which Patroni will "demote" the node from synchronous and replace it with another node.

The previous implementation always tried to stick to the same synchronous nodes (even if they are not optimal ones).
2021-02-02 15:45:02 +01:00

97 lines
4.7 KiB
Gherkin

Feature: basic replication
We should check that the basic bootstrapping, replication and failover works.
Scenario: check replication of a single table
Given I start postgres0
Then postgres0 is a leader after 10 seconds
And there is a non empty initialize key in DCS after 15 seconds
When I issue a PATCH request to http://127.0.0.1:8008/config with {"ttl": 20, "loop_wait": 2, "synchronous_mode": true}
Then I receive a response code 200
When I start postgres1
And I configure and start postgres2 with a tag replicatefrom postgres0
And "sync" key in DCS has leader=postgres0 after 20 seconds
And I add the table foo to postgres0
Then table foo is present on postgres1 after 20 seconds
Then table foo is present on postgres2 after 20 seconds
Scenario: check restart of sync replica
Given I shut down postgres2
Then "sync" key in DCS has sync_standby=postgres1 after 5 seconds
When I start postgres2
And I shut down postgres1
Then "sync" key in DCS has sync_standby=postgres2 after 10 seconds
When I start postgres1
And "members/postgres1" key in DCS has state=running after 10 seconds
And I sleep for 2 seconds
When I issue a GET request to http://127.0.0.1:8010/sync
Then I receive a response code 200
When I issue a GET request to http://127.0.0.1:8009/async
Then I receive a response code 200
Scenario: check stuck sync replica
Given I issue a PATCH request to http://127.0.0.1:8008/config with {"maximum_lag_on_syncnode": 15000000, "postgresql": {"parameters": {"synchronous_commit": "remote_apply"}}}
Then I receive a response code 200
And I create table on postgres0
And table mytest is present on postgres1 after 2 seconds
And table mytest is present on postgres2 after 2 seconds
When I pause wal replay on postgres2
And I load data on postgres0
Then "sync" key in DCS has sync_standby=postgres1 after 15 seconds
And I resume wal replay on postgres2
And I sleep for 2 seconds
And I issue a GET request to http://127.0.0.1:8009/sync
Then I receive a response code 200
When I issue a GET request to http://127.0.0.1:8010/async
Then I receive a response code 200
When I issue a PATCH request to http://127.0.0.1:8008/config with {"maximum_lag_on_syncnode": -1, "postgresql": {"parameters": {"synchronous_commit": "on"}}}
Then I receive a response code 200
And I drop table on postgres0
Scenario: check multi sync replication
Given I issue a PATCH request to http://127.0.0.1:8008/config with {"synchronous_node_count": 2}
Then I receive a response code 200
And I sleep for 10 seconds
Then "sync" key in DCS has sync_standby=postgres1,postgres2 after 5 seconds
When I issue a GET request to http://127.0.0.1:8010/sync
Then I receive a response code 200
When I issue a GET request to http://127.0.0.1:8009/sync
Then I receive a response code 200
When I issue a PATCH request to http://127.0.0.1:8008/config with {"synchronous_node_count": 1}
Then I receive a response code 200
And I shut down postgres1
And I sleep for 10 seconds
Then "sync" key in DCS has sync_standby=postgres2 after 10 seconds
When I start postgres1
And "members/postgres1" key in DCS has state=running after 10 seconds
When I issue a GET request to http://127.0.0.1:8010/sync
Then I receive a response code 200
When I issue a GET request to http://127.0.0.1:8009/async
Then I receive a response code 200
Scenario: check the basic failover in synchronous mode
Given I run patronictl.py pause batman
Then I receive a response returncode 0
When I sleep for 2 seconds
And I shut down postgres0
And I run patronictl.py resume batman
Then I receive a response returncode 0
And postgres2 role is the primary after 24 seconds
And Response on GET http://127.0.0.1:8010/history contains recovery after 10 seconds
When I issue a PATCH request to http://127.0.0.1:8010/config with {"synchronous_mode": null, "master_start_timeout": 0}
Then I receive a response code 200
When I add the table bar to postgres2
Then table bar is present on postgres1 after 20 seconds
And Response on GET http://127.0.0.1:8010/config contains master_start_timeout after 10 seconds
Scenario: check immediate failover when master_start_timeout=0
Given I kill postmaster on postgres2
Then postgres1 is a leader after 10 seconds
And postgres1 role is the primary after 10 seconds
Scenario: check rejoin of the former master with pg_rewind
Given I add the table splitbrain to postgres0
And I start postgres0
Then postgres0 role is the secondary after 20 seconds
When I add the table buz to postgres1
Then table buz is present on postgres0 after 20 seconds