mirror of
https://github.com/lingble/patroni.git
synced 2026-03-20 04:02:19 +00:00
Adds a new configuration variable, synchronous_mode. When enabled, Patroni will manage synchronous_standby_names to enable synchronous replication whenever there are healthy standbys available. With synchronous mode enabled, Patroni will automatically fail over only to a standby that was synchronously replicating at the time of master failure. This effectively means zero lost user-visible transactions.

To enforce the synchronous failover guarantee, Patroni stores the current synchronous replication state in the DCS using strict ordering: first enable synchronous replication, then publish the information. A standby can use this to verify that it was indeed a synchronous standby before the master failed and is allowed to fail over.

We can't enable multiple standbys as synchronous and allow PostgreSQL to pick one, because we can't know which one was actually set to be synchronous on the master when it failed. This means that on standby failure, commits will be blocked on the master until the next run_cycle iteration. TODO: figure out a way to poke Patroni to run sooner, or allow PostgreSQL to pick one without the possibility of lost transactions.

On graceful shutdown, standbys will disable themselves by setting a nosync tag for themselves and waiting for the master to notice and pick another standby. This adds a new mechanism for Ha to publish dynamic tags to the DCS.

When the synchronous standby goes away or disconnects, a new one is picked and Patroni switches the master over to the new one. If no synchronous standby exists, Patroni disables synchronous replication (synchronous_standby_names=''), but not synchronous_mode. In this case, only the node that was previously master is allowed to acquire the leader lock.

Added acceptance tests and documentation.

Implementation by @ants with extensive review by @CyberDem0n.
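The strict ordering and the failover check described above can be illustrated with a minimal sketch. This is not Patroni's code: FakeDCS, FakePrimary, enable_sync, disable_sync and may_fail_over are hypothetical names, the DCS is modelled as a plain in-memory dict, and both the revoke-before-disable order and the "empty sync key means no restriction" rule are assumptions made for the illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeDCS:
    """Hypothetical in-memory stand-in for the "sync" key kept in the DCS."""
    sync: Dict[str, Optional[str]] = field(
        default_factory=lambda: {"leader": None, "sync_standby": None})


@dataclass
class FakePrimary:
    """Hypothetical stand-in for the settings of the PostgreSQL master."""
    name: str
    synchronous_standby_names: str = ""


def enable_sync(primary: FakePrimary, dcs: FakeDCS, standby: str) -> None:
    # Step 1: enable synchronous replication on the master first.
    primary.synchronous_standby_names = standby
    # Step 2: only then publish the state, so the DCS never names a
    # standby that was not yet replicating synchronously.
    dcs.sync = {"leader": primary.name, "sync_standby": standby}


def disable_sync(primary: FakePrimary, dcs: FakeDCS) -> None:
    # Assumption: revoke the published privilege before touching the
    # master, i.e. the reverse of the order used when enabling.
    dcs.sync = {"leader": primary.name, "sync_standby": None}
    primary.synchronous_standby_names = ""


def may_fail_over(dcs: FakeDCS, candidate: str) -> bool:
    leader = dcs.sync["leader"]
    if leader is None:
        # Assumption: with no sync state recorded, nothing is restricted.
        return True
    # Only the recorded master or its recorded synchronous standby may
    # acquire the leader lock.
    return candidate in (leader, dcs.sync["sync_standby"])


if __name__ == "__main__":
    dcs, primary = FakeDCS(), FakePrimary("postgres0")
    enable_sync(primary, dcs, "postgres1")
    print(may_fail_over(dcs, "postgres1"))  # True: recorded sync standby
    print(may_fail_over(dcs, "postgres2"))  # False: not recorded as sync
    disable_sync(primary, dcs)
    print(may_fail_over(dcs, "postgres1"))  # False: only postgres0 may lead
```

The point of the ordering is that a standby named in the DCS was already synchronous when its name was published, so a crash between the two steps can only make the published state too conservative.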
41 lines
1.8 KiB
Gherkin
Feature: basic replication
  We should check that the basic bootstrapping, replication and failover works.

  Scenario: check replication of a single table
    Given I start postgres0
    Then postgres0 is a leader after 10 seconds
    When I issue a PATCH request to http://127.0.0.1:8008/config with {"ttl": 20, "loop_wait": 2, "synchronous_mode": true}
    Then I receive a response code 200
    When I start postgres1
    And I configure and start postgres2 with a tag replicatefrom postgres0
    And "sync" key in DCS has leader=postgres0 after 20 seconds
    And I add the table foo to postgres0
    Then table foo is present on postgres1 after 20 seconds
    Then table foo is present on postgres2 after 20 seconds

  Scenario: check restart of sync replica
    Given I run patronictl.py restart batman postgres2 --force
    And "sync" key in DCS has sync_standby=postgres1 after 2 seconds
    And I run patronictl.py restart batman postgres1 --force
    Then I receive a response returncode 0
    And "sync" key in DCS has sync_standby=postgres2 after 10 seconds

  Scenario: check the basic failover in synchronous mode
    When I kill postgres0
    Then postgres2 role is the primary after 22 seconds
    When I issue a PATCH request to http://127.0.0.1:8009/config with {"synchronous_mode": null}
    Then I receive a response code 200
    When I add the table bar to postgres2
    Then table bar is present on postgres1 after 20 seconds

  Scenario: check the basic failover
    Given I shut down postgres2
    Then postgres1 is a leader after 10 seconds
    And postgres1 role is the primary after 10 seconds

  Scenario: check rejoin of the former master with pg_rewind
    Given I start postgres0
    Then postgres0 role is the secondary after 20 seconds
    When I add the table buz to postgres1
    Then table buz is present on postgres0 after 20 seconds