mirror of
https://github.com/outbackdingo/patroni.git
synced 2026-01-27 10:20:10 +00:00
If we know for sure that a few moments ago postgres was still running as a primary, and we still hold the leader lock and can successfully update it, we can safely start postgres back up not in recovery. This avoids bumping the timeline without a reason and will hopefully improve reliability, because it addresses issues similar to #2720.

In addition, remove the `if self.state_handler.is_starting()` check from the `recover()` method. This branch could never be reached because the `starting` state is handled earlier in `_run_cycle()`. Also remove the redundant `self._crash_recovery_executed`.

P.S. We still do not cover the case when Patroni was killed along with Postgres. Let's say we just started Patroni, there is no leader, and `pg_controldata` reports `Database cluster state` as `shut down`. It feels logical to use `Latest checkpoint location` and `Latest checkpoint's TimeLineID` to run the usual leader race and start directly as a primary, but that could be totally wrong. The thing is that we run `postgres --single` before executing `pg_rewind` if the standby wasn't shut down cleanly. As a result, `Database cluster state` transitions from `in archive recovery` to `shut down`, but if such a node becomes a leader the timeline must be increased.
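The restart condition described at the start of the commit message (postgres was recently a primary, the leader lock is still ours, and the lock can be updated) can be sketched roughly as below. This is only an illustration of the reasoning, not Patroni's actual code: `NodeSnapshot` and `should_start_as_primary` are hypothetical names introduced here, and the fallback behaviour mentioned in the docstring is an assumption.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    """Hypothetical view of what the node knows when postgres is found down."""
    was_running_as_primary: bool  # postgres was a primary a few moments ago
    holds_leader_lock: bool       # the leader key in DCS still belongs to us
    lock_update_succeeded: bool   # we were able to refresh the leader key just now


def should_start_as_primary(node: NodeSnapshot) -> bool:
    """Return True only when all three conditions from the description hold.

    Assumption: when this returns False the caller takes the usual path and
    starts postgres in recovery, so a later promotion bumps the timeline.
    """
    return (
        node.was_running_as_primary
        and node.holds_leader_lock
        and node.lock_update_succeeded
    )
```

Requiring all three conditions together is the conservative choice: losing the lock, or failing to update it, means another node may already be taking over.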
25 lines
1.1 KiB
Gherkin
Feature: recovery
  We want to check that crashed postgres is started back

  Scenario: check that timeline is not incremented when primary is started after crash
    Given I start postgres0
    Then postgres0 is a leader after 10 seconds
    And there is a non empty initialize key in DCS after 15 seconds
    When I start postgres1
    And I add the table foo to postgres0
    Then table foo is present on postgres1 after 20 seconds
    When I kill postmaster on postgres0
    Then postgres0 role is the primary after 10 seconds
    When I issue a GET request to http://127.0.0.1:8008/
    Then I receive a response code 200
    And I receive a response role master
    And I receive a response timeline 1

  Scenario: check immediate failover when master_start_timeout=0
    Given I issue a PATCH request to http://127.0.0.1:8008/config with {"master_start_timeout": 0}
    Then I receive a response code 200
    And Response on GET http://127.0.0.1:8008/config contains master_start_timeout after 10 seconds
    When I kill postmaster on postgres0
    Then postgres1 is a leader after 10 seconds
    And postgres1 role is the primary after 10 seconds
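The final assertions of the first scenario (response role `master`, timeline `1`) amount to a check on the JSON body returned by the REST API. A minimal sketch of that check follows. The helper name `check_status` and the sample body are made up for illustration, and the actual step implementations in the test suite may look quite different.

```python
import json


def check_status(body: str, expected_role: str, expected_timeline: int) -> bool:
    """Return True if the JSON status body reports the expected role and timeline."""
    status = json.loads(body)
    return (
        status.get("role") == expected_role
        and status.get("timeline") == expected_timeline
    )


# Sample body, invented for this example; a real response carries more fields.
sample = '{"state": "running", "role": "master", "timeline": 1}'
print(check_status(sample, "master", 1))  # True
```

A timeline of 2 in the same body would make the check fail, which is exactly the regression the scenario is meant to catch.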