patroni/features/recovery.feature
Alexander Kukushkin 6e96db173f Start postgres not in recovery in some cases (#2726)
If we know for sure that postgres was still running as a primary a few moments ago, and we still hold the leader lock and can successfully update it, we can safely start postgres back up not in recovery. This avoids bumping the timeline without a reason and should improve reliability, because it addresses issues similar to #2720.
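
The condition described above can be sketched as a single predicate. This is only an illustration of the reasoning, not Patroni's actual code: `NodeState`, its fields, and `should_start_without_recovery` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    # All field names are hypothetical and exist only for this sketch.
    was_running_as_primary: bool  # postgres was seen running as a primary moments ago
    has_leader_lock: bool         # this node still owns the leader lock
    leader_lock_updated: bool     # the latest attempt to update the lock succeeded


def should_start_without_recovery(state: NodeState) -> bool:
    """Return True only when all three conditions from the commit message hold."""
    return (
        state.was_running_as_primary
        and state.has_leader_lock
        and state.leader_lock_updated
    )


# A crashed primary that still holds and can refresh the lock may restart
# as a primary, so the timeline does not need to be bumped.
crashed_primary = NodeState(True, True, True)
# If the lock update failed, the node must not assume it is still the leader.
lost_lock = NodeState(True, True, False)
```

If any of the three conditions is false, the sketch falls back to the conservative answer, which corresponds to starting postgres in recovery.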

In addition, remove the `if self.state_handler.is_starting()` check from the `recover()` method. This branch could never be reached because the `starting` state is handled earlier in `_run_cycle()`. Also remove the redundant `self._crash_recovery_executed`.

P.S. Currently we do not cover cases where Patroni was killed along with Postgres.
Let's consider that we have just started Patroni, there is no leader, and `pg_controldata` reports `Database cluster state` as `shut down`. It feels logical to use `Latest checkpoint location` and `Latest checkpoint's TimeLineID` to run the usual leader race and start directly as a primary, but that could be totally wrong. The thing is that, before executing `pg_rewind`, we run `postgres --single` if the standby wasn't shut down cleanly. As a result, `Database cluster state` transitions from `in archive recovery` to `shut down`, but if such a node becomes a leader, the timeline must be increased.
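
To make the three `pg_controldata` fields concrete, here is a minimal parsing sketch. The `SAMPLE` text is made-up illustrative output (the checkpoint location and timeline values are invented), and `parse_controldata` and `looks_cleanly_shut_down` are hypothetical helpers, not Patroni functions.

```python
# Illustrative excerpt in the "key: value" layout printed by pg_controldata.
# The values are invented for this example.
SAMPLE = """\
Database cluster state:               shut down
Latest checkpoint location:           0/3000028
Latest checkpoint's TimeLineID:       1
"""


def parse_controldata(text: str) -> dict:
    """Split each 'key: value' line at the first colon."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key.strip()] = value.strip()
    return result


def looks_cleanly_shut_down(data: dict) -> bool:
    # Necessary but not sufficient: a standby that went through
    # `postgres --single` also ends up in the `shut down` state.
    return data.get("Database cluster state") == "shut down"


data = parse_controldata(SAMPLE)
```

The point of the sketch is that `looks_cleanly_shut_down(data)` returns the same answer for a former primary and for a standby that was shut down via single-user mode, so these fields alone cannot justify starting as a primary on the old timeline.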
2023-07-12 09:42:34 +02:00

25 lines
1.1 KiB
Gherkin

Feature: recovery
  We want to check that crashed postgres is started back

  Scenario: check that timeline is not incremented when primary is started after crash
    Given I start postgres0
    Then postgres0 is a leader after 10 seconds
    And there is a non empty initialize key in DCS after 15 seconds
    When I start postgres1
    And I add the table foo to postgres0
    Then table foo is present on postgres1 after 20 seconds
    When I kill postmaster on postgres0
    Then postgres0 role is the primary after 10 seconds
    When I issue a GET request to http://127.0.0.1:8008/
    Then I receive a response code 200
    And I receive a response role master
    And I receive a response timeline 1

  Scenario: check immediate failover when master_start_timeout=0
    Given I issue a PATCH request to http://127.0.0.1:8008/config with {"master_start_timeout": 0}
    Then I receive a response code 200
    And Response on GET http://127.0.0.1:8008/config contains master_start_timeout after 10 seconds
    When I kill postmaster on postgres0
    Then postgres1 is a leader after 10 seconds
    And postgres1 role is the primary after 10 seconds