Mirror of https://github.com/outbackdingo/patroni.git, synced 2026-01-28 02:20:04 +00:00
It is not very common, but the master Postgres might "crash" for various reasons, such as OOM or running out of disk space. There is a chance that the current node holds some unreplicated data, and therefore Patroni by default prefers to start Postgres on the leader node rather than doing a failover.

To be on the safe side, Patroni always starts Postgres in recovery, no matter whether the current node owns the leader lock or not. If Postgres wasn't shut down cleanly, starting in recovery might fail, so in some cases Patroni works around this by executing crash recovery, i.e. by starting Postgres in single-user mode.

A few times we ended up in the following situation:

1. The master Postgres crashed because it ran out of disk space.
2. Patroni started crash recovery in single-user mode.
3. While doing crash recovery, Patroni kept updating the leader lock.

This leaves Patroni stuck on step 3: because the leader lock keeps being refreshed, no replica can take over, and manual intervention is required to recover the cluster.

Patroni already has the `master_start_timeout` option, which controls how long we let Postgres stay in the `starting` state. After that, Patroni might decide to release the leader lock if there are healthy replicas available that could take it over.

This PR makes the `master_start_timeout` option also work for crash recovery.
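The intended behavior can be illustrated with a minimal sketch. This is not the code from the PR: the class `CrashRecoveryWatchdog`, its methods, and the returned action strings are hypothetical names chosen for illustration, and the exact comparison against the timeout is an assumption. Only the `master_start_timeout` option and the "release the lock if a healthy replica is available" rule come from the description above.

```python
import time


class CrashRecoveryWatchdog:
    """Hypothetical helper (not Patroni's API): decides what to do with the
    leader lock while crash recovery is still running."""

    def __init__(self, master_start_timeout, clock=time.monotonic):
        # master_start_timeout: seconds we tolerate Postgres not being up yet.
        # clock: injectable time source, so the logic can be tested without sleeping.
        self.master_start_timeout = master_start_timeout
        self._clock = clock
        self._started_at = None

    def begin(self):
        """Remember when crash recovery (single-user mode) was started."""
        self._started_at = self._clock()

    def timed_out(self):
        """True once recovery has been running longer than the timeout."""
        if self._started_at is None:
            return False
        return self._clock() - self._started_at > self.master_start_timeout

    def next_action(self, has_leader_lock, healthy_replica_available):
        """Return the action for the current cycle of the HA loop."""
        if not has_leader_lock:
            # Nothing to release; just let recovery continue.
            return "continue recovery"
        if self.timed_out() and healthy_replica_available:
            # Give a replica the chance to take over instead of blocking forever.
            return "release leader lock"
        # Before this change, this branch was effectively the only outcome
        # during crash recovery, which is what made the cluster get stuck.
        return "update leader lock"
```

With a fake clock, the decision flips from `"update leader lock"` to `"release leader lock"` once the elapsed time exceeds the timeout (300 seconds is just an example value here) and a healthy replica is reported; without a healthy replica the node keeps the lock.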