187 Commits

Alexander Kukushkin
496d14e6ca Better handling of failed pg_rewind attempt (#2304)
Close #2302
2022-05-19 14:52:26 +02:00
Alexander Kukushkin
5f6197aaad Don't copy logical slot if there is mismatch with the config (#2274)
A couple of times we have seen in the wild that the database for the permanent logical slots was changed in the Patroni config.

It resulted in the following situation.
On the primary:
1. The slot must be dropped before creating it in a different DB.
2. Patroni fails to drop it because the slot is in use.

Replica:
1. Patroni notices that the slot exists in the wrong DB and successfully drops it.
2. Patroni copies the existing slot from the primary by its name, which requires a Postgres restart.

And the loop repeats while the "wrong" slot exists on the primary.

Basically, replicas are continuously restarting, which badly affects availability.

In order to solve the problem, we will perform additional checks while copying replication slot files from the primary and discard them if `slot_type`, `database`, or `plugin` don't match our expectations.
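A minimal sketch (not Patroni's actual code) of that kind of check; the dictionaries, databases, and slot names are made up for illustration:
```python
def slot_matches_config(expected, actual):
    """Compare only the attributes relevant for reusing a copied slot."""
    return all(expected.get(key) == actual.get(key)
               for key in ('slot_type', 'database', 'plugin'))

expected = {'slot_type': 'logical', 'database': 'orders', 'plugin': 'pgoutput'}
copied = {'slot_type': 'logical', 'database': 'appdb', 'plugin': 'pgoutput'}
if not slot_matches_config(expected, copied):
    print('discarding copied slot: attributes do not match the configuration')
```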
2022-04-14 12:10:37 +02:00
Alexander Kukushkin
81912c9cae Handle rewind when demoted node was shut down (#2252)
In case of DCS unavailability Patroni restarts Postgres in read-only mode.
This causes pg_control to be updated with `Database cluster state: in archive recovery` and could also set the `MinRecoveryPoint`.

The next time Patroni is started it will assume that Postgres was running as a replica, that a rewind isn't required, and will try to start Postgres up. In this situation there is a chance that the start will be aborted with a FATAL error message like `requested timeline 2 does not contain minimum recovery point 0/501E8B8 on timeline 1`.
On the next heartbeat Patroni will again notice that Postgres isn't running, which leads to another failed start attempt.

This loop is endless.

In order to mitigate the problem we do the following:
1. While figuring out whether a rewind is required we consider `in archive recovery` along with `shut down in recovery`.
2. If pg_rewind is required and the cluster state is `in archive recovery` we also perform recovery in single-user mode.
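An illustrative sketch (not Patroni's implementation) of how the `Database cluster state` value from pg_controldata could feed the decision in point 1; the helper names are hypothetical:
```python
import subprocess

def cluster_state(data_dir):
    """Extract 'Database cluster state' from pg_controldata output."""
    out = subprocess.check_output(['pg_controldata', data_dir], text=True)
    for line in out.splitlines():
        if line.startswith('Database cluster state:'):
            return line.split(':', 1)[1].strip()
    return ''

def stopped_in_recovery(state):
    # treat a node that was stopped while still in recovery the same way,
    # whether pg_control says "shut down in recovery" or "in archive recovery"
    return state in ('shut down in recovery', 'in archive recovery')
```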

Close https://github.com/zalando/patroni/issues/2242
2022-03-24 13:51:59 +01:00
Alexander Kukushkin
fce889cd04 Compatibility with psycopg 3.0 (#2088)
By default `psycopg2` is preferred. `psycopg>=3.0` will be used only if `psycopg2` is not available or its version is too old.
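A hedged sketch of that preference (the real selection also checks the psycopg2 version, which is omitted here):
```python
try:
    import psycopg2 as psycopg_module  # preferred when available and recent enough
except ImportError:
    import psycopg as psycopg_module   # psycopg>=3.0 as a fallback
```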
2021-11-19 14:32:54 +01:00
Alexander Kukushkin
89388c2e4b Handle DCS exceptions when demoting (#2081)
While demoting due to a failure to update the leader lock, it could happen that the DCS goes completely down and the get_cluster() call raises an exception.
If not handled properly, this results in Postgres remaining stopped until the DCS recovers.
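A minimal sketch, with hypothetical names, of the kind of guard this adds:
```python
def get_cluster_safely(dcs):
    """Hypothetical guard: return None instead of propagating a DCS outage."""
    try:
        return dcs.get_cluster()
    except Exception:
        return None  # DCS is down; the caller must still finish the demote
```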
2021-10-07 16:08:10 +02:00
Alexander Kukushkin
d394b63c9f Release the leader lock when pg_controldata reports "shut down" (#2067)
For various reasons, WAL archiving on the primary may get stuck or be significantly delayed. If we try to do a switchover or shut the primary down, the shutdown takes forever and does not finish until the whole backlog of WALs has been processed.
In the meantime, Patroni keeps updating the leader lock, which prevents other nodes from starting the leader race even if it is known that they received/applied all changes.

The `Database cluster state:` changes to `"shut down"` after:
- all data has been fsynced to disk and the latest checkpoint has been written to WAL
- all streaming replicas have confirmed that they received all changes (including the latest checkpoint)

At the same time, the archiver process continues to do its job and the postmaster process is still running.

In order to solve this problem and make the switchover more reliable/fast when `archive_command` is slow or failing, Patroni will remove the leader key immediately after `pg_controldata` starts reporting PGDATA as cleanly `"shut down"` and it has verified that there is at least one replica that received all changes. If no replica fulfills that condition the leader key isn't removed and the old behavior is retained, i.e. Patroni keeps updating it.
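A rough sketch, not Patroni's code, of that condition; LSNs are assumed to be comparable integers:
```python
def can_release_leader_key(controldata_state, checkpoint_lsn, replica_lsns):
    """Release early only on a clean shutdown with at least one caught-up replica."""
    if controldata_state != 'shut down':
        return False
    return any(lsn >= checkpoint_lsn for lsn in replica_lsns)
```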
2021-10-05 10:55:35 +02:00
Alexander Kukushkin
1c2bf258d6 Allow switchover only to sync nodes when synchronous replication is on (#2076)
Close https://github.com/zalando/patroni/issues/2074
2021-10-04 16:23:45 +02:00
Michael Banck
2f31e88bdc Add dcs_last_seen field to API (#2051)
This field notes the last time (as unix epoch) a cluster member has successfully communicated with the DCS. This is useful to identify and/or analyze network partitions.

Also, expose dcs_last_seen in the MemberStatus class and its from_api_response() method.
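For example, the field can be read from a member's REST API like this (a sketch; the address and port are assumptions based on the usual defaults):
```python
import json
import time
import urllib.request

# port 8008 and the /patroni endpoint are the common defaults; adjust as needed
with urllib.request.urlopen('http://127.0.0.1:8008/patroni') as resp:
    status = json.load(resp)

age = time.time() - status['dcs_last_seen']
print('last successful DCS contact was %.0f seconds ago' % age)
```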
2021-09-22 10:01:35 +02:00
Michael Banck
fae96b3148 Improve "I am" status messages (#2056) 2021-09-17 14:46:07 +02:00
Alexander Kukushkin
93efa91bbd Release 2.1.1 (#2039)
* Update release notes
* Bump version
* Improve unit-test coverage
2021-08-19 15:44:37 +02:00
Christian Clauss
75e52226a8 Fix typos discovered by codespell (#1997) 2021-07-06 10:01:30 +02:00
Alexander Kukushkin
333d292eb3 Handle DNS issues in Raft implementation (#1960)
- Resolve Node IP for every connection attempt
- Handle exceptions from connection failures caused by failed name resolution
- Set PySyncObj DNS Cache timeouts aligned with `loop_wait` and `ttl`

In addition to that, postpone the leader race for freshly started Raft nodes. This helps in the situation where the leader node was alone and demoted Postgres, and after that a replica arrives and quickly takes the leader lock without really performing the leader race.

Close https://github.com/zalando/patroni/issues/1930, https://github.com/zalando/patroni/issues/1931
2021-07-05 09:30:31 +02:00
Alexander Kukushkin
0ceb59b49d Write the prev LSN (before the checkpoint) to optime if wal_archive=on (#1889)
#1527 introduced updating `/optime/leader` with the location of the last checkpoint after Postgres was shut down cleanly.

If WAL archiving is enabled, Postgres always switches the WAL file before writing the shutdown checkpoint record. Normally this is not an issue, but for databases with little write activity it can make the visible replication lag equal the size of a single WAL file, even though the previous WAL file is mostly empty and contains only a few records.

Therefore it should be safe to report the LSN of the SWITCH record that precedes the shutdown checkpoint.
In order to do that, Patroni first gets the output of pg_controldata and, based on it, calls pg_waldump two times:
* The first call reads the checkpoint record (and verifies that this is really the shutdown checkpoint).
* The second call reads the previous record; if it is an 'xlog switch' (for 9.3 and 9.4) or 'SWITCH' (for 9.5+) record, the LSN of the SWITCH record is written to `/optime/leader`.

In case of any mismatch, or a failure to call pg_waldump or parse its output, the old behavior is retained, i.e. the `Latest checkpoint location` from pg_controldata is used.
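A simplified sketch of the first pg_waldump call described above; the option set is minimal and the output parsing of the real implementation is omitted:
```python
import subprocess

def shutdown_checkpoint_record(wal_dir, timeline, checkpoint_lsn):
    """Read a single record at the checkpoint LSN and return its description."""
    out = subprocess.check_output(
        ['pg_waldump', '-p', wal_dir, '-t', str(timeline),
         '-s', checkpoint_lsn, '-n', '1'],
        text=True, stderr=subprocess.DEVNULL)
    return out.strip()

# The second pg_waldump call (reading the record just before the checkpoint and
# checking whether it is a SWITCH record) would follow the same pattern.
```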

Close https://github.com/zalando/patroni/issues/1860
2021-07-05 09:29:39 +02:00
Florian Bütler
e2d8a7d086 fix minor typo (#1991)
close #1990
2021-07-02 08:27:17 +02:00
Alexander Kukushkin
f403719bb4 Reduce chattiness of Patroni logs (#1955)
1. When everything is normal, only one line is written for every run of the HA loop (see examples):
```
INFO: no action. I am (postgresql0) the leader with the lock
INFO: no action. I am a secondary (postgresql1) and following a leader (postgresql0)
```

2. The `does not have lock` message became a debug message.
3. The `Lock owner: postgresql0; I am postgresql1` line is shown only when something doesn't look normal.
2021-06-22 09:13:30 +02:00
Alexander Kukushkin
c7173aadd7 Failover logical slots (#1820)
Effectively, this PR consists of a few changes:

1. The easy part:
  In case permanent logical slots are defined in the global configuration, Patroni on the primary will not only create them, but will also periodically update the DCS with the current `confirmed_flush_lsn` values of all these slots.
  In order to reduce the number of interactions with the DCS, the new `/status` key was introduced. It contains a JSON object with `optime` and `slots` keys. For backward compatibility `/optime/leader` will still be updated if there are members running an old Patroni version in the cluster.

2. The tricky part:
  On replicas that are eligible for a failover, Patroni creates the logical replication slot by copying the slot file from the primary and restarting the replica. In order to copy the slot file Patroni opens a connection to the primary with `rewind` or `superuser` credentials and calls the `pg_read_binary_file()` function.
  Once the logical slot exists on the replica, Patroni periodically calls the `pg_replication_slot_advance()` function, which moves the slot forward (see the sketch after this list).

3. Additional requirements:
  In order to ensure that the primary doesn't clean up tuples from pg_catalog that are required for logical decoding, Patroni enables `hot_standby_feedback` on replicas with logical slots, and on cascading replicas if replicas with logical slots stream from them.

4. When logical slots are copied to the replica there is a timeframe during which it may not be safe to use them after promotion. Right now there is no protection against promoting such a replica, but Patroni will show a warning with the names of the slots that might not be safe to use.
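A hedged example of the periodic advance mentioned in point 2; the connection parameters, slot name, and LSN are placeholders:
```python
import psycopg2

conn = psycopg2.connect(host='replica.example.com', dbname='postgres', user='postgres')
conn.autocommit = True
with conn.cursor() as cur:
    # move the copied logical slot forward to the position confirmed on the primary
    cur.execute("SELECT pg_replication_slot_advance(%s, %s::pg_lsn)",
                ('my_logical_slot', '0/3000060'))
```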

Compatibility.
The `pg_replication_slot_advance()` function is only available starting from PostgreSQL 11. For older Postgres versions Patroni will refuse to create the logical slot on the primary.

The old "permanent slots" feature, which creates logical slots right after promotion and before allowing connections, was removed.

Close: https://github.com/zalando/patroni/issues/1749
2021-03-25 16:18:23 +01:00
Alexander Kukushkin
e8e87bf0a1 Don't interrupt restart or promote if lost leader lock in pause (#1726)
In pause mode it is allowed to run Postgres as master without holding the leader lock.
2020-10-08 08:56:53 +02:00
Alexander Kukushkin
fa88d80c4f Apply master_start_timeout when executing crash recovery (#1720)
It is not very common, but the master Postgres might "crash" for different reasons, like OOM or running out of disk space. Of course, there is a chance that the current node holds some unreplicated data, and therefore Patroni by default prefers to start Postgres on the leader node rather than doing a failover.

In order to be on the safe side Patroni always starts Postgres in recovery, no matter whether the current node owns the leader lock or not. If Postgres wasn't shut down cleanly, starting in recovery might fail, therefore in some cases Patroni works around this by executing a crash recovery, i.e. starting Postgres in single-user mode.

A few times we ended up in the following situation:
1. The master Postgres crashed due to running out of disk space
2. Patroni started crash recovery in single-user mode
3. While doing crash recovery Patroni kept updating the leader lock

This leaves Patroni stuck on step 3, and manual intervention is required to recover the cluster.

Patroni already has the `master_start_timeout` option, which controls how long we let Postgres stay in the `starting` state; after that Patroni might decide to release the leader lock if there are healthy replicas available that could take it over.

This PR makes the `master_start_timeout` option also work for crash recovery.
2020-09-30 08:04:27 +02:00
Alexander Kukushkin
2c5d62bf10 Workaround unittest bug and fix requirements (#1718)
* unittest bug: https://bugs.python.org/issue25532
* `urllib3[secure]` wrongly depends on `ipaddress` for Python 3, while in fact we don't need all dependencies of the `secure` extra, only `ipaddress` for `kubernetes` on Python 2.7

Close https://github.com/zalando/patroni/issues/1717
Close https://github.com/zalando/patroni/issues/1709
2020-09-29 15:15:58 +02:00
Alexander Kukushkin
8a8409999d Change the behavior in pause (#1687)
1. Don't call bootstrap if PGDATA is missing/empty, because it might be on purpose, with someone/something working on it.
2. Consider Postgres running as a leader in pause unhealthy if the pg_control sysid doesn't match the /initialize key (an empty initialize key would allow the "race" and the leader would "restore" the initialize key).
3. Don't exit on sysid mismatch in pause, only log a warning.
4. Cover corner cases where Patroni is started in pause with an empty PGDATA that was restored by somebody else.
5. An empty string is a valid `recovery_target`.
2020-09-18 08:25:00 +02:00
Alexander Kukushkin
4dd902fbf1 Fix bug in kubernetes.update_leader (#1685)
An unhandled exception prevented demoting the primary.
In addition, wrap the update_leader call in the HA loop into a try..except block and implement a test case.
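A hypothetical wrapper mirroring the shape of that guard (names and return convention are illustrative, not Patroni's exact API):
```python
import logging

logger = logging.getLogger(__name__)

def update_leader_guarded(dcs, cluster):
    """Return False instead of letting a DCS error escape the HA loop."""
    try:
        return dcs.update_leader(cluster)
    except Exception:
        logger.exception('Unexpected exception raised from update_leader')
        return False  # treated like a failed lock update, which triggers a demote
```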

Fixes https://github.com/zalando/patroni/issues/1684
2020-09-11 10:19:03 +02:00
Sergey Dudoladov
950eff27ad Optional fencing script (pre_promote) (#1099)
Call a fencing script after acquiring the leader lock. If the script doesn't finish successfully, don't promote, but remove the leader key instead.
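A minimal sketch of that flow under stated assumptions (the DCS method names and callback signature here are illustrative):
```python
import subprocess

def acquire_lock_and_maybe_promote(dcs, promote, pre_promote_cmd):
    """Fence first; give the leader key back if fencing fails."""
    if not dcs.attempt_to_acquire_leader():
        return
    if pre_promote_cmd and subprocess.call(pre_promote_cmd, shell=True) != 0:
        dcs.delete_leader()  # fencing failed: remove the leader key instead of promoting
        return
    promote()
```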

Close https://github.com/zalando/patroni/issues/1567
2020-09-01 07:50:39 +02:00
Alexander Kukushkin
3e553df69d BUGFIX: pause on K8s (#1659)
On K8s `Cluster.leader` is a valid object even if the cluster has no leader, because we need to know the `resourceVersion` for future CAS operations. Such a non-empty object broke the HA loop and made other nodes think that the leader is there.

The right way to identify a missing leader, which works reliably across all DCS implementations, is to check that the leader's name is empty.
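A sketch of that check; the object shape is assumed for illustration:
```python
def has_leader(cluster):
    """The leader is only real when the leader record carries a member name."""
    return bool(cluster.leader and cluster.leader.name)
```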
2020-08-24 16:35:46 +02:00
ksarabu1
1ab709c5f0 Multi Sync Standby Support (#1594)
The new parameter `synchronous_node_count` is used by Patroni to manage the number of synchronous standby databases. It is set to 1 by default and has no effect when synchronous_mode is off. When enabled, Patroni manages the precise number of synchronous standby databases based on synchronous_node_count and adjusts the state in the DCS and synchronous_standby_names as members join and leave.

This functionality can be further extended in the future to support priority-based (FIRST n) and quorum-based (ANY n) synchronous replication.
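As an illustration of the effect (not Patroni's code), a synchronous_standby_names value for synchronous_node_count = 2 could be assembled like this; the member names are made up:
```python
def build_sync_standby_names(sync_members, synchronous_node_count):
    """Quote the chosen member names and prefix them with the node count."""
    chosen = sync_members[:synchronous_node_count]
    return '%d (%s)' % (synchronous_node_count,
                        ', '.join('"%s"' % name for name in chosen))

print(build_sync_standby_names(['node2', 'node3', 'node4'], 2))
# -> 2 ("node2", "node3")
```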
2020-08-14 11:51:07 +02:00
Alexander Kukushkin
3341c898ff Add Etcd v3 protocol support via api gRPC-gateway (#1162)
The only python-etcd3 client that works directly via gRPC still supports only a single endpoint, which is not very nice for high availability.

Since Patroni is already using a heavily hacked version of python-etcd with smart retries and auto-discovery out-of-the-box, I decided to enhance the existing code with limited support of v3 protocol via gRPC-gateway.

Unfortunately, watches via gRPC-gateway require us to open and keep a second connection to etcd.

Known limitations:
* The minimum supported version is 3.0.4. On earlier versions transactions don't work due to bugs in grpc-gateway, and without transactions we can't do atomic operations, i.e. leader locks.
* Watches work only starting from 3.1.0
* Authentication works only starting from 3.3.0
* gRPC-gateway does not support authentication using TLS Common Name. This is because gRPC-proxy terminates TLS from its client so all the clients share a cert of the proxy: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/authentication.md#using-tls-common-name
2020-07-31 14:33:40 +02:00
Alexander Kukushkin
59dae9b1bb A few little bug-fixes (#1623)
1. Put the `*` character into pgpass if the actual value is empty (see the sketch below)
2. Re-raise a fatal exception from the HA loop (we need to exit if, for example, cluster initialization failed)
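A sketch of the pgpass rule from point 1 (the helper and values are illustrative; real passwords would also need escaping):
```python
def pgpass_line(host, port, dbname, user, password):
    """Replace empty fields with '*' so the resulting pgpass line stays valid."""
    return ':'.join(str(value) if value not in (None, '') else '*'
                    for value in (host, port, dbname, user, password))

print(pgpass_line('10.0.0.5', 5432, '', 'replicator', 'secret'))
# -> 10.0.0.5:5432:*:replicator:secret
```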

Close https://github.com/zalando/patroni/issues/1617
2020-07-28 08:37:04 +02:00
Alexander Kukushkin
8eb01c77b6 Don't fire on_reload when promoting to standby_leader on 13+ (#1552)
PostgreSQL 13 finally introduced the possibility to change `primary_conninfo` without a restart; a reload is enough. But when the role changes from `replica` to `standby_leader` we want to call only the `on_role_change` callback and skip `on_reload`, because they duplicate each other.
2020-06-29 14:49:25 +02:00
Alexander Kukushkin
cd1b2741fa Improve timeline divergence check (#1563)
We don't need to rewind when:
1. the replayed location of the former replica is not ahead of the switchpoint
2. the end of the checkpoint record of the former primary is the same as the switchpoint

In order to get the end of the checkpoint record we use `pg_waldump` and parse its output.
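A decision sketch, under stated assumptions, of the two cases above; LSNs are plain integers here and the role refers to what the node was before the divergence:
```python
def rewind_needed(former_role, replayed_lsn, checkpoint_end_lsn, switchpoint):
    if former_role == 'replica' and replayed_lsn <= switchpoint:
        return False  # case 1: never replayed past the point of divergence
    if former_role == 'primary' and checkpoint_end_lsn == switchpoint:
        return False  # case 2: nothing was written after the switchpoint
    return True
```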

Close https://github.com/zalando/patroni/issues/1493
2020-05-29 14:15:10 +02:00
Alexander Kukushkin
98c2081c67 Detect a new timeline in the standby cluster (#1522)
The standby cluster doesn't know about leader elections in the main cluster and therefore the usual mechanisms of detecting divergences don't work. For example, it could happen that the standby cluster is ahead of the new primary of the main cluster and must be rewound.
There is a way to know that a new timeline has been created: check for the presence of a new history file in pg_wal. If the new file is there, we will start the usual procedures of making sure that we can continue streaming, or will run pg_rewind.
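A sketch of that history-file check (the directory layout and timeline handling are simplified):
```python
import os

def new_timeline_created(wal_dir, current_timeline):
    """Check for the history file of the next timeline, e.g. 00000003.history."""
    history_file = '%08X.history' % (current_timeline + 1)
    return os.path.exists(os.path.join(wal_dir, history_file))
```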
2020-05-29 14:14:47 +02:00
Alexander Kukushkin
c6207933d1 Properly handle the exception raised from refresh_session (#1531)
`touch_member()` can be called from the finally block of `_run_cycle()`. If it raised an exception, the whole Patroni process was crashing.
In order to avoid future crashes we wrap `_run_cycle()` into a try..except block and ask the user to report a BUG.

Close https://github.com/zalando/patroni/issues/1529
2020-05-29 14:14:11 +02:00
Alexander Kukushkin
ad5c686c11 Take advantage of pg_stat_wal_receiver (#1513)
So far Patroni was parsing `recovery.conf` or querying `pg_settings` in order to get the current values of recovery parameters. On PostgreSQL earlier than 12 it could easily happen that the value of `primary_conninfo` in `recovery.conf` has nothing to do with reality. Luckily for us, on PostgreSQL 9.6+ there is the `pg_stat_wal_receiver` view, which contains the current values of `primary_conninfo` and `primary_slot_name`. The password field is masked though, but this is fine, because authentication happens only when the connection is opened. All other parameters we compare as usual.

Another advantage of `pg_stat_wal_receiver` is that it contains the current timeline, therefore on 9.6+ we don't need to use the replication connection trick if the walreceiver process is alive.

If there is no walreceiver process available or it is not streaming we will stick to old methods.
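A hedged example of reading the view on 9.6+ (connection parameters are placeholders):
```python
import psycopg2

with psycopg2.connect(dbname='postgres', user='postgres') as conn:
    with conn.cursor() as cur:
        cur.execute('SELECT status, received_tli, slot_name, conninfo'
                    ' FROM pg_stat_wal_receiver')
        row = cur.fetchone()
        if row and row[0] == 'streaming':
            print('walreceiver is streaming on timeline', row[1])
```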
2020-05-15 18:04:24 +02:00
Alexander Kukushkin
08b3d5d20d Move ensure_clean_shutdown into rewind module (#1528)
Logically fits there better
2020-05-15 16:22:57 +02:00
Alexander Kukushkin
7cf0b753ab Update optime/leader with checkpoint location after clean shut down (#1527)
Potentially this information could be used in order to make sure that there is no data loss on switchover.
2020-05-15 16:13:16 +02:00
Alexander Kukushkin
80fbe90056 Issue CHECKPOINT explicitly after promote happened (#1498)
It is safe to call pg_rewind on the replica only when pg_control on the primary contains information about the latest timeline. Postgres usually does an immediate checkpoint right after promote, and in most cases it works just fine. Unfortunately, we regularly receive complaints that it takes too long (minutes) until the checkpoint is done and replicas can't perform a rewind, while doing the checkpoint manually helped immediately. So Patroni starts doing the same: when the promotion has happened and Postgres is no longer running in recovery, we explicitly issue a checkpoint.

We are intentionally not using the AsyncExecutor here, because we want the HA loop to continue its normal flow.
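A minimal sketch of that explicit checkpoint; connection setup and error handling of the real code are omitted:
```python
def checkpoint_after_promote(conn):
    """Issue an explicit CHECKPOINT once the node has left recovery."""
    with conn.cursor() as cur:
        cur.execute('SELECT pg_is_in_recovery()')
        if not cur.fetchone()[0]:
            cur.execute('CHECKPOINT')
```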
2020-04-20 11:55:05 +02:00
ksarabu1
e3335bea1a Master stop timeout (#1445)
## Feature: Postgres stop timeout

A switchover/failover operation hangs on the signal_stop (or checkpoint) call when the postmaster doesn't respond or hangs for some reason (issue described in [1371](https://github.com/zalando/patroni/issues/1371)). This leads to service loss for an extended period of time until the hung postmaster starts responding or is killed by some other actor.

### master_stop_timeout

The number of seconds Patroni is allowed to wait when stopping Postgres; effective only when synchronous_mode is enabled. When set to a value > 0 and synchronous_mode is enabled, Patroni sends SIGKILL to the postmaster if the stop operation runs for longer than master_stop_timeout. Set the value according to your durability/availability tradeoff. If the parameter is not set or is <= 0, master_stop_timeout does not apply.
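A rough illustration of that behavior; the callables are hypothetical stand-ins and the synchronous_mode precondition is assumed to be checked by the caller:
```python
import time

def stop_with_timeout(stop_postgres, postmaster_is_running, kill_postmaster,
                      master_stop_timeout):
    """Ask for a normal stop, then escalate to SIGKILL after the timeout."""
    stop_postgres()
    deadline = (time.time() + master_stop_timeout
                if master_stop_timeout and master_stop_timeout > 0 else None)
    while postmaster_is_running():
        if deadline is not None and time.time() >= deadline:
            kill_postmaster()  # SIGKILL the postmaster after master_stop_timeout seconds
            break
        time.sleep(1)
```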
2020-04-15 12:18:49 +02:00
Alexander Kukushkin
613634c26b Reset rewind state if postgres started after successful pg_rewind (#1408)
Close https://github.com/zalando/patroni/issues/1406
2020-02-27 12:24:17 +01:00
Alexander Kukushkin
4a29caa9d3 On role change callback didn't fire on failed primary (#1420)
Bug was introduced in https://github.com/zalando/patroni/pull/703
Close https://github.com/zalando/patroni/issues/1418
2020-02-27 12:22:44 +01:00
Alexander Kukushkin
16d1ffdde7 Update timeline on standby cluster (#1332)
Fixes https://github.com/zalando/patroni/issues/1031
2019-12-20 12:56:00 +01:00
Igor Yanchenko
726ee46111 Implemented patroni --version (#1291)
That required a refactoring of the `Config` and `Patroni` classes. Now one has to explicitly create the `Config` instance before creating `Patroni`.

The Config file can optionally call the validate function.
2019-12-02 12:14:19 +01:00
Alexander Kukushkin
412c720d3a Avoid importing all DCS modules (#1286)
We will try to import only the module which has a configuration section.
I.e. if there is only zookeeper section in the config, Patroni will try to import only `patroni.dcs.zookeeper` and skip `etcd`, `consul`, and `kubernetes`.
This approach has two benefits:
1. When there are no dependencies installed Patroni was showing INFO messages `Failed to import smth`, which looks scary.
2. It reduces memory usage, because sometimes dependencies are heavy.
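A hedged sketch of the lazy-import idea (the helper is illustrative, not Patroni's actual loader; the module names mirror the examples above):
```python
import importlib

def load_dcs_module(config):
    """Import only the DCS implementation whose section appears in the config."""
    for name in ('etcd', 'consul', 'zookeeper', 'kubernetes'):
        if name in config:
            return importlib.import_module('patroni.dcs.' + name)
    raise RuntimeError('no DCS section found in the configuration')
```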
2019-11-21 14:39:37 +01:00
Alexander Kukushkin
5ea73d50ed Make it possible to apply some recovery params without restart (#1260)
Starting from PostgreSQL 12 the following recovery parameters can be changed without a restart, but Patroni didn't yet support that:
* archive_cleanup_command
* promote_trigger_file
* recovery_end_command
* recovery_min_apply_delay

In future postgres releases this list will be extended and Patroni will support it automatically.
2019-11-11 16:18:23 +01:00
Alexander Kukushkin
863aed314b Fix race conditions in async actions (#1215)
Specifically, there was a chance that `patronictl reinit --force` was overridden by a recover action, and we ended up in a situation where Patroni was trying to start Postgres while the basebackup was still running.
2019-10-11 10:17:02 +02:00
Alexander Kukushkin
b666f5e4ed Refactor Patroni REST API communication (#1197)
* make it possible to use client certificates with REST API
* define a separate PatroniRequest class which handles all communication
* refactor patronictl to use the new class
* make Ha use the new class instead of calling requests.get. The old call wasn't taking certificates and basic-auth into account

Close #898
2019-10-11 10:16:33 +02:00
Alexander Kukushkin
1572c02ced Use passfile in the primary_conninfo instead of password (#1194)
Fixed a few minor issues related to the #1134 and #1122
Close https://github.com/zalando/patroni/issues/1185
2019-10-09 18:04:14 +02:00
Alexander Kukushkin
a4bd6a9b4b Refactor postgresql class (#1060)
* Convert postgresql.py into a package
* Factor out cancellable process into a separate class
* Factor out connection handler into a separate class
* Move postmaster into postgresql package
* Factor out pg_rewind into a separate class
* Factor out bootstrap into a separate class
* Factor out slots handler into a separate class
* Factor out postgresql config handler into a separate class
* Move callback_executor into postgresql package

This is just a careful refactoring, without functional changes.
2019-05-21 16:02:47 +02:00
Alexander Kukushkin
e54dfa508d Consider sync node as healthy even when the former leader is ahead (#1059)
Fixes https://github.com/zalando/patroni/issues/1054
2019-05-13 16:32:53 +02:00
Alexander Kukushkin
4b48653d09 More standby cluster bugfixes (#1053)
1. use the default port 5432 when only standby_cluster.host is defined
2. check whether a standby_cluster replica can be bootstrapped without a connection to the standby_cluster leader against the `create_replica_methods` defined in the `standby_cluster` config instead of the `postgresql` section.
3. Don't fall back to the create_replica_methods defined in the `postgresql` section when bootstrapping a member of the standby cluster.
4. Make sure we specify the database when connecting to the leader.
2019-05-13 14:19:22 +02:00
Alexander Kukushkin
bba9066315 Make it possible to run pg_rewind without superuser on pg11+ (#1035)
* expose the current patroni version in DCS
* expose `checkpoint_after_promote` flag in DCS as an indicator that pg_rewind could be safely executed
* other nodes will wait until this flag is set instead of connecting as superuser and issuing the CHECKPOINT
* define `postgresql.authentication.rewind` with credentials for pg_rewind in the Patroni configuration files.
* create user for pg_rewind if postgres is 11+
* grant execute on functions required for pg_rewind to rewind user
2019-05-02 14:07:26 +02:00
Alexander Kukushkin
f0b784fe7f Manage pg_ident.conf with Patroni (#1037)
This functionality works similarly to the `pg_hba`:
If `postgresql.pg_ident` is defined in the config file or DCS, Patroni will write its value to pg_ident.conf; however, if `postgresql.parameters.ident_file` is defined, Patroni will assume that pg_ident is managed externally and will not update the file.
2019-04-23 16:16:53 +02:00
Alexander Kukushkin
e38fe78b56 Fix callbacks behavior (mostly for standby cluster) (#998)
First of all, this patch changes the behavior of the `on_start`/`on_restart` callbacks: they will be called only when Postgres is started or restarted without a role change. If the member is promoted or demoted, only the `on_role_change` callback will be executed. Before this change, `on_role_change` was never called for the standby leader; only `on_start`/`on_restart` were, and with a wrong role argument.

In addition to that, the REST API will return standby_leader role for the leader of the standby cluster.

Closes https://github.com/zalando/patroni/issues/988
2019-03-29 10:28:07 +01:00