1. Stop using the same cursor all the time; it creates problems when it is not carefully used from different threads.
2. Introduce a query() method in the Connection class and make it return a result set when possible (see the sketch below).
3. Refactor most of the code that relies (directly or indirectly) on the Connection object to use the query() method as much as possible.
This refactoring reduces code complexity and will help with the future introduction of a separate database connection for the REST API thread. The latter will improve reliability when the system is under significant stress, simple monitoring queries take seconds to execute, and the REST API starts blocking the main thread.
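A minimal sketch of what such a `query()` method could look like, assuming psycopg2 and a lazily (re)established connection (class layout and names here are illustrative assumptions, not the actual Patroni implementation):
```python
import psycopg2


class Connection:
    """Illustrative sketch of a connection wrapper exposing query()."""

    def __init__(self, dsn):
        self._dsn = dsn
        self._connection = None

    def get(self):
        # (Re)establish the connection lazily instead of caching a cursor.
        if not self._connection or self._connection.closed != 0:
            self._connection = psycopg2.connect(self._dsn)
            self._connection.autocommit = True
        return self._connection

    def query(self, sql, *params):
        """Execute sql and return the result set when there is one."""
        with self.get().cursor() as cursor:
            cursor.execute(sql, params or None)
            # Only statements that produce rows have a cursor.description.
            return cursor.fetchall() if cursor.description is not None else None
```
Every caller gets its own short-lived cursor, so no cursor object is shared between threads.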
To do that we use the `pg_stat_get_wal_receiver()` function, which is available since PostgreSQL 9.6. For older versions the `patronictl list` output and REST API responses remain as before.
If there is no WAL receiver process, we check whether `restore_command` is set and, if so, show the state as `in archive recovery`.
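A rough sketch of how this state could be derived (a hypothetical helper; the `query` callable and the `restore_command` value are assumed inputs, not the actual Patroni code):
```python
def replication_state(query, restore_command):
    # pg_stat_get_wal_receiver() exists since PostgreSQL 9.6 and reports the
    # status of the WAL receiver process (NULLs when there is none).
    rows = query("SELECT status FROM pg_catalog.pg_stat_get_wal_receiver()")
    if rows and rows[0][0] == 'streaming':
        return 'streaming'
    # No streaming WAL receiver: if restore_command is set, the replica is
    # most likely replaying WAL files fetched from the archive.
    if restore_command:
        return 'in archive recovery'
    return None  # keep the previous behavior/output
```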
Example of `patronictl list` output:
```bash
$ patronictl list
+ Cluster: batman -------------+---------+---------------------+----+-----------+
| Member      | Host           | Role    | State               | TL | Lag in MB |
+-------------+----------------+---------+---------------------+----+-----------+
| postgresql0 | 127.0.0.1:5432 | Leader  | running             | 12 |           |
| postgresql1 | 127.0.0.1:5433 | Replica | in archive recovery | 12 |         0 |
+-------------+----------------+---------+---------------------+----+-----------+
$ patronictl list
+ Cluster: batman -------------+---------+-----------+----+-----------+
| Member      | Host           | Role    | State     | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgresql0 | 127.0.0.1:5432 | Leader  | running   | 12 |           |
| postgresql1 | 127.0.0.1:5433 | Replica | streaming | 12 |         0 |
+-------------+----------------+---------+-----------+----+-----------+
```
Example of REST API response:
```bash
$ curl -s localhost:8009 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-07-06 13:12:00.595118+02:00",
  "role": "replica",
  "server_version": 150003,
  "xlog": {
    "received_location": 335544480,
    "replayed_location": 335544480,
    "replayed_timestamp": null,
    "paused": false
  },
  "timeline": 12,
  "replication_state": "in archive recovery",
  "dcs_last_seen": 1688642069,
  "database_system_identifier": "7252327498286490579",
  "patroni": {
    "version": "3.0.3",
    "scope": "batman"
  }
}
$ curl -s localhost:8009 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-07-06 13:12:00.595118+02:00",
  "role": "replica",
  "server_version": 150003,
  "xlog": {
    "received_location": 335544816,
    "replayed_location": 335544816,
    "replayed_timestamp": null,
    "paused": false
  },
  "timeline": 12,
  "replication_state": "streaming",
  "dcs_last_seen": 1688642089,
  "database_system_identifier": "7252327498286490579",
  "patroni": {
    "version": "3.0.3",
    "scope": "batman"
  }
}
```
* Ignore D401 in flake8-docstrings
* Fix newly reported flake8 issues, ignore the old W503 rule
* Rely on concatenation of adjacent strings
* Format behave scripts
* Reformat ha.py according to new rules
Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
Keep as much backward compatibility as possible.
The following changes were made:
1. All internal checks are performed as `role in ('master', 'primary')` (see the sketch after this list)
2. All internal variables/functions/methods are renamed
3. `GET /metrics` endpoint returns `patroni_primary` in addition to `patroni_master`.
4. Logs are changed to use leader/primary/member/remote depending on the context
5. Unit tests use only role = 'primary' instead of 'master' to verify that item 1 works.
6. patronictl still supports old syntax, but also accepts `--leader` and `--primary`.
7. `master_(start|stop)_timeout` is automatically translated to `primary_(start|stop)_timeout` if the latter is not set.
8. Updated the documentation and some examples.
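A compressed sketch of the compatibility behavior in items 1 and 7 (the helper names here are made up for illustration):
```python
def is_primary(role):
    # Item 1: accept the old and the new role name interchangeably.
    return role in ('master', 'primary')


def translate_timeouts(config):
    # Item 7: fall back to master_(start|stop)_timeout when the primary_*
    # variant is not set, so old configuration files keep working.
    for action in ('start', 'stop'):
        new_key, old_key = f'primary_{action}_timeout', f'master_{action}_timeout'
        if new_key not in config and old_key in config:
            config[new_key] = config[old_key]
    return config
```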
Future plan: in the next major release, switch the role name from `master` to `primary` and maybe drop `master` altogether.
The Kubernetes implementation will require more work and will keep two labels in parallel. Label values should probably be configurable, as described in https://github.com/zalando/patroni/issues/2495.
There is currently a risk of losing some WAL segments entirely in case
archive_mode was set to 'on' before a promotion and there are WALs
with .ready files on the former leader we are trying to rewind. It happens
because of pg_rewind's modus operandi: it simply syncs the contents of the
old leader's pg_wal directory with the new leader's one, including the
deletion of all WALs that are not present on the current leader, regardless
of their archive status on the former one. Thus, if the new leader has
already recycled such files, they are removed entirely.
If archive_mode was set to 'always' and the .ready WALs are actually
present in the archive, it is up to the end user who writes the archive_command
to avoid overwriting and to test it properly.
Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
Patroni was falsely assuming that timelines had diverged.
For pg_rewind it didn't create any problem, but if pg_rewind is not allowed and `remove_data_directory_on_diverged_timelines` is set, it resulted in reinitializing the former leader.
Close https://github.com/zalando/patroni/issues/2220
When restore_command is configured, Postgres tries to fetch/apply all possible WAL segments and also fetches history files in order to select the correct timeline. This could result in a situation where the new history file is missing some timelines.
Example:
- node1 demotes/crashes on timeline 1
- node2 promotes to timeline 2 and archives `00000002.history` and crashes
- node1 recovers as a replica, "replays" `00000002.history` and promotes to timeline 3
As a result, `00000003.history` will not have the line for timeline 2, because node1 never replayed any WAL segment from it.
The `pg_rewind` tool is supposed to handle such a case correctly when rewinding node2 from node1, but Patroni, when deciding whether the rewind should happen, was searching for the exact timeline in the history file from the new primary.
The solution is to assume that rewind is required if the current replica timeline is missing from that history file.
In addition, this PR makes sure that the primary isn't running in recovery before starting the rewind check.
Close https://github.com/zalando/patroni/issues/2118 and https://github.com/zalando/patroni/issues/2124
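A simplified sketch of the new decision rule (data shapes and names are illustrative; switchpoint LSNs are taken as integers, and the replica is assumed to be on an older timeline than the new primary):
```python
def rewind_required(primary_history, replica_timeline, replica_replay_lsn):
    # primary_history: parsed .history lines from the new primary, e.g.
    # [(1, 50331648, 'no recovery target specified'), (2, 67108960, '...')]
    for timeline, switchpoint, _reason in primary_history:
        if timeline == replica_timeline:
            # Exact match: rewind only if we replayed past the switchpoint.
            return replica_replay_lsn > switchpoint
    # The replica's timeline is missing from the primary's history file
    # (the primary never replayed WAL from it) -- assume rewind is required.
    return True
```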
1. Avoid doing CHECKPOINT if `pg_control` is already updated.
2. Explicitly call ensure_checkpoint_after_promote() right after the bootstrap finished successfully.
Effectively, this PR consists of a few changes:
1. The easy part:
If permanent logical slots are defined in the global configuration, Patroni on the primary will not only create them, but also periodically update DCS with the current values of `confirmed_flush_lsn` for all these slots.
In order to reduce the number of interactions with DCS, a new `/status` key was introduced. It contains a JSON object with `optime` and `slots` keys. For backward compatibility, `/optime/leader` will still be updated if there are members running an old Patroni version in the cluster.
2. The tricky part:
On replicas that are eligible for a failover, Patroni creates the logical replication slot by copying the slot file from the primary and restarting the replica. In order to copy the slot file, Patroni opens a connection to the primary with `rewind` or `superuser` credentials and calls the `pg_read_binary_file()` function.
When the logical slot already exists on the replica, Patroni periodically calls the `pg_replication_slot_advance()` function, which moves the slot forward (see the sketch after this list).
3. Additional requirements:
In order to ensure that the primary doesn't clean up tuples from pg_catalog that are required for logical decoding, Patroni enables `hot_standby_feedback` on replicas with logical slots and on cascading replicas if they are used for streaming by replicas with logical slots.
4. When logical slots are copied from the primary to the replica, there is a timeframe during which it might not be safe to use them after promotion. Right now there is no protection against promoting such a replica, but Patroni will show a warning with the names of the slots that might not be safe to use.
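A hedged sketch of the two primitives described in item 2, with the SQL wrapped in Python (the wrapper and query-helper names are assumptions; only `pg_read_binary_file()` and `pg_replication_slot_advance()` are actual PostgreSQL functions):
```python
def copy_slot_file(primary_query, slot_name):
    # Executed on the primary with rewind/superuser credentials: read the raw
    # slot state file so it can be written under pg_replslot/<slot_name>/
    # on the replica, which is then restarted to pick it up.
    return primary_query("SELECT pg_catalog.pg_read_binary_file(%s)",
                         'pg_replslot/{0}/state'.format(slot_name))[0][0]


def advance_slot(replica_query, slot_name, confirmed_flush_lsn):
    # Executed periodically on the replica once the slot exists there:
    # move the copied slot forward to the position the primary reported
    # via the /status key in DCS (requires PostgreSQL 11+).
    replica_query("SELECT pg_catalog.pg_replication_slot_advance(%s, %s)",
                  slot_name, confirmed_flush_lsn)
```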
Compatibility.
The `pg_replication_slot_advance()` function is only available starting from PostgreSQL 11. For older Postgres versions Patroni will refuse to create the logical slot on the primary.
The old "permanent slots" feature, which creates logical slots right after promotion and before allowing connections, was removed.
Close: https://github.com/zalando/patroni/issues/1749
1. If the superuser name is different from postgres, pg_rewind in the standby cluster was failing because the connection string didn't contain the database name.
2. Provide output if the single-user mode recovery failed.
Close https://github.com/zalando/patroni/pull/1736
It could happen that the WAL segment required for `pg_rewind` doesn't exist in `pg_wal` anymore, and therefore `pg_rewind` can't find the checkpoint location before the diverging point.
Starting from PostgreSQL 13, `pg_rewind` can use the `restore_command` for fetching missing WALs, but we can do better than that.
On older PostgreSQL versions Patroni will parse the stdout and stderr of the failed rewind attempt, try to fetch the missing WAL by calling the `restore_command`, and retry the rewind.
We don't need to rewind when:
1. the replayed location of the former replica is not ahead of the switchpoint
2. the end of the checkpoint record on the former primary is the same as the switchpoint
In order to get the end of the checkpoint record we use `pg_waldump` and parse its output (see the sketch below).
Close https://github.com/zalando/patroni/issues/1493
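A rough sketch of that parsing (a hypothetical helper; the invocation details and the regex are mine, not Patroni's, and `pg_waldump` is assumed to be on PATH):
```python
import re
import subprocess


def end_of_checkpoint_record(waldir, timeline, checkpoint_lsn):
    # Dump two records starting at the checkpoint location: the start LSN of
    # the second record is (roughly) where the checkpoint record ends.
    out = subprocess.run(
        ['pg_waldump', '-p', waldir, '-t', str(timeline),
         '-s', checkpoint_lsn, '-n', '2'],
        capture_output=True, text=True, check=False)
    # pg_waldump lines look like:
    #   rmgr: XLOG ... lsn: 0/02000028, prev 0/01FFFFC0, desc: CHECKPOINT_SHUTDOWN ...
    lsns = re.findall(r'lsn: ([0-9A-F]+/[0-9A-F]+)', out.stdout)
    return lsns[1] if len(lsns) > 1 else None
```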
When Patroni is trying to figure out whether pg_rewind is necessary, it could write the contents of the history file from the primary into the log. The history file grows with every failover/switchover and eventually takes up too many lines in the log, most of which are not very useful.
Instead of showing the raw data, we will show only 3 lines before the current replica timeline and 2 lines after.
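Conceptually, the truncation is just a slice around the replica's timeline (a simplified sketch, not the exact Patroni code):
```python
def shorten_history(history, replica_timeline):
    # history: list of (timeline, switchpoint, reason) tuples from the primary.
    # Keep 3 lines before the replica's current timeline and 2 lines after it.
    i = next((n for n, (tl, _lsn, _reason) in enumerate(history)
              if tl >= replica_timeline), len(history))
    return history[max(0, i - 3):i + 3]
```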
Replicas wait for a checkpoint indication via the leader's member key in DCS. The key is normally updated only once per HA loop.
Without waking the main thread up, replicas would have to wait up to `loop_wait` seconds longer than necessary.
It is safe to call pg_rewind on the replica only when pg_control on the primary contains information about the latest timeline. Postgres usually does an immediate checkpoint right after promotion, and in most cases it works just fine. Unfortunately, we regularly receive complaints that it takes too long (minutes) until the checkpoint is done and replicas can't perform a rewind, while doing the checkpoint manually helped immediately. So Patroni now does the same: when the promotion has happened and Postgres is no longer running in recovery, we explicitly issue a checkpoint.
We are intentionally not using the AsyncExecutor here, because we want the HA loop to continue its normal flow.
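A minimal sketch of that explicit checkpoint, assuming a generic `query()` helper (simplified; the real logic also inspects pg_control before deciding):
```python
def checkpoint_after_promote(query):
    # Only issue the CHECKPOINT once promotion has actually completed,
    # i.e. Postgres is no longer running in recovery.
    if not query("SELECT pg_catalog.pg_is_in_recovery()")[0][0]:
        # Runs synchronously in the HA loop (no AsyncExecutor) so that the
        # latest timeline lands in pg_control and replicas can pg_rewind.
        query("CHECKPOINT")
```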
* Convert postgresql.py into a package
* Factor out cancellable process into a separate class
* Factor out connection handler into a separate class
* Move postmaster into postgresql package
* Factor out pg_rewind into a separate class
* Factor out bootstrap into a separate class
* Factor out slots handler into a separate class
* Factor out postgresql config handler into a separate class
* Move callback_executor into postgresql package
This is just a careful refactoring, without functional changes.