Commit Graph

259 Commits

Author SHA1 Message Date
Alexander Kukushkin
bda07fa526 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2024-04-02 12:10:17 +02:00
Polina Bungina
9b237b332e Set global_config from dynamic_config if DCS data is empty (#3038)
Fix the oversight of 193c73f
We need to set global config from the local cache if cluster.config is not initialized.
If there is nothing written into the DCS (yet), we need the setup info for the decision making (e.g., if it is a standby cluster)
2024-03-28 08:15:58 +01:00
Grigory Smolkin
b09af642e6 Disable WAL streaming on standby node via new boolean tag "nostream" (#2842)
Add support for ``nostream`` tag. If set to ``true`` the node will not use replication protocol to stream WAL. It will rely instead on archive recovery (if ``restore_command`` is configured) and ``pg_wal``/``pg_xlog`` polling. It also disables copying and synchronization of permanent logical replication slots on the node itself and all its cascading replicas. Setting this tag on primary node has no effect.
2024-03-20 10:10:53 +01:00
zhjwpku
e131065d74 rename citus_handler to mpp_handler (#2991)
Obey the following five meanings of the term _cluster_ in Patroni.

1. PostgreSQL cluster: a cluster of PostgreSQL instances which have the same system identifier.
2. MPP cluster: a cluster of PostgreSQL clusters in which one acts as the coordinator and the others act as workers.
3. Coordinator cluster: a PostgreSQL cluster which acts in the role of 'coordinator' within an MPP cluster.
4. Worker cluster: a PostgreSQL cluster which acts in the role of 'worker' within an MPP cluster.
5. Patroni cluster: any cluster managed by Patroni can be called a Patroni cluster, but we usually use this term to refer to a single PostgreSQL cluster or an MPP cluster.
2024-02-28 06:16:20 +01:00
Alexander Kukushkin
59ecfb1799 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2024-01-05 10:22:35 +01:00
Alexander Kukushkin
bcfd8438a5 Abstract CitusHandler and decouple it from configuration (#2950)
The main issue was that the configuration for the Citus handler and for DCS existed in two places, while ideally AbstractDCS should not know many details about what kind of MPP is in use.

To solve the problem we first dynamically create an object implementing AbstractMPP interfaces, which is a configuration for DCS. Later this object is used to instantiate the class implementing AbstractMPPHandler interface.

This is just a starting point, which does some heavy lifting. As a next step, all kinds of variables named after Citus in files other than patroni/postgres/mpp/citus.py should be renamed.

In other words this commit takes over the most complex part of #2940, which was never implemented.

Co-authored-by: zhjwpku <zhjwpku@gmail.com>
2023-12-21 08:58:26 +01:00
Alexander Kukushkin
0e6a2ff3a9 Don't let replica restore initialize key when DCS was wiped (#2970)
It was happening in the branch where Patroni was supposed to complain about converting a standalone PG cluster to be governed by Patroni and exit.
2023-12-05 08:30:20 +01:00
Alexander Kukushkin
13cc86f851 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-11-24 14:42:15 +01:00
Alexander Kukushkin
193c73f6b8 Make GlobalConfig really global (#2935)
1. extract `GlobalConfig` class to its own module
2. make the module instantiate the `GlobalConfig` object on load and replace its entry in `sys.modules` with this instance
3. don't pass `GlobalConfig` object around, but use `patroni.global_config` module everywhere.
4. move `ignore_slots_matchers`, `max_timelines_history`,  and `permanent_slots` from `ClusterConfig` to `GlobalConfig`.
5. add `use_slots` property to global_config and remove duplicated code from `Cluster` and `Postgresql.ConfigHandler`.

Besides that, improve the readability of a couple of checks in ha.py and the formatting of the `/config` key when saved from patronictl.
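
Step 2 relies on the fact that `import` returns whatever object is registered in `sys.modules`. A minimal sketch of that trick follows; the class body, the `synchronous_mode` property and the module name `demo_global_config` are illustrative assumptions, not the code from this commit.

```python
import sys


class GlobalConfig:
    """Hypothetical stand-in that holds cluster-wide settings."""

    def __init__(self) -> None:
        self._config = {}

    def update(self, config: dict) -> None:
        # Replace the cached settings with a copy of the new ones.
        self._config = dict(config)

    @property
    def synchronous_mode(self) -> bool:
        return bool(self._config.get("synchronous_mode"))


# Inside a real module this would read `sys.modules[__name__] = GlobalConfig()`.
# A made-up module name is used here so that the sketch runs as one script.
sys.modules["demo_global_config"] = GlobalConfig()

import demo_global_config  # resolves to the instance registered above

demo_global_config.update({"synchronous_mode": True})
print(demo_global_config.synchronous_mode)
```

Every importer then shares the same instance, so the object no longer has to be passed around.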
2023-11-24 09:26:05 +01:00
Alexander Kukushkin
552e8643d9 Verify that replica nodes received checkpoint LSN on shutdown (#2939)
If archiving is enabled, the `Postgresql.latest_checkpoint_location()` method returns the LSN of the prev (SWITCH) record, which points to the beginning of the WAL file. This makes it possible to safely promote a replica which recovers WAL files from the archive and wasn't streaming when the primary was stopped (the primary doesn't archive this WAL file).

But in certain cases using the LSN pointing to the SWITCH record caused an unnecessary pg_rewind, if the replica didn't manage to replay the shutdown checkpoint record before it was promoted.

In order to mitigate the problem we need to check that replica received/replayed exactly the shutdown checkpoint LSN. But, at the same time we will still write LSN of the SWITCH record to the `/status` key when releasing the leader lock.
2023-11-07 11:05:54 +01:00
Alexander Kukushkin
94e128c51a Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-10-25 10:43:21 +02:00
Mark Pekala
f5ee67fa1c Feature: failover priority (#2780)
The priority is configured with the `failover_priority` tag. Possible values range from `0` to infinity, where `0` means that the node will never become the leader, which is the same as the `nofailover` tag set to `true`. As a result, in the configuration file one should set only one of the `failover_priority` or `nofailover` tags.

The failover priority kicks in only when more than one node has the same receive/replay LSN and they are ahead of the other nodes in the cluster. In this case the node with the higher value of `failover_priority` is preferred. If there is a node with a higher receive/replay LSN, it will become the new leader even if it has a lower value of `failover_priority` (except when the priority is set to 0).
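
The ordering described above can be sketched as a single comparison. This is only an illustration: the `Candidate` type and `pick_leader` function are hypothetical, and Patroni's actual leader race is distributed rather than decided by one central function.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    lsn: int                # receive/replay LSN as an integer
    failover_priority: int  # 0 means "never become the leader"


def pick_leader(candidates: List[Candidate]) -> Optional[Candidate]:
    # Nodes with priority 0 are excluded, like `nofailover: true`.
    eligible = [c for c in candidates if c.failover_priority > 0]
    if not eligible:
        return None
    # The LSN dominates; the priority only breaks ties between
    # equally advanced nodes.
    return max(eligible, key=lambda c: (c.lsn, c.failover_priority))


nodes = [
    Candidate("node1", lsn=100, failover_priority=1),
    Candidate("node2", lsn=100, failover_priority=5),
    Candidate("node3", lsn=90, failover_priority=10),
    Candidate("node4", lsn=200, failover_priority=0),
]
print(pick_leader(nodes).name)
```

Here `node4` is skipped despite being the most advanced, and `node2` wins the tie with `node1` on priority.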

Close https://github.com/zalando/patroni/issues/2759
2023-10-24 12:22:48 +02:00
Alexander Kukushkin
f32989124c Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-10-23 15:28:54 +02:00
Alexander Kukushkin
c5fffb3c97 Further work on permanent physical slots (#2891)
- Fixed issues with the has_permanent_slots() method. It didn't take into account the case of permanent physical slots for members, falsely concluding that there are no permanent slots.
- Write to the status key only LSNs for permanent slots (not just for slots that exist on the primary).
  - Include pg_current_wal_flush_lsn() to slots feedback, so that slots on standby nodes could be advanced
- Improved behave tests:
  - Verify that permanent slots are properly created on standby nodes
  - Verify that permanent slots are properly advanced, including DCS failsafe mode
  - Verify that only permanent slots are written to the `/status`
2023-10-23 08:24:28 +02:00
Alexander Kukushkin
2f6678e4a7 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-09-26 12:01:05 +02:00
Alexander Kukushkin
c855b0bff9 Detect and solve inconsistency between /sync and actual sync nodes (#2877)
Patroni changes `synchronous_standby_names` and the `/sync` key in a very specific order: first we add nodes to `synchronous_standby_names`, and only after they are recognized as synchronous are they added to the `/sync` key. When removing nodes the order is different: they are first removed from the `/sync` key and only after that from `synchronous_standby_names`.

As a result Patroni expects that either the actual synchronous nodes will match the nodes listed in the `/sync` key, or that new candidates for synchronous nodes will not match the nodes listed in the `/sync` key. If `synchronous_standby_names` was removed from `postgresql.conf`, manually or due to the bug (#2876), the state becomes inconsistent because of the wrong order of updates.

To resolve the inconsistent state we introduce additional checks and update the `/sync` key with the actual names of synchronous nodes (usually an empty set).
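
A toy model of this ordering and of the repair step is sketched below. The `SyncState` class and its method names are hypothetical, and the wait for a node to be reported as synchronous is collapsed into a comment.

```python
from typing import Set


class SyncState:
    """Toy model of the two places where the sync node list lives."""

    def __init__(self) -> None:
        self.standby_names: Set[str] = set()  # synchronous_standby_names
        self.sync_key: Set[str] = set()       # members listed in /sync

    def add_node(self, name: str) -> None:
        # Adding: PostgreSQL first, the /sync key second. The real flow waits
        # until the node is reported as synchronous between the two steps.
        self.standby_names.add(name)
        self.sync_key.add(name)

    def remove_node(self, name: str) -> None:
        # Removing: the /sync key first, PostgreSQL second.
        self.sync_key.discard(name)
        self.standby_names.discard(name)

    def is_consistent(self) -> bool:
        # Every node promised in /sync must actually be synchronous.
        return self.sync_key <= self.standby_names

    def repair(self) -> None:
        # Rewrite /sync with the nodes that are actually synchronous.
        if not self.is_consistent():
            self.sync_key = set(self.standby_names)


state = SyncState()
state.add_node("node2")
state.standby_names.clear()  # the setting vanished from postgresql.conf
state.repair()
print(state.sync_key)
```

After the setting disappears, `/sync` still lists `node2`; the repair step resets it to the empty set.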
2023-09-26 11:14:20 +02:00
Alexander Kukushkin
9fccc058bf Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-09-14 15:56:55 +02:00
Alexander Kukushkin
238b8db91e Introduce Status class (#2853)
It represents the `/status` key in DCS and makes it easier to introduce new values stored in the `/status` key without the need to refactor all DCS implementations.
2023-09-14 14:40:44 +02:00
Alexander Kukushkin
d1dff78326 Rename methods in unit tests to match names of methods we are testing 2023-09-12 09:11:57 +02:00
Alexander Kukushkin
67612f5667 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-09-12 09:07:11 +02:00
Polina Bungina
b31a4d55c9 Ensure strict failover/switchover definition difference (#2784)
- Don't set leader in failover key from patronictl failover
- Show warning and execute switchover if leader option is provided for patronictl failover command
- Be more precise in the log messages
- Allow to failover to an async candidate in sync mode
- Check if candidate is the same as the leader specified in api
- Fix and extend some tests
- Add documentation
2023-09-12 08:51:17 +02:00
Alexander Kukushkin
8e24d72f98 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-08-22 10:07:35 +02:00
Alexander Kukushkin
3333e78500 Factor out tags handling into a dedicated class (#2823)
The same (almost) logic was used in three different places:
1. `Patroni` class
2. `Member` class
3. `_MemberStatus` class

Now they all inherit from the newly introduced `Tags` class.
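
The refactoring pattern can be sketched as a base class that derives helper properties from an abstract `tags` mapping. The bodies below are assumptions for illustration (the `Member` here is a simplified stand-in), not the code from this commit.

```python
import abc
from typing import Any, Dict


class Tags(abc.ABC):
    """Shared tag helpers; subclasses only say where the tags come from."""

    @property
    @abc.abstractmethod
    def tags(self) -> Dict[str, Any]:
        """Return the raw tags mapping."""

    @property
    def nofailover(self) -> bool:
        return bool(self.tags.get("nofailover", False))

    @property
    def nosync(self) -> bool:
        return bool(self.tags.get("nosync", False))


class Member(Tags):
    """Simplified stand-in for a cluster member read from DCS."""

    def __init__(self, data: Dict[str, Any]) -> None:
        self._data = data

    @property
    def tags(self) -> Dict[str, Any]:
        return self._data.get("tags", {})


member = Member({"tags": {"nofailover": True}})
print(member.nofailover, member.nosync)
```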
2023-08-21 17:03:14 +02:00
Alexander Kukushkin
f51309d03e Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-08-18 15:12:28 +02:00
Alexander Kukushkin
704d36815a Explicitly enable synchronous mode (#2820)
Close https://github.com/zalando/patroni/issues/2819

Co-authored-by: Polina Bungina <27892524+hughcapet@users.noreply.github.com>
2023-08-17 12:33:15 +02:00
Alexander Kukushkin
f5cb888f80 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-08-14 17:06:42 +02:00
Alexander Kukushkin
efaba9f183 Rename Postgresql.is_leader() to is_primary() (#2809)
It'll help to avoid confusion with the Ha.is_leader() method.
2023-08-09 14:47:53 +02:00
Alexander Kukushkin
74b89d740f Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-08-08 16:14:46 +02:00
Polina Bungina
f24db395c6 Refactor is_failover_possible() (#2804)
* Refactor is_failover_possible()

Move all the members filtering inside the function.

* Remove check_synchronous parameter
* Add sync_mode_is_active() method and use it everywhere where it is appropriate
* Reduce nesting

---------

Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
2023-08-08 11:50:02 +02:00
Alexander Kukushkin
d6e3f25bd2 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-08-08 11:27:11 +02:00
Alexander Kukushkin
01d07f86cd Set permissions for files and directories created in PGDATA (#2781)
Postgres supports two types of permissions:
1. owner only
2. group readable

By default the first one is used because it provides better security. But, sometimes people want to run a backup tool with the user that is different from postgres. In this case the second option becomes very useful. Unfortunately it didn't work correctly because Patroni was creating files with owner access only permissions.

This PR changes the behavior: permissions on files and directories created by Patroni are now calculated based on the permissions of PGDATA. I.e., they will get group-readable access when it is necessary.
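
A minimal sketch of deriving the modes from PGDATA follows. The function name is hypothetical; the mode values are the ones PostgreSQL itself uses for owner-only and group-readable data directories.

```python
import stat
from typing import Tuple


def modes_for_pgdata(pgdata_mode: int) -> Tuple[int, int]:
    """Return (directory_mode, file_mode) derived from the PGDATA mode bits."""
    if pgdata_mode & stat.S_IRGRP:
        # Group-readable data directory: 0750 for directories, 0640 for files.
        return 0o750, 0o640
    # Owner-only data directory: 0700 for directories, 0600 for files.
    return 0o700, 0o600


dir_mode, file_mode = modes_for_pgdata(0o750)
print(oct(dir_mode), oct(file_mode))
```

In practice the argument would come from something like `stat.S_IMODE(os.stat(pgdata).st_mode)`.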

Close #1899
Close #1901
2023-08-02 13:15:43 +02:00
Alexander Kukushkin
1a0549d540 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-07-31 16:26:10 +02:00
Alexander Kukushkin
01976ec10b Don't allow stale primary to win the leader race (#2787)
Consider the following situation:
1. node1 is stressed so much that the Patroni heart-beat can't run regularly and the leader lock expires.
2. node2 notices that there is no leader, gets the lock, promotes, and gets into the situation described in 1.
3. Patroni on node1 finally wakes up, notices that Postgres is running as a primary, but without a leader lock, and "happily" acquires the lock.

That is, node1 discarded the promotion of node2, and after that it will not be possible to join node2 back to the cluster, because pg_rewind is not possible when two nodes are on the same timeline.

To partially mitigate the problem we introduce an additional timeline check. If postgres is running as primary, Patroni will consider it a perfect candidate only if its timeline isn't behind the last known cluster timeline recorded in the `/history` key. If the postgres timeline is behind the cluster timeline, postgres will be demoted to read-only. Further behavior depends on the `maximum_lag_on_failover` and `check_timeline` settings.
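
The check can be sketched as follows. Both function names are hypothetical. The sketch assumes that each history entry is a `(timeline, switch_lsn)` pair for a finished timeline, so the current cluster timeline is the last recorded one plus one.

```python
from typing import List, Tuple


def cluster_timeline(history: List[Tuple[int, int]]) -> int:
    # Assumption: entries describe finished timelines, so the cluster
    # is currently on the last recorded timeline plus one.
    return history[-1][0] + 1 if history else 1


def is_perfect_candidate(postgres_timeline: int,
                         history: List[Tuple[int, int]]) -> bool:
    # A running primary is trusted only if it is not behind the cluster.
    return postgres_timeline >= cluster_timeline(history)


# node2 was promoted, which ended timeline 1; node1 is still on timeline 1.
history = [(1, 50331648)]
print(is_perfect_candidate(1, history), is_perfect_candidate(2, history))
```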

Since the `/history` key isn't updated instantly after promotion, there is still a short period of time when the issue could happen, but it seems that it is close to impossible to make it more reliable.

Close https://github.com/zalando/patroni/issues/2779
2023-07-31 11:22:18 +02:00
Alexander Kukushkin
7e89583ec7 Please new flake8 (#2789)
it stopped liking lack of space character between `,` and `\`
```python
foo,\
    bar
```
2023-07-31 09:08:46 +02:00
Alexander Kukushkin
538d621fed Address review feedback 2023-07-31 07:12:54 +02:00
Alexander Kukushkin
aa0c32167d Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-07-26 07:35:12 +02:00
Alexander Kukushkin
ae2bbd28ae Fix in_recovery check (#2773)
The primary that is still alive wasn't properly recognized.
Regression was introduced in #2652
2023-07-25 11:50:40 +02:00
Alexander Kukushkin
768c9ebb04 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-07-25 08:42:23 +02:00
Matt Baker
817f39ad6d Refactor get_dcs (#2747)
Now uses generators instead of for loops and implements importing modules once.
2023-07-25 08:01:35 +02:00
Alexander Kukushkin
e6d251bda0 Limit time spent in _process_quorum_replication by loop_wait seconds 2023-07-18 14:53:55 +02:00
Alexander Kukushkin
b0d8b21d49 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-07-13 12:55:39 +02:00
Alexander Kukushkin
d46ca88e6b Make it visible replication state on standbys (#2733)
To do that we use `pg_stat_get_wal_receiver()` function, which is available since 9.6. For older versions the `patronictl list` output and REST API responses remain as before.

If there is no wal receiver process, we check if `restore_command` is set and show the state as `in archive recovery`.
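
The decision can be sketched as a small function. The function and parameter names are hypothetical: `wal_receiver_status` stands for the status reported by `pg_stat_get_wal_receiver()`, or `None` when there is no wal receiver process. Only the two states named in this message are modelled.

```python
from typing import Optional


def replication_state(wal_receiver_status: Optional[str],
                      restore_command: Optional[str]) -> Optional[str]:
    if wal_receiver_status == "streaming":
        return "streaming"
    if wal_receiver_status is None and restore_command:
        # No wal receiver, but WAL can still arrive from the archive.
        return "in archive recovery"
    # Anything else is left unreported in this sketch.
    return None


print(replication_state(None, "cp /archive/%f %p"))
```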

Example of `patronictl list` output:
```bash
$ patronictl list
+ Cluster: batman -------------+---------+---------------------+----+-----------+
| Member      | Host           | Role    | State               | TL | Lag in MB |
+-------------+----------------+---------+---------------------+----+-----------+
| postgresql0 | 127.0.0.1:5432 | Leader  | running             | 12 |           |
| postgresql1 | 127.0.0.1:5433 | Replica | in archive recovery | 12 |         0 |
+-------------+----------------+---------+---------------------+----+-----------+

$ patronictl list
+ Cluster: batman -------------+---------+-----------+----+-----------+
| Member      | Host           | Role    | State     | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgresql0 | 127.0.0.1:5432 | Leader  | running   | 12 |           |
| postgresql1 | 127.0.0.1:5433 | Replica | streaming | 12 |         0 |
+-------------+----------------+---------+-----------+----+-----------+
```

Example of REST API response:
```bash
$ curl -s localhost:8009 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-07-06 13:12:00.595118+02:00",
  "role": "replica",
  "server_version": 150003,
  "xlog": {
    "received_location": 335544480,
    "replayed_location": 335544480,
    "replayed_timestamp": null,
    "paused": false
  },
  "timeline": 12,
  "replication_state": "in archive recovery",
  "dcs_last_seen": 1688642069,
  "database_system_identifier": "7252327498286490579",
  "patroni": {
    "version": "3.0.3",
    "scope": "batman"
  }
}

$ curl -s localhost:8009 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-07-06 13:12:00.595118+02:00",
  "role": "replica",
  "server_version": 150003,
  "xlog": {
    "received_location": 335544816,
    "replayed_location": 335544816,
    "replayed_timestamp": null,
    "paused": false
  },
  "timeline": 12,
  "replication_state": "streaming",
  "dcs_last_seen": 1688642089,
  "database_system_identifier": "7252327498286490579",
  "patroni": {
    "version": "3.0.3",
    "scope": "batman"
  }
}
```
2023-07-13 09:24:20 +02:00
Alexander Kukushkin
e4fe239a9d A few fixes in synchronous_mode (#2741)
- make sure that physical replication slots are created even before the promote happened (when async executor is busy with promote).
- execute `txid_current()` with `synchronous_commit=off` so it doesn't accidentally wait for absent synchronous standbys when `synchronous_mode_strict` is enabled and `synchronous_standby_names=*`. These standbys can't connect because replication slots weren't there.
- `synchronous_standby_names` wasn't set to `*` after bootstrap with `synchronous_mode` and `synchronous_mode_strict`.
- add `-c statement_timeout=0` to `PGOPTIONS` when executing `post_bootstrap` script.

Close https://github.com/zalando/patroni/issues/2738
2023-07-12 09:43:40 +02:00
Alexander Kukushkin
6e96db173f Start postgres not in recovery in some cases (#2726)
If we know for sure that a few moments ago postgres was still running as a primary, and we still have the leader lock and can successfully update it, we can safely start postgres back up not in recovery. That avoids bumping the timeline without a reason and hopefully improves reliability, because it addresses issues similar to #2720.

In addition to that remove `if self.state_handler.is_starting()` check from the `recover()` method. This branch could never be reached because the `starting` state is handled earlier in the `_run_cycle()`. Besides that remove redundant `self._crash_recovery_executed`.

P.S. now we do not cover cases when Patroni was killed along with Postgres.
Let's consider that we just started Patroni, there is no leader, and `pg_controldata` reports `Database cluster state` as `shut down`. It feels logical to use `Latest checkpoint location` and `Latest checkpoint's TimeLineID` to do a usual leader race and start directly as a primary, but it could be totally wrong. The thing is that we run `postgres --single` if the standby wasn't shut down cleanly before executing `pg_rewind`. As a result `Database cluster state` transitions from `in archive recovery` to `shut down`, but if such a node becomes a leader the timeline must be increased.
2023-07-12 09:42:34 +02:00
Alexander Kukushkin
8f60b18f03 Delay _process_quorum_replication by loop_wait seconds after promote
It takes some time for existing standbys to start streaming from the
new primary and we want to do our best to not empty the /sync key before
that.
2023-05-23 14:01:58 +02:00
Alexander Kukushkin
f298921315 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-05-23 09:59:16 +02:00
Alexander Kukushkin
66a0e44371 Enable pyright job for every commit (#2675)
And fix the remaining issues so that the job doesn't fail.
2023-05-15 11:38:40 +02:00
Alexander Kukushkin
e97d2f0999 Implement synchronous_mode=quorum 2023-05-11 12:18:08 +02:00
Alexander Kukushkin
f5f0adba14 Compatibility with future synchronous_mode=quorum 2023-05-11 12:16:52 +02:00
Alexander Kukushkin
ea019ba549 Adapt SyncHandler interfaces for quorum commit 2023-05-11 11:20:36 +02:00