1. Unit tests should not really try to access any external resources.
2. Accessing them results in significantly longer execution time of unit tests on Windows.
In addition to that, perform requests with a 3s timeout; usually this is more than enough to figure out whether a resource is accessible.
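A minimal sketch of such a check, assuming a plain `urllib.request` helper is acceptable (the URL and helper name are placeholders):
```python
import urllib.request


def resource_accessible(url: str, timeout: float = 3.0) -> bool:
    """Return True if *url* answers within *timeout* seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:  # URLError is a subclass of OSError
        return False


# Example: skip network-dependent checks when the resource is unreachable.
RUN_NETWORK_CHECKS = resource_accessible('https://example.com')
```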
Followup on #2724
Expanding on the addition of docstrings in the code, this adds Python module API docs to the Sphinx documentation.
A developer can preview what this might look like by running this locally:
```
tox -m docs
```
The option `-W` is added to the tox env so that warning messages are considered errors.
Adds doc generation using the above method to the test GitHub workflow to catch documentation problems on PRs.
Some docstrings have been reformatted and fixed to resolve errors generated with the above setup.
`patronictl` is implemented using the `click` module, and that module uses the functions' docstrings to build the help text.
As a consequence, the docstring of the `ctl` function was being shown to the user, which doesn't make sense.
This PR fixes that issue by adding a user-friendly description to be shown on `patronictl --help`. We use a `\f` marker to tell `click` where to stop capturing text for the help output.
Note that `patronictl` commands are implemented using the `@ctl.command` decorator and we always provide them with a `help` argument. Therefore none of the subcommands are affected by the aforementioned issue, only the entry point of the CLI.
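A minimal sketch of the technique (the group and command below are illustrative, not the actual Patroni code):
```python
import click


@click.group()
def ctl():
    """A short, user-friendly description shown by ``patronictl --help``.

    \f
    Anything after the form-feed marker above stays in the Python docstring
    but is not rendered by click in the help output.
    """


@ctl.command(help='Help for subcommands is passed explicitly, as in Patroni.')
def status():
    """Internal docstring that click ignores because help= is provided."""


if __name__ == '__main__':
    ctl()
```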
References: PAT-201.
As reported by @ants on Slack, it could happen that `received_tli` is ahead of the replayed timeline. Therefore we should stop using it when deciding on pg_rewind while Postgres is running, and instead rely only on `IDENTIFY_SYSTEM` via a replication connection.
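For reference, the primary timeline can be fetched over a replication connection roughly like this (a sketch using psycopg2; the connection parameters are placeholders):
```python
import psycopg2
from psycopg2.extras import PhysicalReplicationConnection

# Placeholder DSN; a real check would use the member's replication credentials.
conn = psycopg2.connect('host=127.0.0.1 port=5432 user=replicator',
                        connection_factory=PhysicalReplicationConnection)
cur = conn.cursor()
cur.execute('IDENTIFY_SYSTEM')
systemid, timeline, xlogpos, dbname = cur.fetchone()
print('timeline reported by the primary:', timeline)
```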
Prior to this commit `patronictl query` was working only if the `-r` argument was provided to the command. Otherwise it would face issues:
* If neither `-r` nor `-m` were provided:
```
$ PGPASSWORD=zalando patronictl -c postgres0.yml query -U postgres -c "SHOW PORT"
2023-09-12 17:45:38 No connection to role=None is available
```
* If only `-m` was provided:
```
$ PGPASSWORD=zalando patronictl -c postgres0.yml query -U postgres -c "SHOW PORT" -m postgresql0
2023-09-12 17:46:15 No connection to member postgresql0 is available
```
This issue was a regression introduced by commit `4c3e0b9382820524239d2aa4d6b95379ef1291db` through PR #2687.
Through that PR we decided to move the common logic that checks the mutual exclusiveness of the `--role` and `--member` arguments to the `get_any_member` function.
However, before that change the `role` variable would assume the default value of `any` in the `query` method before `get_any_member` was called, which was no longer the case after the change.
This commit fixes that issue by handling `role=None` in the `get_cursor` function. As the defaulting of `role` to `any` is handled in a sub-call to `get_any_member`, it is safe for `get_cursor` to return the cursor when `role=None`.
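A rough sketch of the idea (simplified and with illustrative helpers, not the actual implementation):
```python
def get_cursor(cluster, connect_parameters, role=None, member_name=None):
    """Simplified illustration of the fix, not the actual implementation."""
    member = get_any_member(cluster, role=role, member_name=member_name)
    if member is None:
        raise PatroniCtlException('No connection to role={0} is available'.format(role))

    cursor = connect_to_member(member, connect_parameters)  # hypothetical helper

    # role=None is safe here: get_any_member() already treated it as 'any',
    # so any reachable member satisfies the request.
    if role is None or role == 'any':
        return cursor

    # For an explicit role, additionally verify the member's recovery state.
    ...
```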
Unit tests were updated accordingly.
References: PAT-204.
Due to a race condition Patroni was falsely assuming that the standby should be restarted because some recovery parameters (primary_conninfo or similar) were changed.
Close https://github.com/zalando/patroni/issues/2834
When running with the leader lock, Patroni was just setting the `role` label to `master`, so effectively the `kubernetes.standby_leader_label_value` feature never worked.
Now it is fixed, but in order not to introduce breaking changes, the default value of `standby_leader_label_value` is set to `master`.
Cluster.get_replication_slots() didn't take into account that there cannot be logical replication slots on the replicas of a standby cluster. It was only skipping logical slots for the standby leader, while replicas were still expecting that they would have to copy them over.
Also, on replicas in a standby cluster these logical slots were falsely added to the `_replication_slots` dict.
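Conceptually the fix means filtering logical slots for every member of a standby cluster, not just for the standby leader; a simplified sketch (names are illustrative):
```python
def get_replication_slots(permanent_slots, is_standby_cluster):
    """Return the slots a member should own; drop logical slots in a standby cluster."""
    if is_standby_cluster:
        # Logical slots can't exist anywhere in a standby cluster, neither on the
        # standby leader nor on its replicas, so never schedule copying them over.
        return {name: value for name, value in permanent_slots.items()
                if value.get('type') != 'logical'}
    return dict(permanent_slots)
```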
Prior to this commit `IntValidator` would always consider the value `0` invalid, even if it was within the allowed range.
The problem was that `parse_int` was correctly returning `0` in the following line:
```python
value = parse_int(value, self.base_unit) or ""
```
However, since `0` is falsy, the `or ""` part made the whole expression evaluate to an empty string.
As `parse_int` returns either an `int` when it is able to parse the value, or `None` otherwise, an `isinstance(value, int)` check is enough to error out when the value is not a valid `int`.
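A condensed sketch of the corrected check (simplified; `min_val`/`max_val` are illustrative attribute names, not necessarily the real ones):
```python
def validate(self, value) -> bool:
    """Return True if *value* parses to an int and lies within the allowed range."""
    value = parse_int(value, self.base_unit)
    # parse_int() returns an int (possibly 0) on success and None on failure,
    # so isinstance() keeps 0 valid instead of coercing it to "".
    return isinstance(value, int) \
        and (self.min_val is None or value >= self.min_val) \
        and (self.max_val is None or value <= self.max_val)
```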
Closes #2817
Consider the following scenario (with failsafe_mode enabled):
0. node1 (primary) - node2 (replica)
1. Stop all etcd nodes; wait `ttl` seconds; start all etcd nodes (node2's failsafe will contain the info about node1).
2. Switchover to node2 (node2's failsafe still contains the info about node1).
3. Stop all etcd nodes; wait `ttl` seconds; start all etcd nodes.
4. node2 will demote because it considers node1 to be the primary.
Resetting the failsafe state when running as a primary fixes the issue.
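In essence the fix amounts to something like this in the primary code path (the names are hypothetical):
```python
def process_healthy_cluster(self):
    if self.has_lock() and self.state_handler.is_primary():
        # A current primary must not keep the failsafe membership that was
        # captured while it was still a replica; otherwise a later DCS outage
        # makes it believe the old primary is still around.
        self._failsafe.reset()  # hypothetical method clearing cached members
```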
This PR is an attempt at refactoring the docs about migrating to Patroni.
These are a few enhancements that we propose through this PR:
* The docs used to mention that the procedure can only be performed on a single-node cluster. We changed that so the procedure considers a cluster composed of a primary and standbys;
* Teach how to deal with pre-existing replication slots;
* Explain how to create the user for `pg_rewind`, if the user intends to enable `use_pg_rewind`.
References: PAT-143.
Postgres supports two types of permissions:
1. owner only
2. group readable
By default the first one is used because it provides better security. But sometimes people want to run a backup tool as a user different from postgres, and in this case the second option becomes very useful. Unfortunately it didn't work correctly, because Patroni was creating files with owner-only permissions.
This PR changes the behavior: permissions on files and directories created by Patroni will be calculated based on the permissions of PGDATA, i.e. they will get group-readable access when necessary.
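The gist of the calculation as a sketch (the helper name is illustrative; only the file case is shown):
```python
import os
import stat


def file_mode_for_pgdata(pgdata: str) -> int:
    """Derive the mode of files created by Patroni from the PGDATA permissions."""
    pgdata_mode = stat.S_IMODE(os.stat(pgdata).st_mode)
    if pgdata_mode & stat.S_IRWXG:
        # PGDATA was initialized group-readable (e.g. initdb --allow-group-access),
        # so new files get 0640 (and directories would get 0750).
        return 0o640
    return 0o600  # default: owner-only access


print(oct(file_mode_for_pgdata(os.environ.get('PGDATA', '/var/lib/postgresql/data'))))
```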
Close #1899, close #1901
The docs of `slots` configuration used to have this mention:
```
my_slot_name: the name of replication slot. If the permanent slot name
matches with the name of the current primary it will not be created.
Everything else is the responsibility of the operator to make sure that
there are no clashes in names between replication slots automatically
created by Patroni for members and permanent replication slots.
```
However, that is not entirely accurate: Patroni does not check for
clashes between `my_slot_name` and the names of replication slots created
for replicating changes among members. If you specify a slot name that
clashes with the name of a replication slot used by a member, it turns
out Patroni will keep the slot permanent on the primary even if the member
key expires from the DCS.
Through this commit we also enhance the docs to explain that permanent
physical slots are maintained only on the primary, while logical
replication slots are copied from the primary to standbys.
Signed-off-by: Israel Barth Rubio <israel.barth@enterprisedb.com>
The node `postgresql2` in the repository is apparently supposed
to be a cascading standby.
However, if that is the case, there is a typo in the name of its
upstream node.
This commit fixes that issue.
In issue #2735 it was discussed that there should be some warning around PostgreSQL parameters that do not pass validation.
This commit ensures something is logged for parameters that fail validation and therefore fall back to default values.
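The logging side of it could look roughly like this (a sketch; the helper, message, and validator interface are illustrative):
```python
import logging

logger = logging.getLogger(__name__)


def effective_parameter_value(name, value, validator):
    """Return *value* if it passes validation, otherwise warn and use the default."""
    if validator.validate(value):  # hypothetical validator object
        return value
    logger.warning('Value %r of parameter %s does not pass validation, '
                   'falling back to the default', value, name)
    return validator.default  # hypothetical default attribute
```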
Close #2735, close #2740
Consider the following situation:
1. node1 is stressed so much that the Patroni heartbeat can't run regularly and the leader lock expires.
2. node2 notices that there is no leader, gets the lock, promotes, and ends up in the same situation as described in 1.
3. Patroni on node1 finally wakes up, notices that Postgres is running as a primary but without a leader lock, and "happily" acquires the lock.
That is, node1 discarded the promotion of node2, and after that it will not be possible to join node2 back to the cluster, because pg_rewind is not possible when the two nodes are on the same timeline.
To partially mitigate the problem we introduce an additional timeline check. If Postgres is running as a primary, Patroni will consider it a perfect candidate only if its timeline isn't behind the last known cluster timeline recorded in the `/history` key. If the Postgres timeline is behind the cluster timeline, Postgres will be demoted to read-only. Further behavior depends on the `maximum_lag_on_failover` and `check_timeline` settings.
Since the `/history` key isn't updated instantly after promotion, there is still a short period of time when the issue could happen, but it seems that it is close to impossible to make it more reliable.
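Schematically, the additional check amounts to something like this (simplified, with illustrative names):
```python
def primary_timeline_is_current(self, cluster):
    """Illustrative check: a running primary must not be behind the cluster timeline."""
    my_timeline = self.state_handler.get_current_timeline()   # hypothetical helpers
    cluster_timeline = cluster.timeline_from_history_key()    # last TL from /history
    if cluster_timeline and my_timeline < cluster_timeline:
        # Somebody else already promoted to a newer timeline: this node must not
        # "happily" grab the leader lock; it will be demoted to read-only instead.
        return False
    return True
```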
Close https://github.com/zalando/patroni/issues/2779
It was reported in #2779 that the primary was constantly logging messages like `Synchronous replication key updated by someone else`.
It happened after Patroni got stuck due to resource starvation. Key updates are performed using the create_revision/mod_revision fields, whose values are taken from the internal cache. Hence, it is a clear symptom of a stale cache.
Similar issues in the K8s implementation were addressed by invalidating the cache and restarting watcher connections every time an update failed due to a resource_version mismatch, so we do the same for Etcd3.
On Slack a user reported that Patroni didn't run pg_rewind on one of the nodes after coming out of maintenance mode.
Steps that were executed:
0. The initial state: node1 - primary, node2 - replica
1. `patronictl pause`
2. On node2: pg_ctl promote
3. On node1: pg_ctl stop
4. Patroni on node1 notices that Postgres isn't running and removes the leader lock.
5. Patroni on node2 notices that Postgres is running as a primary and takes the leader lock.
6. `patronictl resume`.
After that node1 started saying in logs:
`Waiting for checkpoint on node2 before rewind`.
Repeating these steps may not necessarily reproduce the problem, because presumably pg_rewind failed earlier on node1.
Such a situation was possible because the promote wasn't executed by Patroni, and therefore `Rewind._state` wasn't explicitly reset and the code that ensures a CHECKPOINT is run after promote wasn't triggered.
As a mitigation, the following changes have been made:
1. Retrigger the pg_rewind checks after coming out of maintenance mode.
2. Run the "ensure CHECKPOINT after promote" checks using the `Rewind._state != REWIND_STATUS.CHECKPOINT` condition (see the sketch below). This allowed removing a useless hook from `Postgresql.promote()`.
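In rough terms the second change means something like the following (illustrative, not the actual code):
```python
# Regardless of who executed the promote, run the "ensure CHECKPOINT after
# promote" logic until the rewind state confirms the checkpoint happened.
if postgresql.is_primary() and rewind.state != REWIND_STATUS.CHECKPOINT:
    rewind.ensure_checkpoint_after_promote()  # hypothetical helper name
```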
When there are no permanent logical slots Patroni was updating the `/status` key on every heartbeat loop, even when the LSN on the primary wasn't moving forward.
The issue was introduced in #2485.
1. Take client certificates only from the `ctl` section. Motivation: sometimes there are server-only certificates that can't be used as client certificates. As a result neither Patroni nor patronictl works correctly, even if the `--insecure` option is used.
2. Document that if `restapi.verify_client` is set to `required` then client certificates **must** be provided in the `ctl` section.
3. Add support for `ctl.authentication` and prefer it over `restapi.authentication` (see the sketch after this list).
4. Silence the annoying InsecureRequestWarning when `patronictl -k` is used, so that the behavior becomes similar to `curl -k`.
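A sketch of the intended precedence (the helper name is illustrative):
```python
def ctl_auth_config(config: dict) -> dict:
    """Pick the authentication settings for patronictl, preferring the ctl section."""
    ctl = config.get('ctl') or {}
    restapi = config.get('restapi') or {}
    # ctl.authentication wins; restapi.authentication is only a fallback.
    return ctl.get('authentication') or restapi.get('authentication') or {}
```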
Includes:
* renaming of `_unready_logical_slots` to better represent that it is a processing queue which is emptied on successful completion.
* ensuring that the return type is consistent.
* making logic variable names explicit to help explain how the decision of whether a slot is "ready" is made.
If one of the backends crashes, the postmaster stops all backends and performs crash recovery, because shared memory could be corrupted. If something happens during this phase, Postgres may crash completely, leaving pg_control in the "in crash recovery" state.
Followup on #2726