`patronictl edit-config` requires a pager to show the diff output back to the user. It used to be hard-coded to use either `less` or `more`.
When these tools were not available on the host, `patronictl` would hit an exception in the `ydiff` module and print the stack trace to the console.
This PR changes the `patronictl edit-config` command to behave as follows (a minimal sketch of the pager lookup follows the list):
- If the `PAGER` environment variable is set, attempt to find the corresponding executable.
- If `PAGER` is not set or points to an invalid executable, fall back to `less` or `more`, as before.
- If no executable is found at all, raise a `PatroniCtlException` with a user-friendly message.
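A minimal sketch of that lookup order (illustrative only; assumes `PatroniCtlException` from `patroni.ctl`, not the exact `patronictl` code):
```python
import os
import shutil

from patroni.ctl import PatroniCtlException


def find_pager():
    """Resolve a pager executable: $PAGER first, then less/more."""
    candidates = []
    env_pager = os.environ.get('PAGER')
    if env_pager:
        candidates.append(env_pager)
    candidates += ['less', 'more']

    for candidate in candidates:
        executable = shutil.which(candidate)
        if executable:
            return executable
    raise PatroniCtlException('No pager found: set PAGER or install less/more')
```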
Unit tests in `tests/test_ctl.py` were modified accordingly.
References: PAT-21
Close #2604
It was possible to have it empty if all cluster keys are missing in DCS. In this case the `Cluster` object was manually created with all values set to `None` or `[]` (including sync).
This already resulted in #2217, which in fact wasn't a correct fix.
In order to solve it and reduce code duplication we introduce the `Cluster.empty()` and `SyncState.empty()` methods, which create the corresponding empty objects, and use `Cluster.empty()` in all places where the empty `Cluster` object used to be created manually.
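A minimal sketch of the idea (simplified; the real `Cluster` and `SyncState` are named tuples with more fields):
```python
from typing import NamedTuple, Optional


class SyncState(NamedTuple):
    version: Optional[int]
    leader: Optional[str]
    sync_standby: Optional[str]

    @staticmethod
    def empty(version: Optional[int] = None) -> 'SyncState':
        # the single place that knows what an "empty" sync state looks like
        return SyncState(version, None, None)


# Cluster.empty() is built the same way: it assembles a Cluster from
# empty/None components, including SyncState.empty().
```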
If communication with etcd nodes fails, it is logical to start from scratch, from the nodes that are listed in the config. But it could happen that the config is in fact outdated and all nodes in the real cluster were replaced.
Previously we used to track whether the config file had changed, which turned out not to work in all possible cases.
The new strategy is a bit different: if communication with all nodes fails, we keep the last known topology and at the same time try to figure out the new one by merging the two lists together, the cached list and the list from the config file.
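A rough illustration of the merge (hypothetical helper, not the actual implementation):
```python
def merge_topology(cached_nodes, config_nodes):
    """Keep the cached topology but also retry the nodes listed in the config file."""
    merged = list(cached_nodes)
    for node in config_nodes:
        if node not in merged:
            merged.append(node)
    return merged


# merge_topology(['etcd4:2379', 'etcd5:2379'], ['etcd1:2379', 'etcd4:2379'])
# -> ['etcd4:2379', 'etcd5:2379', 'etcd1:2379']
```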
We made an incorrect assumption that `citus_set_coordinator_host()` would trigger a `pg_dist_node` sync. Instead, we should also use `citus_update_node()` and call `citus_set_coordinator_host()` only during bootstrap.
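A hedged sketch of the resulting rule (illustrative function, `cursor` being a libpq cursor on the coordinator):
```python
def register_coordinator(cursor, host, port, nodeid=None, bootstrap=False):
    if bootstrap or nodeid is None:
        # only during bootstrap, when the coordinator is not yet registered
        cursor.execute('SELECT citus_set_coordinator_host(%s, %s)', (host, port))
    else:
        # citus_update_node() makes the change visible to workers via pg_dist_node
        cursor.execute('SELECT citus_update_node(%s, %s, %s)', (nodeid, host, port))
```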
Adjust behave tests to verify that coordinator failover is visible on workers.
The configuration parameter is `kubernetes.retriable_http_codes`, or the `PATRONI_KUBERNETES_RETRIABLE_HTTP_CODES` environment variable.
These status codes are added to the default list of 500, 503, 504.
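For illustration, the assumed effect on the set of retriable status codes (hypothetical helper, not the actual implementation):
```python
DEFAULT_RETRIABLE_HTTP_CODES = {500, 503, 504}


def retriable_http_codes(configured=None):
    # e.g. configured = [502], coming from kubernetes.retriable_http_codes
    # or from PATRONI_KUBERNETES_RETRIABLE_HTTP_CODES
    return DEFAULT_RETRIABLE_HTTP_CODES | set(configured or ())
```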
Close https://github.com/zalando/patroni/issues/2536
It could happen that Patroni is started up before PGDATA was mounted. In this case Patroni can't determine the major Postgres version from the PG_VERSION file. Later, when PGDATA is mounted, Patroni would try to create recovery.conf even if the actual Postgres major version is newer than 12.
To mitigate the problem we double-check that `Postgresql._major_version` is set before writing the recovery configuration or starting Postgres up.
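Schematically, the guard looks like this (hypothetical function name):
```python
import logging

logger = logging.getLogger(__name__)


def can_write_recovery_conf(major_version: int) -> bool:
    """major_version is e.g. 120000 for v12, or 0 when PG_VERSION could not be read."""
    if not major_version:
        logger.warning('Postgres major version is unknown, PGDATA is probably not mounted yet')
        return False
    return True
```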
Close https://github.com/zalando/patroni/issues/2434
For a long time Patroni has enforced that only one callback script runs at a time. If a new callback is executed while the old one is still running, the old one is killed (including all its child processes).
Such behavior is fine for all callbacks except `on_reload`, because the latter may accidentally cancel important ones that, for example, update DNS or assign/remove a virtual IP.
To mitigate the problem we introduce a dedicated executor for `on_reload` callbacks, so that `on_reload` may only cancel another `on_reload`.
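A toy sketch of the idea (a simplified stand-in for Patroni's real callback executor):
```python
import subprocess
from threading import Lock


class SingleCallbackExecutor:
    """At most one callback script at a time; a new call kills the previous one."""

    def __init__(self):
        self._lock = Lock()
        self._process = None

    def call_nowait(self, cmd):
        with self._lock:
            if self._process and self._process.poll() is None:
                self._process.kill()
            self._process = subprocess.Popen(cmd)


# a separate executor for on_reload means a reload can no longer kill an
# in-flight on_start/on_stop/on_role_change callback
generic_executor = SingleCallbackExecutor()
reload_executor = SingleCallbackExecutor()


def execute_callback(cb_name, cmd):
    executor = reload_executor if cb_name == 'on_reload' else generic_executor
    executor.call_nowait(cmd)
```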
Ref: https://github.com/zalando/patroni/issues/2445
The patronictl code tries to initialize DCS twice, first for the current Citus group and then for the selected group. However, kubernetes.py was overwriting the namespace config. As a result, after the second initialization patronictl was trying to work with the `default` namespace instead of the configured one.
* bump version
* update release notes
* removed 2.7, 3.4, 3.5, and 3.6 from supported versions in setup.py
* switched GH actions back to ubuntu-latest, removed tests with 2.7 and 3.6, and added 3.11
* minor fixes in the Citus documentation and behave tests
Keep as much backward compatibility as possible.
The following changes were made:
1. All internal checks are performed as `role in ('master', 'primary')`
2. All internal variables/functions/methods are renamed
3. `GET /metrics` endpoint returns `patroni_primary` in addition to `patroni_master`.
4. Logs are changed to use leader/primary/member/remote depending on the context
5. Unit tests use only `role = 'primary'` instead of `'master'` to verify that item 1 works.
6. patronictl still supports the old syntax, but also accepts `--leader` and `--primary`.
7. `master_(start|stop)_timeout` is automatically translated to `primary_(start|stop)_timeout` if the latter is not set (see the sketch after this list).
8. Updated the documentation and some examples.
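A minimal sketch of the translation from item 7 (illustrative, not the exact code):
```python
def translate_deprecated_timeouts(config: dict) -> dict:
    result = dict(config)
    for action in ('start', 'stop'):
        old_name = 'master_{0}_timeout'.format(action)
        new_name = 'primary_{0}_timeout'.format(action)
        # the deprecated master_* value is used only when primary_* is not set
        if new_name not in result and old_name in result:
            result[new_name] = result[old_name]
    return result


# translate_deprecated_timeouts({'master_start_timeout': 300})
# -> {'master_start_timeout': 300, 'primary_start_timeout': 300}
```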
Future plan: in the next major release switch role name from `master` to `primary` and maybe drop `master` altogether.
The Kubernetes implementation will require more work and will keep two labels in parallel. Label values should probably be configurable as described in https://github.com/zalando/patroni/issues/2495.
It doesn't like relative imports and does not recognise `http.server` imported with `six`.
The latter is explicitly added to the list of `hiddenimports()` and will break compatibility with Python 2.7, whose support will be dropped in the next Patroni release anyway.
Close https://github.com/zalando/patroni/issues/2535
The Citus cluster (coordinator and workers) will be stored in DCS as a fleet of Patroni clusters logically grouped together:
```
/service/batman/
/service/batman/0/
/service/batman/0/initialize
/service/batman/0/leader
/service/batman/0/members/
/service/batman/0/members/m1
/service/batman/0/members/m2
/service/batman/1/
/service/batman/1/initialize
/service/batman/1/leader
/service/batman/1/members/
/service/batman/1/members/m1
/service/batman/1/members/m2
...
```
Where 0 is the Citus group of the coordinator and 1, 2, etc. are worker groups.
Such a hierarchy allows reading the entire Citus cluster with a single call to DCS (except for Zookeeper).
The get_cluster() method will read the entire Citus cluster on the coordinator, because it needs to discover workers. For a worker cluster it will read only the subtree of its own group.
Besides that we introduce a new method get_citus_coordinator(). It will be used only by worker clusters.
Since there are no hierarchical structures on K8s, we will use the Citus group as a suffix on all objects that Patroni creates.
E.g.
```
batman-0-leader # the leader config map for the coordinator
batman-0-config # the config map holding initialize, config, and history "keys"
...
batman-1-leader # the leader config map for worker group 1
batman-1-config
...
```
Citus integration is enabled from patroni.yaml:
```yaml
citus:
  database: citus
  group: 0 # 0 is for coordinator, 1, 2, etc are for workers
```
If enabled, Patroni will create the database and the citus extension in it, and will insert into `pg_dist_authinfo` the information required for Citus nodes to communicate with each other, i.e. the 'password', 'sslcert', and 'sslkey' of the superuser, if they are defined in the Patroni configuration file.
When the new Citus coordinator/worker is bootstrapped, Patroni adds `synchronous_mode: on` to the `bootstrap.dcs` section.
Besides that, Patroni takes over management of some Postgres GUCs (a short sketch follows the list):
- `shared_preload_libraries` - Patroni ensures that "citus" is placed first in the list
- `max_prepared_transactions` - if not set or set to 0, Patroni changes the value to `max_connections*2`
- `wal_level` - automatically set to `logical`. It is used by Citus to move/split shards. Under the hood Citus creates/removes replication slots, and they are automatically added by Patroni to the `ignore_slots` configuration to avoid accidental removal.
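A simplified sketch of these adjustments (illustrative only; `parameters` stands for the dict of Postgres GUCs Patroni is about to apply):
```python
def adjust_citus_gucs(parameters: dict) -> dict:
    libraries = [v.strip() for v in parameters.get('shared_preload_libraries', '').split(',') if v.strip()]
    if 'citus' in libraries:
        libraries.remove('citus')
    parameters['shared_preload_libraries'] = ','.join(['citus'] + libraries)

    if not int(parameters.get('max_prepared_transactions', 0) or 0):
        parameters['max_prepared_transactions'] = int(parameters['max_connections']) * 2

    parameters['wal_level'] = 'logical'
    return parameters


# adjust_citus_gucs({'max_connections': 100, 'shared_preload_libraries': 'pg_stat_statements'})
# -> citus is prepended, max_prepared_transactions becomes 200, wal_level becomes 'logical'
```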
The coordinator primary actively discovers worker primary nodes and registers/updates them in the `pg_dist_node` table using the citus_add_node() and citus_update_node() functions.
Patroni running on the coordinator provides the new REST API endpoint: `POST /citus`. It is used by workers to facilitate controlled switchovers and restarts of worker primaries.
When the worker primary needs to shut down Postgres because of a restart or switchover, it calls the `POST /citus` endpoint on the coordinator, and Patroni on the coordinator starts a transaction and calls `citus_update_node(nodeid, 'host-demoted', port)` in order to pause client connections that work with the given worker.
Once the new leader is elected or Postgres is started back up, it performs another call to the `POST /citus` endpoint, which does another `citus_update_node()` call with the actual hostname and port and commits the transaction. After the transaction is committed, the coordinator reestablishes connections to the worker node and client connections are unblocked.
If clients don't run long transactions, the operation finishes without client-visible errors, only with a short latency spike.
All operations on `pg_dist_node` are serialized by Patroni on the coordinator. This gives more control and allows rolling back an in-progress transaction if its lifetime exceeds a certain threshold and there are other worker nodes that should be updated.
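A greatly simplified sketch of the coordinator-side handling (illustrative only; `conn` is a libpq connection, e.g. psycopg2, with autocommit disabled, and the real code additionally serializes and times out requests):
```python
def pause_worker_traffic(conn, nodeid, port):
    # psycopg2 opens a transaction implicitly; it stays open until commit()
    cur = conn.cursor()
    # pointing pg_dist_node to a non-existing host pauses client connections
    # that are routed to this worker
    cur.execute('SELECT citus_update_node(%s, %s, %s)', (nodeid, 'host-demoted', port))
    return cur


def resume_worker_traffic(conn, cur, nodeid, host, port):
    # called from the second POST /citus request, once the new leader is up
    # or the old one is running again
    cur.execute('SELECT citus_update_node(%s, %s, %s)', (nodeid, host, port))
    conn.commit()  # client connections are unblocked after the commit
```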
When the `synchronous_standby_names` GUC is changed, PostgreSQL almost immediately starts reporting the corresponding walsenders as synchronous, while in fact they may not have reached this state yet. To mitigate this problem we memorize the current flush LSN on the primary right after the change of `synchronous_standby_names` became visible and use it as an additional check for walsenders.
A walsender will be counted as truly "sync" only when its write/flush/replay LSN has reached the memorized LSN and its `application_name` is known to be a part of `synchronous_standby_names`.
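Roughly, the check looks like this (illustrative names; LSNs compared as integers):
```python
def is_truly_synchronous(walsender: dict, memorized_lsn: int, ssn_members: set) -> bool:
    """walsender: a row from pg_stat_replication; ssn_members: lower-cased names
    parsed from synchronous_standby_names."""
    if walsender['application_name'].lower() not in ssn_members:
        return False
    # don't trust sync_state until the standby has caught up to the LSN that was
    # memorized right after the change of synchronous_standby_names became visible
    lsn = walsender.get('flush_lsn') or walsender.get('write_lsn') or walsender.get('replay_lsn')
    return lsn is not None and lsn >= memorized_lsn
```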
The size of the PR is mostly related to refactoring: the code responsible for working with `synchronous_standby_names` and `pg_stat_replication` was moved to a dedicated file.
The `parse_sync_standby_names()` function was mostly copied from #672.
1. Fix the problem with logical slots not advancing when only the primary lost access to DCS.
2. Don't let Patroni join as a Raft voting member when running failsafe behave tests. This allows testing exactly the same conditions as for other DCS.
3. Speed up dcs_failsafe_mode behave tests by getting rid of long sleeps, slightly reshuffling the places where we start/stop the outage, and killing Patroni/Postgres to avoid a long shutdown due to leader key removal attempts.
When a replication slot is not registered with Patroni but is active, Patroni would log an error during each HA cycle in certain conditions (after a restart or role change). To avoid this, first check if the replication slot we are about to drop is still active and if so, only log a warning. Otherwise, log the slot we are dropping for informational purposes.
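Schematically (hypothetical helper, not the exact Patroni code):
```python
import logging

logger = logging.getLogger(__name__)


def drop_unknown_slot(cursor, slot: dict) -> None:
    """slot: a row from pg_replication_slots represented as a dict."""
    if slot['active']:
        logger.warning('Unable to drop replication slot %s because it is still active', slot['slot_name'])
        return
    logger.info('Dropping unused replication slot %s', slot['slot_name'])
    cursor.execute('SELECT pg_drop_replication_slot(%s)', (slot['slot_name'],))
```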
Close: #2499
If a PR is opened from an external GH repo, secrets are not set due to security reasons. This makes the codacy coverage report fail.
Co-authored-by: Polina Bungina <bungina@gmail.com>
If enabled, it will allow Patroni to cope with DCS outages.
In case of a DCS outage the leader tries to call all remaining members of the cluster via the REST API, and if all of them respond with success, the leader will not be demoted.
The failsafe_mode can be enabled by running
```sh
patronictl edit-config -s failsafe_mode=true
```
or by calling the `/config` REST API endpoint.
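For example, a hedged equivalent via the REST API (assuming it listens on localhost:8008):
```python
import requests

# PATCH /config merges the given JSON into the dynamic configuration
requests.patch('http://localhost:8008/config', json={'failsafe_mode': True})
```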
Co-authored-by: Polina Bungina <bungina@gmail.com>
They frequently fail because sometimes replicas are a bit slow to realize that they are synchronous. Instead of introducing more sleeps we will poll for the required HTTP status code with some timeout.
If the cluster is stable (no nodes are joining/leaving/lagging) we want to run at most one monitoring query per HA loop. So far it worked perfectly, except when synchronous_mode is enabled, where we run two additional queries:
1. SHOW synchronous_mode
2. SELECT ... FROM pg_stat_replication
In order to solve it, we will include these "queries" in the common monitoring query if synchronous_mode is enabled.
In addition to that, we make sure that `synchronous_standby_names` is reset on replicas that used to be a primary, and avoid using replicas which are not in the 'running' state.
P.S.: in the monitoring query we also extract the current value of `synchronous_standby_names`, because it will be useful for the quorum commit feature.
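Conceptually, something like this (illustrative SQL, not the exact query Patroni runs):
```python
BASE_MONITOR_QUERY = (
    "SELECT pg_catalog.pg_is_in_recovery(),"
    " CASE WHEN pg_catalog.pg_is_in_recovery() THEN pg_catalog.pg_last_wal_replay_lsn()"
    " ELSE pg_catalog.pg_current_wal_lsn() END"
)

SYNC_STATE_ADDON = (
    ", pg_catalog.current_setting('synchronous_standby_names'),"
    " (SELECT pg_catalog.json_agg(r) FROM (SELECT application_name, sync_state, replay_lsn"
    " FROM pg_catalog.pg_stat_replication WHERE state = 'streaming') r)"
)


def build_monitor_query(synchronous_mode: bool) -> str:
    # a single round trip per HA loop, even when synchronous_mode is enabled
    return BASE_MONITOR_QUERY + (SYNC_STATE_ADDON if synchronous_mode else '')
```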
Close https://github.com/zalando/patroni/issues/2469
- the new MacOS doesn't play well with old go binaries (bump etcd)
- use brew to install Postgres and expect (unbuffer, to make behave output colorful) and use the latest version
- upload failed logs instead of grepping them to stdout
Otherwise, the etcd (not etcd3) behave tests fail to connect:
```
Jan 02 09:56:18 HOOK-ERROR in before_all: AssertionError: etcd instance is not available for queries after 5 seconds
```
When doing the leader race we need to check that the former primary isn't alive anymore. For that we relied on non-inclusive terms. In order to simplify future work on getting rid of all non-inclusive words, we change the check to rely on a difference in the format of the wal/xlog field: there is only "location" for the primary, and "replayed_location" + "received_location" for standbys.
In addition to that we start supporting the "wal" field as well as the deprecated "xlog".
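A hedged sketch of the check, assuming `data` is the JSON returned by the member's REST API:
```python
def looks_like_running_primary(data: dict) -> bool:
    wal = data.get('wal') or data.get('xlog') or {}
    # a running primary reports a plain 'location', while standbys report
    # 'replayed_location' / 'received_location'
    return 'location' in wal
```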
Co-authored-by: Polina Bungina <bungina@gmail.com>