Patroni was falsely assuming that timelines had diverged.
For pg_rewind this didn't create any problem, but if pg_rewind is not allowed and `remove_data_directory_on_diverged_timelines` is set, it resulted in reinitializing the former leader.
Close https://github.com/zalando/patroni/issues/2220
This makes it possible to have multiple hosts in a standby_cluster and ensures that the standby leader follows the main cluster's new leader after a switchover.
Partially addresses #2189
This PR adds metrics exposing additional information:
- Whether a node or the cluster is pending restart,
- Whether cluster management is paused.
This may be useful for Prometheus/Grafana monitoring.
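A quick way to check the new values, as a minimal sketch (the endpoint port and the metric names below are illustrative assumptions, not taken from this PR):
```python
from urllib.request import urlopen

# Scrape Patroni's /metrics endpoint (assuming the REST API listens on localhost:8008)
# and print the gauges related to this PR; the metric names are illustrative.
metrics = urlopen('http://localhost:8008/metrics').read().decode()
for line in metrics.splitlines():
    if line.startswith(('patroni_pending_restart', 'patroni_is_paused')):
        print(line)
```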
Close #2198
When switching certificates there is a race condition with concurrent API requests: if one is active during the replacement, the replacement errors out with a "port in use" error and Patroni gets stuck in a state without an active API server.
The fix is to call `server_close()` after `shutdown()`, which waits for already running requests to complete before returning.
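A minimal sketch of the pattern with Python's stdlib `ThreadingHTTPServer` (not Patroni's actual server class; the port and handler are placeholders):
```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'ok')

server = ThreadingHTTPServer(('127.0.0.1', 8008), Handler)
thread = threading.Thread(target=server.serve_forever)
thread.start()

# Time to reload certificates: stop accepting new connections first.
server.shutdown()       # stops the serve_forever() loop
server.server_close()   # joins in-flight request threads (block_on_close=True by default)
                        # and closes the listening socket, so binding a new server to the
                        # same port can't fail with "address already in use"
thread.join()
```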
Close #2184
It could be that `remove_data_directory_on_diverged_timelines` is set, but there is no `rewind_credentials` defined and superuser access between nodes is not allowed.
Close https://github.com/zalando/patroni/issues/2162
- Simplify setup.py: remove unneeded features and get rid of deprecation warnings
- Compatibility with Python 3.10: handle `threading.Event.isSet()` deprecation
- Make sure setup.py can run without `six`: move the Patroni class and the main function to `__main__.py`. The `__init__.py` keeps only a few functions used by the Patroni class and by setup.py
When figuring out which slots should be created on a cascading standby, we forgot to take into account that the leader might be absent.
Close: https://github.com/zalando/patroni/issues/2137
When starting postgres after bootstrap of the standby-leader, the `follow()` method used to always return `True`.
This behavior was changed in #2054 in order to avoid hammering the logs if postgres fails to start.
Since the method now returns `None` if postgres didn't start accepting connections after 60s, the change broke the standby-leader bootstrap code.
As a solution, we assume that the clone was successful if the `follow()` method returned anything other than `False`.
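In effect, the check becomes something like the following sketch (variable names are illustrative, not Patroni's actual code):
```python
# `follow()` may now return True, False, or None (None == postgres didn't start
# accepting connections within 60s); only an explicit False means the clone failed.
result = postgresql.follow(standby_leader_config)
clone_successful = result is not False
```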
- Remove codacy steps: they removed legacy organizations and there seems to be no easy way of installing the codacy app into the Zalando GH organization.
- Don't run behave on macOS: recently the workers became way too slow
- Disable behave for combination of kubernetes and python 2.7
- Remove python 3.5 (it will be removed by GH from workers in January) and add 3.10
- Run behave with 3.6 and 3.9 instead of 3.5 and 3.8
When `restore_command` is configured, Postgres tries to fetch/apply all available WAL segments and also fetches history files in order to select the correct timeline. This could result in a situation where the new history file is missing some timelines.
Example:
- node1 demotes/crashes on timeline 1
- node2 promotes to timeline 2, archives `00000002.history`, and crashes
- node1 recovers as a replica, "replays" `00000002.history` and promotes to timeline 3
As a result, the `00000003.history` will not have the line with timeline 2, because it never replayed any WAL segment from it.
The `pg_rewind` tool is supposed to handle such a case correctly when rewinding node2 from node1, but Patroni, when deciding whether the rewind should happen, was searching for the exact timeline in the history file from the new primary.
The solution is to assume that a rewind is required if the current replica timeline is missing from that history file.
In addition, this PR makes sure that the primary isn't running in recovery before starting the rewind check.
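A rough sketch of the idea (the history-file parsing is simplified and the function name is illustrative, not Patroni's actual code):
```python
def rewind_needed(history_file_content: str, replica_timeline: int) -> bool:
    """Decide whether pg_rewind should run, based on the new primary's history file.

    Each history-file line is tab-separated: parent timeline, switchpoint LSN, reason.
    If the replica's current timeline doesn't appear in the file at all (e.g. the
    primary never replayed any WAL from it), assume the timelines have diverged.
    """
    known_timelines = set()
    for line in history_file_content.splitlines():
        fields = line.strip().split('\t')
        if fields and fields[0].isdigit():
            known_timelines.add(int(fields[0]))
    return replica_timeline not in known_timelines
```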
Close https://github.com/zalando/patroni/issues/2118 and https://github.com/zalando/patroni/issues/2124
Previously, sync nodes were selected based on replication lag alone, and hence a node with the `nofailover` tag had the same chance of becoming synchronous as any other node. That behavior was both confusing and dangerous, because in case of a failed primary the failover couldn't happen automatically.
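The selection now boils down to something like this sketch (member attributes and the lag threshold name are placeholders, not Patroni's actual API):
```python
# Exclude members carrying the nofailover tag before applying the lag filter,
# so a node that can never be promoted is never picked as synchronous.
candidates = [m for m in members
              if not m.tags.get('nofailover') and m.lag <= max_allowed_lag]
```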
Close https://github.com/zalando/patroni/issues/2089
1. Avoid doing CHECKPOINT if `pg_control` is already updated.
2. Explicitly call `ensure_checkpoint_after_promote()` right after the bootstrap has finished successfully.
1. The `client_address` tuple may have more than two elements in the case of IPv6 (see the sketch after this list)
2. Return `cluster_unlocked` only when the value is true and handle it accordingly in `do_GET_metrics()`
3. Return `cluster_unlocked` and `dcs_last_seen` even if Postgres isn't running or queries are timing out
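For the first point, a minimal sketch of the safe unpacking inside a `BaseHTTPRequestHandler` subclass:
```python
# client_address is (host, port) for IPv4 connections, but
# (host, port, flowinfo, scope_id) for IPv6, so take only the first two elements.
host, port = self.client_address[:2]
```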
Close https://github.com/zalando/patroni/issues/2113
When deciding whether a ZNode should be updated, we rely on the cached version of the cluster, which is refreshed only when member ZNodes are deleted/created or the `/status`, `/sync`, `/failover`, `/config`, or `/history` ZNodes are updated.
That is, after an update of the current member ZNode succeeds, the cache becomes stale and all further updates are always performed even if the value didn't change. In order to solve this, we introduce a new attribute in the Zookeeper class and use it to memorize the last written value for later comparison.
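The idea, as a simplified sketch (attribute and method names are illustrative, not Patroni's actual code):
```python
def touch_member(self, data: bytes) -> bool:
    """Skip the ZooKeeper round-trip if the member data didn't change."""
    if data == self._last_member_data:        # cached copy of the last successfully written value
        return True
    self._client.set(self.member_path, data)  # kazoo KazooClient.set()
    self._last_member_data = data
    return True
```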
Specifically, if `postgresql.unix_socket_directories` is not set.
In this case Patroni is supposed to use only the port in the connection string, but the `get_replication_connection_cursor()` method defaulted to `host='localhost'`
Add a configuration option (`set_acls`) for Zookeeper DCS so that Kazoo will apply a default ACL for each znode that it creates. The intention is to improve security of the znodes when a single Zookeeper cluster is used as the DCS for multiple Patroni clusters.
Zookeeper [does not apply an ACL to child znodes](https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#sc_ZooKeeperAccessControl), so permissions can't be set at the `scope` level and then be inherited by other znodes that Patroni creates.
Kazoo instead [provides an option for configuring a default_acl](https://kazoo.readthedocs.io/en/latest/api/client.html#kazoo.client.KazooClient.__init__) that will be applied on node creation.
Example configuration in Patroni might then be:
```
zookeeper:
  set_acls:
    CN=principal1: [ALL]
    CN=principal2:
      - READ
```
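For reference, this roughly translates to the following on the Kazoo side (a sketch; the `x509` ACL scheme and host name are assumptions for illustration):
```python
from kazoo.client import KazooClient
from kazoo.security import make_acl

# Roughly equivalent to the YAML above: full access for principal1, read-only for principal2.
default_acl = [
    make_acl('x509', 'CN=principal1', all=True),
    make_acl('x509', 'CN=principal2', read=True),
]
# Kazoo applies default_acl to every znode it creates.
client = KazooClient(hosts='zookeeper:2181', default_acl=default_acl)
```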
1. Fix two `TypeError` exceptions raised from the `ApiClient.request()` method
2. Use the `_retry()` wrapper function instead of a callable object in `_update_leader_with_retry()` when trying to work around concurrent updates of the leader object.
In some extreme cases Postgres can be so slow that the normal monitoring query doesn't finish within a few seconds. This results in an exception being raised from the `Postgresql._cluster_info_state_get()` method, which could lead to postgres not being demoted in time.
In order to make it reliable, we catch the exception and use the cached state of postgres (`is_running()` and `role`) to determine whether postgres is running as a primary.
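Roughly, the fallback looks like this sketch (simplified; not Patroni's actual code):
```python
def is_primary(postgres):
    """Decide whether postgres runs as a primary even when the monitoring query is too slow."""
    try:
        return not postgres.query('SELECT pg_is_in_recovery()')[0][0]
    except Exception:
        # The query timed out or failed: fall back to the cached state instead of
        # raising, so the HA loop can still demote postgres in time if necessary.
        return postgres.is_running() and postgres.role == 'master'
```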
Close https://github.com/zalando/patroni/issues/2073
While demoting due to a failure to update the leader lock, it could happen that the DCS goes completely down and the `get_cluster()` call raises an exception.
If not handled properly, this results in postgres remaining stopped until the DCS recovers.
Sphinx's `add_stylesheet()` has been deprecated for a long time and was removed in recent versions of Sphinx. If available, use `add_css_file()` instead.
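A common backward-compatible pattern in `conf.py` (a sketch; the CSS file name is just an example):
```python
def setup(app):
    # add_css_file() exists since Sphinx 1.8; fall back to the old name on older versions.
    if hasattr(app, 'add_css_file'):
        app.add_css_file('custom.css')
    else:
        app.add_stylesheet('custom.css')
```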
Close #2079.
For various reasons, WAL archiving on the primary may get stuck or be significantly delayed. If we try to do a switchover or shut the primary down, the shutdown will take forever and will not finish until the whole backlog of WALs is processed.
In the meantime, Patroni keeps updating the leader lock, which prevents other nodes from starting the leader race even if it is known that they received/applied all changes.
The `Database cluster state:` is changed to `"shut down"` after:
- all data is fsynced to disk and the latest checkpoint is written to WAL
- all streaming replicas confirmed that they received all changes (including the latest checkpoint)
At the same time, the archiver process continues to do its job and the postmaster process is still running.
In order to solve this problem and make the switchover more reliable/fast when `archive_command` is slow or failing, Patroni will remove the leader key immediately after `pg_controldata` starts reporting PGDATA as cleanly `"shut down"` and it has verified that there is at least one replica that received all changes. If no replica fulfills this condition, the leader key isn't removed and the old behavior is retained, i.e. Patroni keeps updating it.
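The first part of that check can be sketched as follows (simplified; Patroni's real implementation also compares the replicas' received LSNs with the shutdown checkpoint location):
```python
import subprocess

def shut_down_cleanly(pgdata: str) -> bool:
    """Return True if pg_controldata reports the data directory as cleanly shut down."""
    out = subprocess.check_output(['pg_controldata', pgdata]).decode()
    for line in out.splitlines():
        if line.startswith('Database cluster state'):
            return line.split(':', 1)[1].strip() == 'shut down'
    return False
```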
This field notes the last time (as unix epoch) a cluster member has successfully communicated with the DCS. This is useful to identify and/or analyze network partitions.
Also, expose `dcs_last_seen` in the `MemberStatus` class and its `from_api_response()` method.
The bigger the gap between the slot flush LSN and the LSN we want to advance to, the more time it takes for the call to finish.
Once it starts failing, the "lag" will grow more or less indefinitely, which has the following negative side-effects:
1. Size of pg_wal on the replica will grow
2. Since `hot_standby_feedback` is forcefully enabled, the primary will stop cleaning up dead tuples
That is, we are not only in danger of running out of disk space, but also increase the chances of transaction wraparound.
In order to mitigate this, we set `statement_timeout` to 0 before calling `pg_replication_slot_advance()`.
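In SQL terms, the call made on the replica becomes roughly the following (a psycopg2 sketch; connection and slot details are placeholders):
```python
# Disable the statement timeout for this session only, then advance the slot.
with conn.cursor() as cur:
    cur.execute("SET statement_timeout TO 0")
    cur.execute("SELECT pg_catalog.pg_replication_slot_advance(%s, %s)",
                (slot_name, target_lsn))
```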
Since the call is happening from the main HA loop and could take more than `loop_wait`, the next heartbeat run could be delayed.
There is also a possibility that the call could take longer than `ttl` and the member key/session in DCS for the given replica expires. However, the slot LSN in DCS is updated by the primary every `loop_wait` seconds, so we don't expect the `slot_advance()` call to take significantly longer than `loop_wait`, and therefore the chances of the member key/session expiring are very low.