High CPU load on the Etcd nodes and K8s API servers created a very strange situation. A few clusters were running without a leader, and the pod that was ahead of the others was failing to take the leader lock because updates were failing with HTTP response code `409` (`resource_version` mismatch).
Effectively that means the TCP connections to the K8s master nodes were alive (otherwise TCP keepalives would have resolved it), but no `UPDATE` events were arriving via these connections, resulting in a stale in-memory cache of the cluster.
The only good way to prevent this situation is to intercept 409 HTTP responses and terminate existing TCP connections used for watches.
Now a few words about the implementation. Unfortunately, watch threads spend most of their time waiting in the `read()` call and there is no good way to interrupt them. However, `socket.shutdown()` seems to do the job. We already used this trick in the Etcd3 implementation.
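For illustration, here is a minimal standalone sketch of the trick (not Patroni's actual watch code): a thread blocked in `recv()` is unblocked by calling `socket.shutdown()` from another thread.

```python
import socket
import threading

listener = socket.socket()
listener.bind(('127.0.0.1', 0))
listener.listen(1)

client = socket.create_connection(listener.getsockname())
server_side, _ = listener.accept()

def watch():
    # Blocks here, just like a watch thread stuck in read() on the response
    # stream; no data will ever arrive in this example.
    data = client.recv(4096)
    print('recv() returned %r, the watch thread can exit cleanly' % data)

thread = threading.Thread(target=watch)
thread.start()

# Interrupt the blocked read from another thread: recv() returns b'' immediately.
client.shutdown(socket.SHUT_RDWR)
thread.join()
```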
This approach helps to mitigate the issue of not having a leader, but replicas might still end up with a stale cluster state cached and in the worst case will not stream from the leader. Non-streaming replicas are less dangerous and can be covered by monitoring and partially mitigated by correctly configured `archive_command` and `restore_command`.
The current state of permanent logical replication slots on the primary is queried together with `pg_current_wal_lsn()`, and hence they "fail" simultaneously if Postgres isn't yet ready to accept connections; in this case we want to avoid updating the `/status` key altogether.
On K8s we don't use a dedicated object for the `/status` key, but use the same object (Endpoint or ConfigMap) as for the leader. If `last_lsn` isn't set we avoid patching the corresponding annotation, but the `slots` annotation was reset due to an oversight.
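A rough sketch of the intended behavior (the helper and annotation keys are placeholders, not Patroni's exact names): when `last_lsn` could not be obtained, the whole `/status` update is skipped instead of resetting individual annotations.

```python
import json

def build_status_annotations(last_lsn, slots):
    # Postgres isn't accepting connections yet: skip the whole update so that
    # neither the LSN nor the slots annotation gets wiped by accident.
    if last_lsn is None:
        return None
    annotations = {'last_lsn': str(last_lsn)}
    if slots is not None:
        annotations['slots'] = json.dumps(slots)
    return annotations

print(build_status_annotations(None, {'my_slot': 1234}))      # None -> nothing is patched
print(build_status_annotations(67108864, {'my_slot': 1234}))  # both annotations updated
```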
When starting as a replica it may take some time before Postgres starts accepting new connections, but in the meantime the leader could have transitioned to a different member, in which case `primary_conninfo` must be updated.
On Postgres prior to v12, Patroni regularly checks `recovery.conf` in order to verify that recovery parameters match the expectation. Starting from v12, recovery parameters were converted to GUCs and Patroni gets their current values from the `pg_settings` view. The latter creates a problem when it takes more than a minute for Postgres to start accepting new connections.
Since Patroni attempts to execute at least `pg_is_in_recovery()` in every HA loop, and that call raises an exception while Postgres isn't accepting connections, `check_recovery_conf()` effectively wasn't reachable until recovery finished; this changed when #2082 was introduced.
As a result of #2082 we got the following behavior:
1. Up to (but not including) v12 everything was working as expected
2. v12 and v13 - Patroni restarts Postgres after 1 minute of recovery
3. v14+ - `check_recovery_conf()` is not executed because the `replay_paused()` method raises an exception.
In order to properly handle changes of recovery parameters or the leader transitioning to a different node on v12+, we rely on the cached values of recovery parameters until Postgres becomes ready to execute queries.
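A rough sketch of the fallback (method and attribute names are assumptions, not Patroni's internals): refresh the values from `pg_settings` only when the query succeeds, otherwise keep the cached ones.

```python
class RecoveryParams:
    """Keeps the last known recovery parameters for comparison."""

    def __init__(self, params_from_config_files):
        self._cached = dict(params_from_config_files)

    def current(self, query):
        """`query` returns (name, setting) rows from pg_settings or raises."""
        try:
            rows = query("SELECT name, setting FROM pg_catalog.pg_settings"
                         " WHERE name IN ('primary_conninfo', 'primary_slot_name',"
                         " 'restore_command', 'recovery_target_timeline')")
        except Exception:
            # Postgres isn't ready to execute queries yet: keep the cached
            # values read from postgresql.conf/postgresql.auto.conf.
            return self._cached
        self._cached = dict(rows)
        return self._cached
```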
Close https://github.com/zalando/patroni/issues/2289
The logical slot on a replica is safe to use when the replica's physical slot on the primary:
1. has a nonzero/non-null `catalog_xmin`
2. has a `catalog_xmin` that is not newer (greater) than the `catalog_xmin` of any slot on the standby
3. has a `catalog_xmin` that is known to have overtaken the `catalog_xmin` of the logical slots on the primary observed during step 1
If condition 1 doesn't hold, Patroni runs an additional check of whether `hot_standby_feedback` is actually in effect and shows a warning if it is not.
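A heavily simplified sketch of the three conditions (made-up variable names, xids treated as plain increasing integers; the real check has to deal with xid wraparound):

```python
def logical_slots_safe(physical_catalog_xmin,
                       standby_catalog_xmins,
                       primary_logical_catalog_xmins_at_step_1):
    if not physical_catalog_xmin:                                      # condition 1
        return False
    if any(physical_catalog_xmin > x for x in standby_catalog_xmins):  # condition 2
        return False
    return all(physical_catalog_xmin >= x                              # condition 3
               for x in primary_logical_catalog_xmins_at_step_1)
```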
Since Kubernetes v1.21, with the projected service account token feature, service account tokens expire after 1 hour. Kubernetes clients are expected to re-read the token file to refresh the token.
This patch re-reads the token file every minute for the in-cluster config.
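A minimal sketch of the idea, assuming the standard in-cluster token path and a hard-coded one-minute interval (not the exact patch):

```python
import time

TOKEN_PATH = '/var/run/secrets/kubernetes.io/serviceaccount/token'
TOKEN_REFRESH_INTERVAL = 60  # seconds

class TokenCache:
    def __init__(self, path=TOKEN_PATH):
        self._path = path
        self._token = None
        self._read_at = 0.0

    @property
    def token(self):
        # Re-read the projected token file at most once per minute and use
        # the cached value in between.
        now = time.time()
        if self._token is None or now - self._read_at >= TOKEN_REFRESH_INTERVAL:
            with open(self._path) as f:
                self._token = f.read().strip()
            self._read_at = now
        return self._token
```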
Fixes #2286
Signed-off-by: Haitao Li <hli@atlassian.com>
A couple of times we have seen in the wild that the database for the permanent logical slots was changed in the Patroni config.
It resulted in the following situation.
On the primary:
1. The slot must be dropped before creating it in a different DB.
2. Patroni fails to drop it because the slot is in use.
On a replica:
1. Patroni notices that the slot exists in the wrong DB and successfully drops it.
2. Patroni copies the existing slot from the primary by its name, which requires a Postgres restart.
The loop repeats as long as the "wrong" slot exists on the primary.
Basically, replicas are continuously restarting, which badly affects availability.
In order to solve the problem, we will perform additional checks while copying replication slot files from the primary and discard them if `slot_type`, `database`, or `plugin` don't match our expectations.
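An illustrative sketch of the check (the helper is made up; the field names follow `pg_replication_slots`):

```python
def copied_slot_matches(copied, expected):
    """Return True if the slot file copied from the primary may be kept."""
    return all(copied.get(key) == expected.get(key)
               for key in ('slot_type', 'database', 'plugin'))

copied = {'slot_type': 'logical', 'database': 'olddb', 'plugin': 'pgoutput'}
expected = {'slot_type': 'logical', 'database': 'newdb', 'plugin': 'pgoutput'}
assert not copied_slot_matches(copied, expected)  # wrong database -> discard the copy
```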
On Debian/Ubuntu systems it is common to keep Postgres config files outside of the data directory.
It created a couple of problems for pg_rewind support in Patroni.
1. The `--config_file` argument must be supplied when figuring out the `restore_command` GUC value on Postgres v12+
2. With Postgres v13+, pg_rewind by itself can't find postgresql.conf in order to figure out `restore_command`, and therefore we have to use Patroni as a fallback for fetching missing WALs that are required for rewind.
This commit addresses both problems.
In case of DCS unavailability Patroni restarts Postgres in read-only mode.
This causes pg_control to be updated with `Database cluster state: in archive recovery` and could also set the `MinRecoveryPoint`.
The next time Patroni is started it will assume that Postgres was running as a replica and a rewind isn't required, and it will try to start Postgres. In this situation there is a chance that the start will be aborted with a FATAL error message like `requested timeline 2 does not contain minimum recovery point 0/501E8B8 on timeline 1`.
On the next heartbeat Patroni will again notice that Postgres isn't running, which leads to another failed start attempt.
This loop is endless.
In order to mitigate the problem we do the following (a minimal sketch follows the list):
1. While figuring out whether the rewind is required we consider `in archive recovery` along with `shut down in recovery`.
2. If pg_rewind is required and the cluster state is `in archive recovery`, we also perform recovery in single-user mode.
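A minimal sketch of the state check, using `pg_controldata` output for simplicity (simplified compared to Patroni's actual pg_control handling):

```python
import subprocess

def data_directory_state(data_dir):
    out = subprocess.check_output(['pg_controldata', data_dir]).decode()
    for line in out.splitlines():
        if line.startswith('Database cluster state:'):
            return line.split(':', 1)[1].strip()
    return None

state = data_directory_state('/var/lib/postgresql/data')
# Both states now trigger the rewind check; for 'in archive recovery' a clean
# single-user-mode recovery is performed first.
needs_rewind_check = state in ('shut down in recovery', 'in archive recovery')
```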
Close https://github.com/zalando/patroni/issues/2242
Patroni was falsely assuming that timelines had diverged.
For pg_rewind this didn't create any problem, but if pg_rewind is not allowed and `remove_data_directory_on_diverged_timelines` is set, it resulted in reinitializing the former leader.
Close https://github.com/zalando/patroni/issues/2220
This allows having multiple hosts in a `standby_cluster` and ensures that the standby leader follows the main cluster's new leader after a switchover.
Partially addresses #2189
This PR adds metrics for additional information:
- whether a node or cluster is pending restart,
- whether the cluster management is paused.
This may be useful for Prometheus/Grafana monitoring.
Close #2198
When switching certificates there is a race condition with a concurrent API request: if one is active during the replacement period, the replacement errors out with a "port in use" error and Patroni gets stuck in a state without an active API server.
The fix is to call `server_close()` after `shutdown()`, which waits for already running requests to complete before returning.
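A minimal sketch of the ordering with the stdlib threading HTTP server, which Patroni's REST API builds on (simplified; port and handler are placeholders):

```python
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler
import threading

server = ThreadingHTTPServer(('127.0.0.1', 0), BaseHTTPRequestHandler)
thread = threading.Thread(target=server.serve_forever)
thread.start()

# Certificate replacement: stop accepting new requests first, then wait for the
# in-flight ones and release the listening port before binding the new server.
server.shutdown()      # stops the serve_forever() loop
server.server_close()  # joins request threads and closes the listening socket
thread.join()
```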
Close #2184
It could be that `remove_data_directory_on_diverged_timelines` is set, but there is no `rewind_credentials` defined and superuser access between nodes is not allowed.
Close https://github.com/zalando/patroni/issues/2162
- Simplify setup.py: remove unneeded features and get rid of deprecation warnings
- Compatibility with Python 3.10: handle `threading.Event.isSet()` deprecation
- Make sure setup.py can run without `six`: move the Patroni class and the main function to `__main__.py`. The `__init__.py` keeps only a few functions used by the Patroni class and by setup.py
When figuring out which slots should be created on a cascading standby, we forgot to take into account that the leader might be absent.
Close: https://github.com/zalando/patroni/issues/2137
When starting Postgres after bootstrap of the standby leader, the `follow()` method used to always return `True`.
This behavior was changed in #2054 in order to avoid hammering the logs if Postgres fails to start.
Since the method now returns `None` if Postgres didn't start accepting connections within 60s, the change broke the standby-leader bootstrap code.
As a solution, we assume that the clone was successful if the `follow()` method returned anything other than `False`.
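A tiny sketch of the tri-state handling (names are illustrative):

```python
def clone_successful(follow_result):
    # follow() may return True (started), None (not accepting connections
    # within 60s yet), or False (hard failure); only False means the clone failed.
    return follow_result is not False

assert clone_successful(True)
assert clone_successful(None)        # still considered successful
assert not clone_successful(False)
```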
- Remove the Codacy steps: they removed legacy organizations and there seems to be no easy way of installing the Codacy app for the Zalando GitHub organization.
- Don't run behave on macOS: recently the workers became way too slow
- Disable behave for the combination of Kubernetes and Python 2.7
- Remove Python 3.5 (it will be removed by GH from the workers in January) and add 3.10
- Run behave with 3.6 and 3.9 instead of 3.5 and 3.8
When `restore_command` is configured, Postgres tries to fetch/apply all possible WAL segments and also fetches history files in order to select the correct timeline. This could result in a situation where the new history file is missing some timelines.
Example:
- node1 demotes/crashes on timeline 1
- node2 promotes to timeline 2 and archives `00000002.history` and crashes
- node1 recovers as a replica, "replays" `00000002.history` and promotes to timeline 3
As a result, `00000003.history` will not have a line for timeline 2, because node1 never replayed any WAL segment from it.
The `pg_rewind` tool is supposed to handle such a case correctly when rewinding node2 from node1, but when deciding whether a rewind should happen, Patroni was searching for the exact timeline in the history file from the new primary.
The solution is to assume that a rewind is required if the current replica timeline is missing from that history file.
In addition, this PR makes sure that the primary isn't running in recovery before starting the rewind check procedure.
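An illustrative sketch of the new decision (not Patroni's actual code; it assumes the replica's timeline is already known to be lower than the primary's):

```python
def rewind_required(history_file_content, replica_timeline):
    listed = set()
    for line in history_file_content.splitlines():
        fields = line.split('\t')  # "parentTLI <TAB> switchpoint <TAB> reason"
        if fields and fields[0].strip().isdigit():
            listed.add(int(fields[0]))
    # If our timeline isn't mentioned at all, we can't prove the histories
    # match, so assume a rewind is required.
    return replica_timeline not in listed

# The 00000003.history from the example above only mentions timeline 1:
history = '1\t0/5000000\tno recovery target specified\n'
assert rewind_required(history, 2)  # timeline 2 is missing -> rewind node2
```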
Close https://github.com/zalando/patroni/issues/2118 and https://github.com/zalando/patroni/issues/2124