During a network split some Etcd nodes may become stale while still being available for read requests. This is not a problem for the old primary, because such Etcd nodes are not writable and the primary demotes when it fails to update the leader lock.
When the network split resolves and the Etcd node reconnects to the cluster, it may trigger a leader election (in Etcd), which typically results in some failed client requests. That state resolves quickly and client requests can be retried, but it takes some time for the disconnected node to catch up. During this time it returns stale data for read requests that don't require quorum.
It could happen that the current Patroni primary is impacted by failed client requests while the Etcd cluster is doing leader elections, and there is a chance that it switches to the stale Etcd node, discovers that someone else is the leader, and demotes.
To protect against this situation we memorize the last known "term" of the Etcd cluster and, when executing client requests, compare the "term" reported by the Etcd node with the memorized one. This allows us to detect stale Etcd nodes and temporarily ignore them by switching to some other available Etcd node.
An alternative approach to solve this problem would be to use quorum/serializable reads for read requests, but that would increase resource usage on Etcd nodes.
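A minimal sketch of the idea (class and method names are illustrative, not Patroni's actual implementation): remember the highest raft "term" seen so far and treat any Etcd node that reports a lower term as stale.

```python
class EtcdTermTracker:
    """Illustrative only: detect stale Etcd nodes by comparing raft terms."""

    def __init__(self) -> None:
        self._last_known_term = 0

    def node_looks_current(self, reported_term: int) -> bool:
        # A node reporting a term lower than one we already observed is lagging
        # behind the rest of the cluster and should be temporarily ignored.
        if reported_term < self._last_known_term:
            return False
        self._last_known_term = reported_term
        return True
```

On a negative result the client would simply retry the request against another available Etcd endpoint.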
Close #3314
Fix the following errors in the section "*PostgreSQL parameters controlled by Patroni*":
1. Add the missing parameter `wal_log_hints`.
2. In fact, these controlled parameters are written into `postgresql.conf`.
3. In fact, these controlled parameters are passed as a list of arguments to `postgres` (not to `pg_ctl start`).
4. Add a note that `wal_keep_segments` and `wal_keep_size` are not passed to `postgres`.
Since 01d07f86c, the permissions of postgresql.conf created in PGDATA have been
explicitly set. However, the umask of the Patroni process was adjusted as well,
and as a result Patroni would write postgresql.conf with 600 permissions if the
configuration files are outside PGDATA.
Fix this by deriving the mode for files created outside PGDATA from the original umask.
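A minimal sketch of the approach, with illustrative helper names (the actual Patroni code may differ): capture the umask the process was started with and derive from it the mode for files created outside PGDATA.

```python
import os

# Capture the umask Patroni was started with *before* it gets adjusted
# (illustrative; the real code keeps this value elsewhere).
ORIGINAL_UMASK = os.umask(0)
os.umask(ORIGINAL_UMASK)


def write_outside_pgdata(path: str, content: str) -> None:
    with open(path, 'w') as f:
        f.write(content)
    # Explicitly set the mode implied by the original umask so the tightened
    # process umask does not leave the file with 600 permissions.
    os.chmod(path, 0o666 & ~ORIGINAL_UMASK)
```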
Fixes: #3302
Currently Patroni replaces config files only if it detects a change in the global configuration + patroni.yaml; however, it could be that the configs on the filesystem were updated by humans, and we want to "restore" them.
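A rough sketch of the idea (the function name is illustrative): besides comparing the newly rendered configuration with the previously rendered one, also compare it with what is actually on disk.

```python
def config_file_needs_rewrite(path: str, desired_content: str) -> bool:
    # Rewrite not only when Patroni's desired content changed, but also when
    # someone edited (or removed) the file on the filesystem behind our back.
    try:
        with open(path) as f:
            return f.read() != desired_content
    except OSError:
        return True
```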
We should ignore the former leader with higher priority when it reports the same LSN as the current node.
This bug could be a contributing factor to issues described in #3295
In addition to that, mock the socket.getaddrinfo() call in test_api.py to avoid hitting DNS servers.
1. When evaluating whether there are healthy nodes for a leader race before demoting, we need to take quorum requirements into account. Without that, the former leader may end up in recovery surrounded by asynchronous nodes.
2. QuorumStateResolver wasn't correctly handling the case when a replica node quickly joined and disconnected, which resulted in the following errors:
```
File "/home/akukushkin/git/patroni/patroni/quorum.py", line 427, in _generate_transitions
yield from self.__remove_gone_nodes()
File "/home/akukushkin/git/patroni/patroni/quorum.py", line 327, in __remove_gone_nodes
yield from self.sync_update(numsync, sync)
File "/home/akukushkin/git/patroni/patroni/quorum.py", line 227, in sync_update
raise QuorumError(f'Sync {numsync} > N of ({sync})')
patroni.quorum.QuorumError: Sync 2 > N of ({'postgresql2'})
2025-02-14 10:18:07,058 INFO: Unexpected exception raised, please report it as a BUG
File "/home/akukushkin/git/patroni/patroni/quorum.py", line 246, in __iter__
transitions = list(self._generate_transitions())
File "/home/akukushkin/git/patroni/patroni/quorum.py", line 423, in _generate_transitions
yield from self.__handle_non_steady_cases()
File "/home/akukushkin/git/patroni/patroni/quorum.py", line 281, in __handle_non_steady_cases
yield from self.quorum_update(len(voters) - self.numsync, voters)
File "/home/akukushkin/git/patroni/patroni/quorum.py", line 184, in quorum_update
raise QuorumError(f'Quorum {quorum} < 0 of ({voters})')
patroni.quorum.QuorumError: Quorum -1 < 0 of ({'postgresql1'})
2025-02-18 15:50:48,243 INFO: Unexpected exception raised, please report it as a BUG
```
Allow defining labels that will be assigned to a Postgres instance pod when it is in the 'initializing new cluster', 'running custom bootstrap script', 'starting after custom bootstrap', or 'creating replica' state.
The first one is available starting from PostgreSQL v13 and contains the
real write LSN. We will prefer it over the value returned by
pg_last_wal_receive_lsn(), which is in fact the flush LSN.
The second one is available starting from PostgreSQL v9.6 and points to the
WAL flush position on the source host. In the case of the primary it allows a
better calculation of the replay lag, because the values stored in DCS are
updated only every loop_wait seconds.
If a logical replication slot is created with the failover => true option, the
respective field is set to true in the `pg_replication_slots` view.
By avoiding interaction with such slots we make the logical failover slots
feature fully functional in PG17.
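As a rough illustration (assuming the slot attributes were already fetched from `pg_replication_slots` into dictionaries; not the actual Patroni code):

```python
from typing import Any, Dict, List


def slots_managed_by_patroni(slots: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # In PostgreSQL 17 slots created with failover => true are synchronized to
    # standbys by the server itself, so Patroni should leave them alone.
    return [slot for slot in slots if not slot.get('failover')]
```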
It doesn't accept multiple hosts with the [] character in the URL anymore.
To mitigate the problem we switch to the native wrapper of the
PQconninfoParse() function from libpq when possible and use our own
implementation only when psycopg2 is too old.
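A hedged sketch of the dispatch (the fallback is just a placeholder; psycopg2 has exposed the libpq wrapper as `psycopg2.extensions.parse_dsn` since 2.7):

```python
def parse_dsn(dsn: str) -> dict:
    try:
        # psycopg2 >= 2.7 ships a native wrapper around libpq's PQconninfoParse()
        from psycopg2.extensions import parse_dsn as libpq_parse_dsn
        return libpq_parse_dsn(dsn)
    except ImportError:
        # psycopg2 is too old: this is where the own implementation would be used
        raise NotImplementedError('fall back to the hand-rolled DSN parser here')
```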
The problem existed because the _build_retain_slots() method was wrongly relying on members being present in DCS, while on failover the member key of the former leader expires at exactly the same time.
Consider a situation: there is a permanent logical slot and both the primary and the replica are temporarily down.
When Patroni is started on the former primary it starts Postgres in standby mode, which leads to the removal of the physical replication slot for the replica because it has xmin.
We should postpone the removal of such physical slots:
- on a replica until there is a leader in the cluster
- on the primary until Postgres is promoted
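A sketch of the resulting decision (illustrative only, not Patroni's actual code):

```python
def may_remove_physical_slot(is_primary: bool, is_promoted: bool,
                             cluster_has_leader: bool) -> bool:
    if is_primary:
        # on the (former) primary wait until Postgres is actually promoted
        return is_promoted
    # on a replica wait until the cluster has a leader again
    return cluster_has_leader
```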
- fix unit tests (logging now uses time.time_ns() instead of time.time())
- update setup.py
- update tox.ini
- enable unix and behave tests with Python 3.13
Close https://github.com/patroni/patroni/issues/3243
Test whether the config (file) parsed with yaml_load() contains a valid Mapping
object; otherwise Patroni throws an explicit exception. This also makes
the Patroni output more explicit when using that kind of "invalid"
configuration.
``` console
$ touch /tmp/patroni.yaml
$ patroni --validate-config /tmp/patroni.yaml
/tmp/patroni.yaml does not contain a dict
invalid config file /tmp/patroni.yaml
```
reportUnnecessaryIsInstance is explicitly ignored since we can't
determine what yaml_safeload can return for a YAML config (list,
dict, ...).
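A minimal sketch of the check (the exception name here is illustrative):

```python
import yaml


class ConfigError(Exception):
    pass


def load_config(path: str) -> dict:
    with open(path) as f:
        config = yaml.safe_load(f)
    # safe_load happily returns None, a list or a scalar for syntactically
    # valid YAML, but Patroni needs a mapping at the top level.
    if not isinstance(config, dict):
        raise ConfigError(f'{path} does not contain a dict')
    return config
```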
* Compatibility with python-json-logger>=3.1
After the refactoring the old API still works, but it produces warnings
and pyright fails; see the import sketch below.
Besides that, improve coverage of watchdog/base.py and ctl.py.
* Stick to ubuntu 22.04
* Please pyright
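A hedged sketch of the kind of compatibility shim this implies (module paths taken from python-json-logger's documentation; verify against the versions you target):

```python
try:
    # python-json-logger >= 3.1 provides JsonFormatter in a new module
    from pythonjsonlogger.json import JsonFormatter
except ImportError:
    # older releases only provide the legacy location
    from pythonjsonlogger.jsonlogger import JsonFormatter
```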
Patroni could be doing a replica bootstrap and we don't want pg_basebackup/wal-g/pgBackRest/barman or similar to keep running.
Besides that, remove the data directory on replica bootstrap failure if the configuration allows it.
Close #3224
When doing `patronictl restart <clustername> --pending`, the confirmation lists all members, regardless of whether their restart is really pending:
```
> patronictl restart pgcluster --pending
+ Cluster: pgcluster (7436691039717365672) ----+----+-----------+-----------------+---------------------------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart | Pending restart reason |
+--------+----------+--------------+-----------+----+-----------+-----------------+---------------------------------+
| win1 | 10.0.0.2 | Sync Standby | streaming | 8 | 0 | * | hba_file: [hidden - too long] |
| | | | | | | | ident_file: [hidden - too long] |
| | | | | | | | max_connections: 201->202 |
+--------+----------+--------------+-----------+----+-----------+-----------------+---------------------------------+
| win2 | 10.0.0.3 | Leader | running | 8 | | * | hba_file: [hidden - too long] |
| | | | | | | | ident_file: [hidden - too long] |
| | | | | | | | max_connections: 201->202 |
+--------+----------+--------------+-----------+----+-----------+-----------------+---------------------------------+
| win3 | 10.0.0.4 | Replica | streaming | 8 | 0 | | |
+--------+----------+--------------+-----------+----+-----------+-----------------+---------------------------------+
When should the restart take place (e.g. 2024-11-27T08:27) [now]:
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Are you sure you want to restart members win1, win2, win3? [y/N]:
```
When we proceed with the restart despite the scary message mentioning all members, not just the ones needing a restart, there will be an error message stating that the node that did not need a restart was indeed not restarted:
```
Are you sure you want to restart members win1, win2, win3? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Success: restart on member win1
Success: restart on member win2
Failed: restart for member win3, status code=503, (restart conditions are not satisfied)
```
The misleading confirmation message can also be seen when using the `--any` flag.
This PR fixes that.
However, we do not apply the filtering in the case of a scheduled pending restart, because the condition must be evaluated at the scheduled time.
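A rough sketch of the filtering (the `pending_restart` member attribute is assumed from the data Patroni publishes; this is not necessarily the exact patronictl code):

```python
def members_to_confirm(members, pending: bool, scheduled: bool):
    # Only list members that actually report a pending restart, unless the
    # restart is scheduled: then the condition is evaluated at the scheduled time.
    if pending and not scheduled:
        return [m for m in members if m.data.get('pending_restart')]
    return list(members)
```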
This allows setting whether a particular permanent replication slot should always be created ('cluster_type=any', the default), or only on a primary ('cluster_type=primary') or a standby ('cluster_type=standby') cluster.
When in automatic mode we probably don't need to warn the user about a failure to set up the watchdog. This is the common case, and the warning makes many users think that this feature is somehow necessary to run Patroni safely. For most users it is completely fine to run without it, and it makes sense to reduce their log spam.
1. Implemented compatibility.
2. Constrained the upper version in requirements.txt to avoid future failures.
3. Set up an additional pipeline to check with the latest ydiff.
Close #3209, close #3212, close #3218