patroni

mirror of https://github.com/outbackdingo/patroni.git synced 2026-01-28 02:20:04 +00:00

Author	SHA1	Message	Date
zhjwpku	e131065d74	rename citus_handler to mpp_handler (#2991 ) obey the following 5 meanings of terminology _cluster_ in Patroni. 1. PostgreSQL cluster: a cluster of postgresql instances which have the same system identifier. 2. MPP cluster: a cluster of PostgreSQL clusters in which one acts as the coordinator and the others act as workers. 3. Coordinator cluster: a PostgreSQL cluster which acts in the role of 'coordinator' within an MPP cluster. 4. Worker cluster: a PostgreSQL cluster which acts in the role of 'worker' within an MPP cluster. 5. Patroni cluster: any cluster managed by Patroni can be called a Patroni cluster, but we usually use this term to refer to a single PostgreSQL cluster or an MPP cluster.	2024-02-28 06:16:20 +01:00
Alexander Kukushkin	bcfd8438a5	Abstract CitusHandler and decouple it from configuration (#2950 ) the main issue was that the configuration for Citus handler and for DCS existed in two places, while ideally AbstractDCS should not know many details about what kind of MPP is in use. To solve the problem we first dynamically create an object implementing AbstractMPP interfaces, which is a configuration for DCS. Later this object is used to instantiate the class implementing AbstractMPPHandler interface. This is just a starting point, which does some heavy lifting. As a next step, all kinds of variables named after Citus in files other than patroni/postgres/mpp/citus.py should be renamed. In other words this commit takes over the most complex part of #2940, which was never implemented. Co-authored-by: zhjwpku <zhjwpku@gmail.com>	2023-12-21 08:58:26 +01:00
Alexander Kukushkin	aa3ebe0af8	Don't cache anything in Zookeeper implementation (#2909 ) Cache creates a lot of problems and prevents implementing a feature of automatic retention of physical replication slots for members with configurable retention policy. Just read the entire cluster from Zookeeper instead and use watchers only for the `/leader` and `/config` keys.	2023-10-17 08:56:31 +02:00
Alexander Kukushkin	48514db84b	Take into account current role when deciding on removal of member ZNode (#2884 ) Patroni doesn't watch on all changes of member keys in order to not create too much load on ZooKeeper, but only subscribes to changes (ZNodes added or deleted) in the `/member` directory. Therefore when some important fields in the value are updated we remove and recreate ZNode in order to notify the leader or other members. The leader should remove the member key only when the `checkpoint_after_promote` value is changed and replicas when the `state` is changed to/from `running`. We don't care about the `version` field, because Patroni version can't be changed without restart, which will cause the ZooKeeper `session_id` to change anyway. This fix hopefully will reduce failures of behave tests on GH Actions.	2023-09-26 09:12:31 +02:00
Alexander Kukushkin	9209a5a133	Refactor delete_leader interface (#2810 ) similar to https://github.com/zalando/patroni/pull/2690, but it helps mostly Consul implementation.	2023-08-11 10:19:29 +02:00
Alexander Kukushkin	7e89583ec7	Please new flake8 (#2789 ) it stopped liking lack of space character between `,` and `\` ```python foo,\ bar ```	2023-07-31 09:08:46 +02:00
Alexander Kukushkin	af8e5f0d0f	Refactor update_leader interface (#2690 ) pass reference to a last known leader object in order to avoid obtaining it from the `AbstractDCS.cluster` cache. This change is useful for Consul, Etcd3 and Zookeeper implementations.	2023-05-25 14:21:05 +02:00
Alexander Kukushkin	76b3b99de2	Enable pyright strict mode (#2652 ) - added pyrightconfig.json with typeCheckingMode=strict - added type hints to all files except api.py - added type stubs for dns, etcd, consul, kazoo, pysyncobj and other modules - added type stubs for psycopg2 and urllib3 with some little fixes - fixes most of the issues reported by pyright - remaining issues will be addressed later, along with enabling CI linting task	2023-05-09 09:38:00 +02:00
Polina Bungina	3fe2a7868a	Ignore D401 in flake8-docstrings (#2627 ) * Ignore D401 in flake8-docstrings * Fix newly reported flake8 issues, ignore the old W503 rule * rely on concatenation of adjacent strings * Format behave scripts * Reformat ha.py according to new rules Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>	2023-04-03 09:52:22 +02:00
Alexander Kukushkin	c1bfb0e6d6	Remove python 2.7 support (#2571 ) - get rid from 2.7 specific modules: `six`, `ipaddress` - use Python3 unpacking operator - use `shutil.which()` instead of `find_executable()`	2023-03-13 17:00:04 +01:00
Alexander Kukushkin	4872ac51e0	Citus integration (#2504 ) Citus cluster (coordinator and workers) will be stored in DCS as a fleet of Patroni logically grouped together: ``` /service/batman/ /service/batman/0/ /service/batman/0/initialize /service/batman/0/leader /service/batman/0/members/ /service/batman/0/members/m1 /service/batman/0/members/m2 /service/batman/ /service/batman/1/ /service/batman/1/initialize /service/batman/1/leader /service/batman/1/members/ /service/batman/1/members/m1 /service/batman/1/members/m2 ... ``` Where 0 is a Citus group for coordinator and 1, 2, etc are worker groups. Such hierarchy allows reading the entire Citus cluster with a single call to DCS (except Zookeeper). The get_cluster() method will be reading the entire Citus cluster on the coordinator because it needs to discover workers. For the worker cluster it will be reading the subtree of its own group. Besides that we introduce a new method get_citus_coordinator(). It will be used only by worker clusters. Since there is no hierarchical structures on K8s we will use the citus group suffix on all objects that Patroni creates. E.g. ``` batman-0-leader # the leader config map for the coordinator batman-0-config # the config map holding initialize, config, and history "keys" ... batman-1-leader # the leader config map for worker group 1 batman-1-config ... ``` Citus integration is enabled from patroni.yaml: ```yaml citus: database: citus group: 0 # 0 is for coordinator, 1, 2, etc are for workers ``` If enabled, Patroni will create the database, citus extension in it, and INSERTs INTO `pg_dist_authinfo` information required for Citus nodes to communicate between each other, i.e. 'password', 'sslcert', 'sslkey' for superuser if they are defined in the Patroni configuration file. When the new Citus coordinator/worker is bootstrapped, Patroni adds `synchronous_mode: on` to the `bootstrap.dcs` section. 
Besides that, Patroni takes over management of some Postgres GUCs: - `shared_preload_libraries` - Patroni ensures that "citus" is added in the first place - `max_prepared_transactions` - if not set or set to 0, Patroni changes the value to `max_connections*2` - `wal_level` - automatically set to logical. It is used by Citus to move/split shards. Under the hood Citus is creating/removing replication slots and they are automatically added by Patroni to the `ignore_slots` configuration to avoid accidental removal. The coordinator primary actively discovers worker primary nodes and registers/updates them in the `pg_dist_node` table using citus_add_node() and citus_update_node() functions. Patroni running on the coordinator provides the new REST API endpoint: `POST /citus`. It is used by workers to facilitate controlled switchovers and restarts of worker primaries. When the worker primary needs to shut down Postgres because of restart or switchover, it calls the `POST /citus` endpoint on the coordinator and the Patroni on the coordinator starts a transaction and calls `citus_update_node(nodeid, 'host-demoted', port)` in order to pause client connections that work with the given worker. Once the new leader is elected or postgres is started back, they perform another call to the `POST /citus` endpoint, which does another `citus_update_node()` call with the actual hostname and port and commits the transaction. After the transaction is committed, the coordinator reestablishes connections to the worker node and client connections are unblocked. If clients don't run long transactions the operation finishes without client visible errors, but only a short latency spike. All operations on the `pg_dist_node` are serialized by Patroni on the coordinator. This gives more control and makes it possible to ROLLBACK a transaction in progress if its lifetime exceeds a certain threshold and there are other worker nodes that should be updated.	2023-01-24 16:14:58 +01:00
Alexander Kukushkin	92d3e1c167	Introduce the failsafe key in DCS (#2485 ) Extracted from #2379	2022-12-13 11:35:06 +01:00
Alexander Kukushkin	6ad5fee99d	Raise DCSError when communication with DCS fails (#2484 ) Previously such an exception was raised only from the `get_cluster()` method, and now we will do the same from the `update_leader()` and `attempt_to_acquire_leader()` methods. These methods influence Postgres promotion and demotion and we want to distinguish between different types of failures, specifically whether calls have failed because DCS isn't accessible or due to a timeout. This commit is extracted from the #2379	2022-12-13 11:06:55 +01:00
Alexander Kukushkin	531063f676	Compatibility with kazoo-2.9.0 (#2428 ) Now the select() method may raise `TypeError` and `IOError` exceptions if the socket is closed.	2022-10-13 09:18:06 +02:00
Alexander Kukushkin	cb3071adfb	Annual cleanup (#2159 ) - Simplify setup.py: remove unneeded features and get rid of deprecation warnings - Compatibility with Python 3.10: handle `threading.Event.isSet()` deprecation - Make sure setup.py could run without `six`: move Patroni class and main function to the `__main__.py`. The `__init__.py` will have only a few functions used by the Patroni class and from the setup.py	2022-01-06 10:20:31 +01:00
Alexander Kukushkin	dc9ff4cb8a	Release 2.1.2 (#2136 ) * Implement missing unit-tests * Bump version * Update release notes	2021-12-03 15:49:57 +01:00
Alexander Kukushkin	63ee42a85c	Clear event on the leader node when /status was updated (#2125 ) Not doing so was causing excessive HA loop runs with Zookeeper. This wasn't fixed correctly in #1875	2021-11-30 16:33:38 +01:00
Alexander Kukushkin	00d125c512	Avoid unnecessary updates of the members ZNode. (#2115 ) When deciding whether the ZNode should be updated we rely on the cached version of the cluster, which is updated only when members ZNodes are deleted/created or the `/status`, `/sync`, `/failover`, `/config`, or `/history` ZNodes are updated. I.e. after the update of the current member ZNode succeeded the cache becomes stale and all further updates are always performed even if the value didn't change. In order to solve it, we introduce the new attribute in the Zookeeper class and will use it for memorizing the actual value and for later comparison.	2021-11-12 15:00:54 +01:00
Alexander Kukushkin	77382e75dc	Compatibility with kazoo-2.7+ (#1982 ) Old versions of `kazoo` immediately discarded all requests to Zookeeper if the connection is in the `SUSPENDED` state. This is absolutely fine because Patroni is handling retries on its own. Starting from 2.7, kazoo started queueing requests instead of discarding and as a result, the Patroni HA loop was getting stuck until the connection to Zookeeper is reestablished, causing no demote of the Postgres. In order to return to the old behavior we override the `KazooClient._call()` method. In addition to that, we ensure that the `Postgresql.reset_cluster_info_state()` method is called even if DCS failed (the order of calls was changed in the #1820). Close https://github.com/zalando/patroni/issues/1981	2021-06-30 09:11:27 +02:00
Alexander Kukushkin	c7173aadd7	Failover logical slots (#1820 ) Effectively, this PR consists of a few changes: 1. The easy part: In case permanent logical slots are defined in the global configuration, Patroni on the primary will not only create them, but also periodically update DCS with the current values of `confirmed_flush_lsn` for all these slots. In order to reduce the number of interactions with DCS the new `/status` key was introduced. It will contain the json object with `optime` and `slots` keys. For backward compatibility the `/optime/leader` will be updated if there are members with old Patroni in the cluster. 2. The tricky part: On replicas that are eligible for a failover, Patroni creates the logical replication slot by copying the slot file from the primary and restarting the replica. In order to copy the slot file Patroni opens a connection to the primary with `rewind` or `superuser` credentials and calls `pg_read_binary_file()` function. When the logical slot already exists on the replica Patroni periodically calls `pg_replication_slot_advance()` function, which allows moving the slot forward. 3. Additional requirements: In order to ensure that primary doesn't cleanup tuples from pg_catalog that are required for logical decoding, Patroni enables `hot_standby_feedback` on replicas with logical slots and on cascading replicas if they are used for streaming by replicas with logical slots. 4. When logical slots are copied to the replica there is a timeframe when it might not be safe to use them after promotion. Right now there is no protection from promoting such a replica. But, Patroni will show the warning with names of the slots that might not be safe to use. Compatibility. The `pg_replication_slot_advance()` function is only available starting from PostgreSQL 11. For older Postgres versions Patroni will refuse to create the logical slot on the primary. 
The old "permanent slots" feature, which creates logical slots right after promotion and before allowing connections, was removed. Close: https://github.com/zalando/patroni/issues/1749	2021-03-25 16:18:23 +01:00
Alexander Kukushkin	e3ef9ac306	Fix issues with zookeeper (#1792 ) 1. The `ttl` was incorrectly returned 1000 times higher than it should be 2. The `watch()` method must return True if the parent method returned True. Not doing so resulted in the incorrect calculation of sleep time. 3. Move mock of exhibitor api to the features/environment.py. It simplifies testing with behave.	2020-12-14 15:12:57 +01:00
Alexander Kukushkin	04b9fb9dd4	Make sure cached last_leader_operation is up-to-date on replicas (#1600 ) Patroni is caching the cluster view in the DCS object because not all operations require the most up-to-date values. The cached version is valid for TTL seconds. So far it worked quite well, the only known problem was that the `last_leader_operation` for some DCS implementations was not very up-to-date: * Etcd: since the `/optime/leader` key is updated right after the `/leader` key, usually all replicas get the value from the previous HA loop. Therefore the value is somewhere between `loop_wait` and `loop_wait*2` old. We improve it by using the 10ms artificial sleep after receiving watch notification from `compareAndSwap` operation on the leader key. It usually gives enough time for the primary to update the `/optime/leader`. On average that makes the cached version `loop_wait/2` old. * ZooKeeper: Patroni itself is not so much interested in most up-to-date values of member and leader/optime ZNodes. In case of the leader race it just reads everything from ZooKeeper, but during normal operation it is relying on cache. In order to see the recent value on replicas they are doing watch on the `leader/optime` Znode and will re-read it after it was updated by the primary. On average that makes the cached version `loop_wait/2` old. * Kubernetes: last_leader_operation is stored in the same object as the leader key itself and therefore update is atomic and we always see the latest version. That makes the cached version `loop_wait/2` old on avg. * Consul: HA loops on the primary and replicas are not synchronized, therefore at the moment when we read the cluster state from the Consul KV we see the last_leader_operation value that is between 0 and loop_wait old. On average that makes the cached version `loop_wait` old. Unfortunately we can't make it much better without performing periodic updates from Consul, which might have negative side effects. 
Since the `optime/leader` is only updated at most once per HA loop cycle, the value stored in the DCS is usually `loop_wait/2` old on avg. For majority of DCS implementations we could promise that the cached version in Patroni will match the value in DCS most of the time, therefore there is no need to make additional requests. The only exception is Consul, but probably we could just document it, so when someone relying on last_leader_operation value to check the replication lag can correspondingly adjust thresholds. Will help to implement #1599	2020-07-15 10:31:32 +02:00
Alexander Kukushkin	c2a78ee652	Bugfix: GET /cluster was showing stale member info in zookeeper (#1573 ) Zookeeper implementation heavily relies on cached version of the cluster view in order to minimize the number of requests. Having stale members information is fine for Patroni workflow because it basically relies only on member names and tags. The `GET /cluster` is a different case. Being exposed outside it might be used for monitoring purposes and therefore we should show the up-to-date members information.	2020-06-05 09:23:54 +02:00
Alexander Kukushkin	680444ae13	Reduce lock time taken by dcs.get_cluster() (#989 ) `dcs.cluster` and `dcs.get_cluster()` are using the same lock resource and therefore when get_cluster call is slow due to the slowness of DCS it was also affecting the `dcs.cluster` call, which in turn was making health-check requests slow.	2019-03-12 22:37:11 +01:00
Alexander Kukushkin	9bf074acfb	Compatibility with python3 (#883 ) Change of `loop_wait` was causing Patroni to disconnect from zookeeper and never reconnect back. The error was happening only with python3 due to a difference in implementation of `select.select` function.	2018-11-30 11:40:34 +01:00
Alexander Kukushkin	fb01aaebc5	Compatibility with kazoo-2.6.0 (#872 ) Recently 2.6.0 was released, which changes the way the create_connection method is called. Before it was passing two arguments, and in the new version all argument names are specified explicitly.	2018-11-19 14:26:20 +01:00
Alexander Kukushkin	4ca8a6e506	Make retries of calls to DCS consistent across implementations (#805 ) in addition to that do a small refactoring of zookeeper and consul and try to improve the stability of AT	2018-09-06 08:37:26 +02:00
Alexander Kukushkin	03c2a85d23	Expose current timeline in DCS and via API (#591 ) It is very easy to get current timeline on the master by executing ```sql SELECT ('x' || SUBSTR(pg_walfile_name(pg_current_wal_lsn()), 1, 8))::bit(32)::int ``` Unfortunately the same method doesn't work when postgres is_in_recovery. Therefore we will use replication connection for that on the replicas. In order to avoid opening and closing replication connection on every HA loop we will cache the result if its value matches with the timeline of the master. Also this PR introduces a new key in DCS: `/history`. It will contain a json serialized object with timeline history in a format similar to the usual history files. The differences are: * Second column is the absolute wal position in bytes, instead of LSN * Optionally there might be a fourth column - timestamp, (mtime of history file)	2018-01-05 15:25:56 +01:00
Alexander Kukushkin	4328c15010	Make Patroni Kubernetes native (#500 ) * Use ConfigMaps or Endpoints for leader elections and to keep cluster state * Label pods with a postgres role * change behavior of pip install. From now on it will not install all dependencies, you have to specify explicitly DCS you want to use Patroni with: `pip install patroni[etcd,zookeeper,kubernetes]`	2017-12-08 16:55:00 +01:00
Alexander Kukushkin	038b5aed72	Improve leader watch functionality (#356 ) Previously replicas were always watching for leader key (even if the postgres was not in the running there). It was not a big issue, but it was not possible to interrupt such watch in cases if the postgres started up or stopped successfully. Also it was delaying update_member call and we had kind of stale information in DCS up to `loop_wait` seconds. This commit changes such behavior. If the async_executor is busy by starting/stopping or restarting postgres we will not watch for leader key but waiting for event from async_executor up to `loop_wait` seconds. Async executor will fire such event only in case if the function it was calling returned something what could be evaluated to boolean True. Such functionality is really needed to change the way how we are making decision about necessity of pg_rewind. It will require to have a local postgres running and for us it is really important to get such notification as soon as possible.	2016-11-22 16:22:30 +01:00
Ants Aasma	7e53a604d4	Add synchronous replication support. (#314 ) Adds a new configuration variable synchronous_mode. When enabled Patroni will manage synchronous_standby_names to enable synchronous replication whenever there are healthy standbys available. With synchronous mode enabled Patroni will automatically fail over only to a standby that was synchronously replicating at the time of master failure. This effectively means zero lost user visible transactions. To enforce the synchronous failover guarantee Patroni stores current synchronous replication state in the DCS, using strict ordering, first enable synchronous replication, then publish the information. Standby can use this to verify that it was indeed a synchronous standby before master failed and is allowed to fail over. We can't enable multiple standbys as synchronous, allowing PostgreSQL to pick one because we can't know which one was actually set to be synchronous on the master when it failed. This means that on standby failure commits will be blocked on the master until next run_cycle iteration. TODO: figure out a way to poke Patroni to run sooner or allow for PostgreSQL to pick one without the possibility of lost transactions. On graceful shutdown standbys will disable themselves by setting a nosync tag for themselves and waiting for the master to notice and pick another standby. This adds a new mechanism for Ha to publish dynamic tags to the DCS. When the synchronous standby goes away or disconnects a new one is picked and Patroni switches master over to the new one. If no synchronous standby exists Patroni disables synchronous replication (synchronous_standby_names=''), but not synchronous_mode. In this case, only the node that was previously master is allowed to acquire the leader lock. Added acceptance tests and documentation. Implementation by @ants with extensive review by @CyberDem0n.	2016-10-19 16:12:51 +02:00
Alexander Kukushkin	5fe74bec3b	Make different kazoo timeouts depend on loop_wait (#243 ) * Make different kazoo timeouts dependent on loop_wait ping timeout ~ 1/2 * loop_wait connect_timeout ~ 1/2 * loop_wait Originally these values were calculated from negotiated session timeout and didn't work very well, because it was taking significant time to figure out that connection is dead and reconnect (up to session timeout) and not giving us time to retry. * Address the code review	2016-08-10 10:15:09 +02:00
Alexander Kukushkin	f7c6bd4eab	Implement different connect strategy for zookeeper Originally it was trying to connect during session_timeout time. Such strategy doesn't work well during short network hiccups...	2016-07-01 12:31:29 +02:00
Alexander Kukushkin	5f4e582660	Merge branch 'master' of github.com:zalando/patroni into feature/dynamic-configuration	2016-06-09 11:04:28 +02:00
Alexander Kukushkin	50d118c3aa	Split ZooKeeper and Exhibitor Originally Exhibitor was supported in the ZooKeeper class and configuration for Exhibitor was taken also from `zookeeper` section in the yaml config file. In fact, Exhibitor just extends ZooKeeper and now it is reflected in the code and also Exhibitor got its own section in the config.yaml file. It will make it easier to configure Exhibitor hosts and port via environment variables when PR#211 is merged.	2016-06-08 19:21:18 +02:00
Alexander Kukushkin	b3ada161cf	Implement possibility to configure `retry_timeout` globally Previously it was hardcoded all over the place.	2016-05-31 10:30:53 +02:00
Alexander Kukushkin	7827951c8c	Dynamic configuration	2016-05-25 14:17:05 +02:00
Alexander Kukushkin	6104d688d9	Merge branch 'master' of github.com:zalando/patroni into feature/sighup	2016-05-19 14:27:04 +02:00
Alexander Kukushkin	0c2aad98a3	Move dcs implementations into dcs package	2016-05-19 10:57:18 +02:00
Alexander Kukushkin	1741fa7e0f	Minimize number of references to dcs implementations from tests where it is not necessary (test_ha, test_ctl, etc...) It will simplify further refactoring and make it possible to install implementations of AbstractDCS independent of each other.	2016-05-19 10:00:32 +02:00
Alexander Kukushkin	d422e16aad	Implement reload of config.yaml on SIGHUP If some changes require restart of postgres patroni will expose `restart_pending` flag in DCS and via REST API	2016-05-13 13:31:21 +02:00
Alexander Kukushkin	0d3dca56ff	In some cases Ha.cluster can be None after calling `get_cluster` Such situation is causing patroni crash. Usually it was happening during manual failover, after former master has demoted and `reset_cluster` method has been called. In this case `fetch_cluster` was `False` and `_load_cluster` method was returning value from `self._cluster`, which was `None`.	2016-03-24 12:06:39 +01:00
Alexander Kukushkin	3a7d2c3874	Remove unused code from unit tests	2016-03-21 20:48:17 +01:00
Alexander Kukushkin	54055c1ff8	Rename ambiguous `Failover.member` to candidate But! 'member' is still accepted by REST API and also name 'member' is used to store/read this value to/from DCS (for backward compatibility)	2016-03-18 15:59:47 +01:00
Alexander Kukushkin	0e0c8ed8d7	Implement `delete_cluster` interface for all available dcs In addition to that rename confusing `Etcd.client` and `ZooKeeper.client` into `_client`. This attribute is available from AbstractDCS and people had wrong impression that it provides the same interface for different DCS implementations, which is obviously not the case. For Etcd it has type etcd.Client and for ZooKeeper - KazooClient.	2016-03-15 16:25:48 +01:00
Alexander Kukushkin	df9b8fed2e	Improve quality of code by resolving issues found by quantifiedcode and codacy	2016-02-12 12:23:49 +01:00
Alexander Kukushkin	a6603e8b48	bugfix in zookeeper module: when master node was being attached to patroni/zookeeper (no cluster in zookeeper yet) patroni has never tried to "refetch" cluster from DCS. It was leading to demote...	2015-10-08 13:07:38 +02:00
Alexander Kukushkin	8a844285ff	Set fetch_cluster flag to False when _inner_load_cluster called Set the same flag to True if the cluster does not yet exist in ZooKeeper	2015-10-07 16:48:39 +02:00
Alexander Kukushkin	d8f4b09478	use Event.wait instead of sleep it makes possible to break "sleep" for example from API plus small bugfix: catch ValueError exception from json.loads	2015-10-02 10:26:48 +02:00
Alexander Kukushkin	d09875a056	refactoring: 1. run touch_member from the main loop 2. move code which takes care about long tasks into separate class 3. change format of data stored in a DCS: use json instead of url 4. change Member class: from now it deserializes everything into data property 5. rework API: from now it takes into account state of the current node in a dcs	2015-10-01 17:06:42 +02:00
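
Commit d8f4b09478 above swaps a plain sleep for `Event.wait` so that the pause between loop iterations can be interrupted, for example from the API. The following is a minimal sketch of that idea using only the standard library. The class name `InterruptibleLoop` and its methods are hypothetical and are not taken from Patroni's code.

```python
import threading
import time


class InterruptibleLoop:
    """Hypothetical stand-in for a main loop; not Patroni's actual class."""

    def __init__(self, loop_wait: float) -> None:
        self.loop_wait = loop_wait
        self._wakeup = threading.Event()

    def wakeup(self) -> None:
        # Called from another thread (e.g. an API handler) to cut the wait short.
        self._wakeup.set()

    def wait(self) -> bool:
        # Blocks for at most loop_wait seconds; returns True if woken early.
        woken = self._wakeup.wait(self.loop_wait)
        # Reset the flag so the next iteration waits again.
        self._wakeup.clear()
        return woken


def demo() -> tuple:
    loop = InterruptibleLoop(loop_wait=5.0)
    # Simulate an external request arriving 0.1 seconds into the wait.
    threading.Timer(0.1, loop.wakeup).start()
    started = time.monotonic()
    woken = loop.wait()
    return woken, time.monotonic() - started
```

With `time.sleep(5.0)` the caller would always block for the full five seconds. With the event, `demo()` returns as soon as `wakeup()` is called from the timer thread.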

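
The SQL in commit 03c2a85d23 above reads the timeline from the first 8 hexadecimal characters of the WAL file name. The same arithmetic can be sketched in Python as shown here. Both function names are hypothetical. The LSN conversion assumes that "absolute wal position in bytes" means the usual `X/Y` LSN notation interpreted as `(X << 32) + Y`, which the commit message does not spell out.

```python
import string


def timeline_from_walfile_name(name: str) -> int:
    """Return the timeline encoded in a 24-character WAL file name.

    Mirrors ('x' || SUBSTR(name, 1, 8))::bit(32)::int from the commit message.
    """
    if len(name) != 24 or any(c not in string.hexdigits for c in name):
        raise ValueError("expected 24 hexadecimal characters: %r" % name)
    # The first 8 hex digits are the timeline ID.
    return int(name[:8], 16)


def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN such as '0/3000060' to an absolute byte position.

    Assumption: high and low 32-bit halves are separated by '/'.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)
```

For example, `timeline_from_walfile_name("000000010000000000000003")` yields timeline 1, and `lsn_to_bytes("1/0")` yields 4294967296.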

71 Commits