* Convert postgresql.py into a package
* Factor out cancellable process into a separate class
* Factor out connection handler into a separate class
* Move postmaster into postgresql package
* Factor out pg_rewind into a separate class
* Factor out bootstrap into a separate class
* Factor out slots handler into a separate class
* Factor out postgresql config handler into a separate class
* Move callback_executor into postgresql package
This is just a careful refactoring, without functional changes.
If `etcd.use_proxies` is set to true, Patroni will stick to the list of hosts specified in `etcd.hosts` and avoid doing topology discovery. Such a mode might be useful when you know that you connect to the etcd cluster via a set of proxies, or when the etcd cluster has a static topology.
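For illustration, the relevant configuration fragment could look roughly like the one below (host names and ports are made up; the YAML is parsed with PyYAML only to keep the example runnable):
```python
import yaml  # PyYAML, which Patroni already uses for its configuration files

# Illustrative only: etcd reached through two proxies with a static topology.
config = yaml.safe_load("""
etcd:
  use_proxies: true
  hosts:
    - etcd-proxy1.example.com:2379
    - etcd-proxy2.example.com:2379
""")

assert config['etcd']['use_proxies'] is True
```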
* Use `shutil.move` instead of `os.replace`, which is only available since Python 3.3
* Introduce standby-leader health-check and consul service
* Improve unit tests, some lines were not covered
* Rename `assertEquals` -> `assertEqual` due to a deprecation warning
We already have a lot of logic in place to prevent failover in such a case and restore all keys, but an accidental removal of the `/config` key was effectively switching off pause mode for one cycle of the HA loop.
It is very easy to get the current timeline on the master by executing:
```sql
SELECT ('x' || SUBSTR(pg_walfile_name(pg_current_wal_lsn()), 1, 8))::bit(32)::int
```
Unfortunately the same method doesn't work when postgres is in recovery. Therefore we will use a replication connection for that on the replicas. In order to avoid opening and closing a replication connection on every HA loop, we will cache the result if its value matches the timeline of the master.
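A minimal sketch of the replica-side lookup, assuming psycopg2's replication-connection support; the `replica_timeline` helper and the DSN handling are illustrative, not Patroni's actual code:
```python
import psycopg2
import psycopg2.extras


def replica_timeline(dsn):
    # IDENTIFY_SYSTEM over a replication connection returns
    # (systemid, timeline, xlogpos, dbname); the second field is what we need.
    conn = psycopg2.connect(dsn, connection_factory=psycopg2.extras.PhysicalReplicationConnection)
    try:
        with conn.cursor() as cur:
            cur.execute('IDENTIFY_SYSTEM')
            return cur.fetchone()[1]
    finally:
        conn.close()
```
The cached value only needs to be re-checked while it still differs from the master's timeline, so the replication connection is not reopened on every HA loop.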
Also this PR introduces a new key in the DCS: `/history`. It will contain a JSON-serialized object with the timeline history in a format similar to the usual history files. The differences are (see the example below):
* The second column is the absolute WAL position in bytes instead of an LSN
* Optionally there may be a fourth column with a timestamp (the mtime of the history file)
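For illustration only (the positions, reasons and timestamps below are made up), the key could contain something like:
```python
import json

# Each entry: [timeline, absolute WAL position in bytes, reason, optional timestamp]
history = [
    [1, 25395856, 'no recovery target specified', '2018-03-07T11:42:56+00:00'],
    [2, 52127376, 'no recovery target specified'],
]
print(json.dumps(history))
```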
Make it possible to cancel a running task if you want to reinitialize a replica.
There are two possible ways to trigger it:
1. patronictl will ask whether you want to cancel an already running task if an attempt to trigger reinitialize has failed
2. if you are using the `--force` argument with `patronictl reinit`
This list will be used for initial discovery of etcd cluster members.
If for some reason this list of hosts becomes exhausted at run time, Patroni will fall back to the initial list.
In addition to that, improve IPv6 compatibility by using a dedicated function for splitting host and port.
Fixes https://github.com/zalando/patroni/issues/523
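A minimal sketch of such a splitting helper (illustrative only, not Patroni's actual implementation):
```python
def split_host_port(value, default_port):
    """Split 'host', 'host:port', '[ipv6]' or '[ipv6]:port' into (host, port)."""
    if value.startswith('['):
        # Bracketed IPv6, possibly followed by a port: '[::1]:2379' or '[::1]'
        host, _, rest = value[1:].partition(']')
        return host, int(rest[1:]) if rest.startswith(':') else default_port
    if value.count(':') == 1:
        host, port = value.split(':')
        return host, int(port)
    # Bare hostname/IPv4 without a port, or a bare (unbracketed) IPv6 address
    return value, default_port


assert split_host_port('[2001:db8::2]:2379', 2379) == ('2001:db8::2', 2379)
assert split_host_port('etcd1.example.com', 2379) == ('etcd1.example.com', 2379)
```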
* Use ConfigMaps or Endpoints for leader elections and to keep cluster state
* Label pods with a postgres role
* Change the behavior of `pip install`. From now on it will not install all dependencies; you have to explicitly specify the DCS you want to use Patroni with: `pip install patroni[etcd,zookeeper,kubernetes]`
When all etcd servers refuse connections during a watch, the call will fail with an exception and will be immediately retried. This creates a huge amount of log spam, potentially creating additional issues on top of losing the DCS. This patch takes note when etcd failures are repeating and, starting from the second failure, sleeps for a second before retrying. It additionally omits the stack trace after the first failure in a streak of failures.
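A rough sketch of the intended behaviour; the function name and structure are illustrative, not the actual patch:
```python
import logging
import time

logger = logging.getLogger(__name__)


def watch_with_backoff(do_watch):
    failures = 0
    while True:
        try:
            return do_watch()
        except Exception:
            failures += 1
            if failures == 1:
                # First failure in a streak: log the full stack trace once.
                logger.exception('watch failed')
            else:
                # Repeating failures: no stack trace, and back off for a second
                # so a dead DCS does not flood the logs.
                logger.error('watch failed again (%d consecutive failures)', failures)
                time.sleep(1)
```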
Replacing hostnames with IP addresses was causing certificate
verification to fail. Instead of doing that, we will rather monkey
patch the urllib3 functionality which does name resolution. It should
work without problems even for https connections.
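A minimal sketch of this kind of monkey patch, assuming a hypothetical `_dns_cache` mapping maintained elsewhere; the real implementation may differ:
```python
import urllib3.util.connection

_dns_cache = {}  # hypothetical: hostname -> already resolved IP address

_orig_create_connection = urllib3.util.connection.create_connection


def _patched_create_connection(address, *args, **kwargs):
    host, port = address
    # Substitute the cached IP at the socket level only; the hostname used for
    # TLS certificate verification higher up the stack stays unchanged.
    return _orig_create_connection((_dns_cache.get(host, host), port), *args, **kwargs)


urllib3.util.connection.create_connection = _patched_create_connection
```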
Previously pg_ctl waited for a timeout and then happily carried on, considering PostgreSQL to be running. This caused PostgreSQL to show up in listings as running when it was actually not, and caused a race condition that resulted in either a failover, or a crash recovery, or a crash recovery interrupted by a failover and a missed rewind.
This change adds a master_start_timeout parameter and introduces a new state for the main run_cycle loop: starting. When master_start_timeout is zero, we will fail over as soon as there is a failover candidate. Otherwise PostgreSQL will be started, but once master_start_timeout expires we will stop it and release the leader lock if failover is possible. Once the failover succeeds or fails (no leader and no one to take the role), we continue with normal processing. While we are waiting for the master start timeout we handle manual failover requests.
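A rough sketch of how master_start_timeout might drive that decision; all names here are illustrative, not Patroni's actual internals:
```python
def handle_starting_master(postgres, cluster, master_start_timeout, time_in_state):
    # Hypothetical helpers: `postgres` wraps the local instance,
    # `cluster` knows whether a healthy failover candidate exists.
    if postgres.is_running_as_master():
        return 'running'  # back to normal processing
    if cluster.has_failover_candidate() and (
            master_start_timeout == 0 or time_in_state > master_start_timeout):
        postgres.stop()
        return 'release leader lock and fail over'
    return 'starting'  # keep waiting; manual failover requests are still handled
```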
* Introduce timeout parameter to restart.
When the restart timeout is set, the master becomes eligible for failover after that timeout expires, regardless of master_start_timeout. Immediate restart calls will wait for this timeout to pass, even when the node is a standby.
* Add https and auth support for etcd
Also implement support for the PATRONI_ETCD_URL and PATRONI_ETCD_SRV
environment variables
* Implement etcd.proxy, etcd.cacert, etcd.cert and etcd.key support
Now it should be possible to set up a fully encrypted connection to
etcd with authorization (see the example below).
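Illustrative only — one way these settings might be combined; the exact key names and values should be checked against the Patroni documentation, and the credentials and paths below are made up:
```python
import yaml

config = yaml.safe_load("""
etcd:
  protocol: https
  hosts:
    - etcd1.example.com:2379
    - etcd2.example.com:2379
  username: patroni          # assumption: basic-auth credentials
  password: secret
  cacert: /etc/ssl/etcd/ca.pem
  cert: /etc/ssl/etcd/client.pem
  key: /etc/ssl/etcd/client-key.pem
""")
```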
Previously replicas were always watching the leader key (even if
postgres was not running there). It was not a big issue, but it was
not possible to interrupt such a watch when postgres had started up
or stopped successfully. It was also delaying the update_member call,
so we had somewhat stale information in the DCS for up to `loop_wait`
seconds. This commit changes that behavior. If the async_executor is
busy starting, stopping or restarting postgres, we will not watch the
leader key but instead wait for an event from the async_executor for
up to `loop_wait` seconds. The async executor will fire such an event
only if the function it was calling returned something that evaluates
to boolean True.
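A minimal sketch of that mechanism using `threading.Event`; the class and attribute names are illustrative, not the actual implementation:
```python
import threading


class AsyncExecutor(object):
    def __init__(self):
        self.event = threading.Event()

    def run_async(self, func, *args):
        self.event.clear()
        threading.Thread(target=self._wrapper, args=(func,) + args).start()

    def _wrapper(self, func, *args):
        result = None
        try:
            result = func(*args)
        finally:
            # Wake the main loop early only when the finished task reports
            # success (anything that evaluates to boolean True).
            if result:
                self.event.set()


# In the HA loop: while the executor is busy we do not watch the leader key,
# we just wait for the "finished" event for up to loop_wait seconds:
#     executor.event.wait(loop_wait)
```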
Such functionality is really needed to change the way we decide
whether pg_rewind is necessary. That decision requires a local
postgres to be running, and for us it is really important to get such
a notification as soon as possible.
Adds a new configuration variable synchronous_mode. When enabled, Patroni will manage synchronous_standby_names to enable synchronous replication whenever there are healthy standbys available. With synchronous mode enabled, Patroni will automatically fail over only to a standby that was synchronously replicating at the time of the master failure. This effectively means zero lost user-visible transactions.
To enforce the synchronous failover guarantee, Patroni stores the current synchronous replication state in the DCS using strict ordering: first enable synchronous replication, then publish the information. A standby can use this to verify that it was indeed a synchronous standby before the master failed and is therefore allowed to fail over.
We can't enable multiple standbys as synchronous, allowing PostgreSQL to pick one, because we can't know which one was actually set to be synchronous on the master when it failed. This means that on standby failure commits will be blocked on the master until the next run_cycle iteration. TODO: figure out a way to poke Patroni to run sooner, or allow PostgreSQL to pick one without the possibility of lost transactions.
On graceful shutdown standbys will disable themselves by setting a nosync tag for themselves and waiting for the master to notice and pick another standby. This adds a new mechanism for Ha to publish dynamic tags to the DCS.
When the synchronous standby goes away or disconnects, a new one is picked and the master switches over to it. If no synchronous standby exists, Patroni disables synchronous replication (synchronous_standby_names=''), but not synchronous_mode. In this case, only the node that was previously the master is allowed to acquire the leader lock.
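A rough sketch of the ordering constraint described above; all function and attribute names are illustrative, not Patroni's actual code:
```python
def switch_synchronous_standby(postgres, dcs, standby_name):
    # 1. Make the standby synchronous on the master first ...
    postgres.set_synchronous_standby_names(standby_name)
    # 2. ... and only then publish it in the DCS, so a standby can never see
    #    itself advertised as synchronous before it actually is.
    dcs.write_sync_state(leader=postgres.name, sync_standby=standby_name)


def may_acquire_leader_lock(dcs, my_name):
    sync = dcs.get_sync_state()
    # Either we were the advertised synchronous standby, or we are the former
    # leader and no synchronous standby is currently advertised.
    return my_name == sync.get('sync_standby') or (
        not sync.get('sync_standby') and my_name == sync.get('leader'))
```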
Added acceptance tests and documentation.
Implementation by @ants with extensive review by @CyberDem0n.
This error is sent by etcd when Patroni is watching the leader key,
which is never updated after its creation, while the etcd cluster
receives a lot of updates, which cleans the history of events.
Instead of watching on modifiedIndex + 1 we will watch on
X-Etcd-Index, which is probably still available...
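A minimal sketch of that fallback, assuming python-etcd; error handling beyond these two exceptions is omitted and the function name is illustrative:
```python
import etcd


def watch_key(client, key, wait_index, timeout):
    try:
        return client.watch(key, index=wait_index, timeout=timeout)
    except etcd.EtcdWatchTimedOut:
        return None
    except etcd.EtcdEventIndexCleared:
        # Our index was already purged from the event history: re-read the key
        # and continue watching from the cluster-wide X-Etcd-Index (exposed as
        # EtcdResult.etcd_index) instead of modifiedIndex + 1.
        result = client.read(key)
        return client.watch(key, index=result.etcd_index + 1, timeout=timeout)
```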
There is no point in trying to update the topology until the original
request has been performed. Also, for us it is more important to
execute the original request than to keep the topology of the etcd
cluster in sync.
In addition to that, implement the same retry-timeout logic in the
`machines` property that is already used in the `api_execute` method.
The Client class takes care of retrying when the connection to an
etcd node fails. It calculates the number of retries and the timeout
depending on the etcd cluster size.
The Etcd class should not retry when an EtcdConnectionFailed exception
is raised (this case is already handled in the Client).
Besides that, adjust the retry timeouts in the Client class.
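Illustrative only — a retry loop in the spirit of what is described; the real Client derives its numbers differently, and the names below are made up:
```python
import time


def execute_on_any_machine(machines, request, total_timeout):
    deadline = time.time() + total_timeout
    # Give every known etcd member an equal share of the overall deadline so
    # that a single dead node cannot eat the whole retry budget.
    per_node = total_timeout / max(len(machines), 1)
    last_exc = None
    for node in machines:
        remaining = deadline - time.time()
        if remaining <= 0:
            break
        try:
            return request(node, timeout=min(per_node, remaining))
        except ConnectionError as exc:  # stand-in for EtcdConnectionFailed
            last_exc = exc              # move on to the next node
    raise last_exc or TimeoutError('no etcd node answered in time')
```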
where it is not necessary (test_ha, test_ctl, etc...)
It will simplify further refactoring and make it possible to install
implementations of AbstractDCS independently of each other.
Somehow, when you import only urllib3, it is not possible to work with
the urllib3.exceptions.HTTPError exception (it looks like it is
imported from some other place).
`from urllib3.exceptions import HTTPError` solves the problem.
In addition to that, rename the confusing `Etcd.client` and
`ZooKeeper.client` attributes to `_client`. This attribute is
available from AbstractDCS and people had the wrong impression that
it provides the same interface for different DCS implementations,
which is obviously not the case: for Etcd it has type etcd.Client,
while for ZooKeeper it is a KazooClient.
Although this release was very buggy, it has some really nice features:
* an EtcdWatchTimedOut exception is raised when a `watch` call times out
* it supports SRV autodiscovery
Since we have already implemented our own SRV discovery, this feature
is not really interesting for us, but it solves the problem of having
two requirements files for different Python versions, because
python-etcd will install dnspython or dnspython3 as a dependency.
In order to fix https://github.com/jplana/python-etcd/issues/152 and
https://github.com/jplana/python-etcd/pull/154 I had to override the
`api_execute` method.
Previously, Patroni would die after receiving any exception from etcd
other than RetryFailedError or etcd.EtcdException. We have observed an
AttributeError raised by etcd on some occasions. With this change, we
demote ourselves, but do not terminate, on such exceptions.
If the namespace is not specified in the config file, /service/ will
be used.
It is also possible to use just '/' as a namespace. That means we
would have the following structure:
/scope1
/scope2
...
Such a situation could happen if we replaced all etcd nodes except the
one which was used by Patroni. After replacing the last node, Patroni
will try to execute the request on all other nodes from machines_cache,
but none of them are available. The machines cache would become empty
and Patroni would stick to the last node that was available in the
machines_cache and would never try to refresh the machines_cache, from
DNS for example.
Currently the machines cache is refreshed only when a request to the
etcd cluster has failed, but probably it should be done periodically,
for example every minute...
1. run touch_member from the main loop
2. move the code which takes care of long-running tasks into a separate class
3. change the format of data stored in the DCS: use JSON instead of a plain connection URL (see the example below)
4. change the Member class: from now on it deserializes everything into the data property
5. rework the API: from now on it takes into account the state of the current node in the DCS
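Illustrative only — roughly what a member's value might look like once it is stored as JSON instead of a bare connection URL; the exact set of fields and their values here are assumptions:
```python
import json

member_data = {
    'conn_url': 'postgres://10.0.1.12:5432/postgres',
    'api_url': 'http://10.0.1.12:8008/patroni',
    'state': 'running',
    'role': 'replica',
}
print(json.dumps(member_data))
```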