Originally fetch_nodes_statuses returned a plain tuple; later it was
wrapped into the namedtuple _MemberStatus, and recently _MemberStatus was
extended with a watchdog_failed field. However, api.py was still relying
on the plain tuple and checking failover limitations on its own instead
of calling the `failover_limitation` method.
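For illustration, a minimal sketch of the shape involved; the exact field list and the checks inside `failover_limitation` are assumptions, only `watchdog_failed` and the method name come from the description above:

```python
from collections import namedtuple

class _MemberStatus(namedtuple('_MemberStatus',
                               'member reachable in_recovery tags watchdog_failed')):

    def failover_limitation(self):
        """Return the reason this node cannot be promoted, or None if eligible."""
        if not self.reachable:
            return 'not reachable'
        if self.tags.get('nofailover', False):
            return 'not allowed to be promoted'
        if self.watchdog_failed:
            return 'not watchdog capable'
        return None
```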
* Only activate watchdog while master and not paused
We don't really need the protection while we are not master. This way
we only need to tickle the watchdog when we are updating the leader key
or while a demotion is in progress.
As implemented, we might fail to notice that the watchdog should be shut
down if someone demotes postgres and removes the leader key behind
Patroni's back. There are probably other similar cases. Basically, if
the administrator is being actively stupid they might get unexpected
restarts. That seems fine.
* Add configuration change support. Change MODE_REQUIRED to disable leader eligibility instead of shutting Patroni down.
The watchdog timeout is changed during the next keepalive when the ttl changes. The watchdog driver and the requirement mode can also be switched online.
When watchdog mode is `required` and the watchdog setup does not work, the effect is similar to nofailover. Add watchdog_failed to the status API to signify this. It is True only when the watchdog does not work **AND** it is required.
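A sketch of these semantics with illustrative names (not the actual Patroni internals):

```python
MODE_OFF, MODE_AUTOMATIC, MODE_REQUIRED = 'off', 'automatic', 'required'

def watchdog_failed(mode, watchdog_works):
    # True only when the watchdog does not work AND it is required
    return mode == MODE_REQUIRED and not watchdog_works

def leader_eligible(mode, watchdog_works, nofailover):
    # a failed-but-required watchdog behaves like the nofailover tag
    return not nofailover and not watchdog_failed(mode, watchdog_works)
```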
* Reset implementation when config changed while active.
* Add watchdog safety margin configuration
Defaults to 5 seconds. Basically this is the maximum amount of time
that can pass between the calls to `dcs.update_leader()` and
`watchdog.keepalive()`, which are called right after each other. This
should be safe for pretty much any sane scenario and allows the default
settings to not trigger the watchdog when the DCS is not responding.
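As a sketch of the arithmetic (the helper name is hypothetical): the watchdog timer has to cover the leader key TTL minus the margin, so postgres is reset before the key expires even if the keepalive following update_leader() turns out to be the last one we ever send:

```python
DEFAULT_SAFETY_MARGIN = 5  # seconds

def effective_watchdog_timeout(ttl, safety_margin=DEFAULT_SAFETY_MARGIN):
    # The watchdog must fire no later than safety_margin seconds before
    # the leader key TTL runs out, covering the gap between
    # dcs.update_leader() and watchdog.keepalive().
    return ttl - safety_margin
```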
* Cancel bootstrap if watchdog activation fails
The system would have demoted itself anyway on the next HA loop. Doing
it during bootstrap at least gives some other node a chance to try
bootstrapping, in the hope that it is configured correctly.
If all nodes are unable to activate, they will keep trying until the
disk fills up with moved data directories. Perhaps not ideal behavior,
but since the situation is unlikely to resolve itself without
administrator intervention it doesn't seem too bad.
Previously pg_ctl waited for a timeout and then happily trotted on, considering PostgreSQL to be running. This caused PostgreSQL to show up in listings as running when it actually was not, and caused a race condition that resulted in either a failover, or a crash recovery, or a crash recovery interrupted by a failover and a missed rewind.
This change adds a master_start_timeout parameter and introduces a new state for the main run_cycle loop: starting. When master_start_timeout is zero we fail over as soon as there is a failover candidate. Otherwise PostgreSQL will be started, but once master_start_timeout expires we stop it and release the leader lock if failover is possible. Once the failover succeeds or fails (no leader and no one to take the role), we continue with normal processing. While we are waiting for the master start timeout we still handle manual failover requests.
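A condensed sketch of the "starting" handling; names approximate the real run_cycle logic rather than reproduce it:

```python
def process_starting_master(self):
    # with a zero timeout we give up as soon as a candidate exists;
    # otherwise only after startup has been taking too long
    if self.master_start_timeout == 0 or self.time_in_state() > self.master_start_timeout:
        if self.is_failover_possible():
            self.stop_postgresql()
            self.release_leader_key()
            return 'stopped PostgreSQL while starting up because a failover is possible'
    # keep waiting; manual failover requests are still handled meanwhile
    return 'PostgreSQL is still starting up as a master'
```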
* Introduce timeout parameter to restart.
When a restart timeout is set, the master becomes eligible for failover after that timeout expires, regardless of master_start_timeout. Immediate restart calls will wait for this timeout to pass, even when the node is a standby.
Previously replicas were always watching the leader key (even if
postgres was not running there). It was not a big issue, but it was
impossible to interrupt such a watch when postgres finished starting up
or stopping. It also delayed the update_member call, so we had somewhat
stale information in the DCS for up to `loop_wait` seconds. This commit
changes that behavior. If the async_executor is busy starting, stopping,
or restarting postgres, we will not watch the leader key but instead
wait for an event from the async_executor for up to `loop_wait` seconds.
The async executor will fire such an event only if the function it was
calling returned something that evaluates to boolean True.
This functionality is really needed to change the way we decide whether
pg_rewind is necessary. That decision requires a running local postgres,
so it is really important for us to get such a notification as soon as
possible.
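A minimal sketch of that event mechanism (simplified relative to the real AsyncExecutor):

```python
import threading

class AsyncExecutor(object):

    def __init__(self):
        self.event = threading.Event()

    def run(self, func, args=()):
        self.event.clear()
        result = None
        try:
            result = func(*args)
        finally:
            # wake the HA loop only when the call produced a truthy result
            if result:
                self.event.set()
        return result

# The HA loop then replaces the leader-key watch with something like:
#     async_executor.event.wait(loop_wait)
```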
* Replace pytz.UTC with dateutil.tz.tzutc; it helps to reduce memory usage by more than 4 MB.
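The replacement is a drop-in for this use case, e.g.:

```python
from datetime import datetime
from dateutil.tz import tzutc

# tzutc() is a small pure-Python tzinfo; using it avoids importing pytz
# and its bundled timezone data just to get a UTC-aware timestamp.
now = datetime.now(tzutc())
```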
* Fix check of the Python version: 0x0300000 => 0x3000000
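Presumably this is the usual sys.hexversion check for python3; a digit slipped, which changes the value entirely:

```python
import sys

# Python packs versions as 0xMMmmppRS, so 3.0.0 final is 0x030000f0 and
# "running on python3" is sys.hexversion >= 0x3000000 (== 0x03000000).
# The broken constant 0x0300000 is only 3145728, smaller than any real
# hexversion, so the old check was always true.
PY3 = sys.hexversion >= 0x3000000
```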
* Update leader key before restart and demote
Adds a new configuration variable synchronous_mode. When enabled, Patroni will manage synchronous_standby_names to enable synchronous replication whenever there are healthy standbys available. With synchronous mode enabled, Patroni will automatically fail over only to a standby that was synchronously replicating at the time of the master failure. This effectively means zero lost user-visible transactions.
To enforce the synchronous failover guarantee, Patroni stores the current synchronous replication state in the DCS using strict ordering: first enable synchronous replication, then publish the information (sketched below). A standby can use this to verify that it was indeed a synchronous standby before the master failed and is therefore allowed to fail over.
We can't enable multiple standbys as synchronous, letting PostgreSQL pick one, because we can't know which one was actually set as synchronous on the master when it failed. This means that on standby failure, commits will be blocked on the master until the next run_cycle iteration. TODO: figure out a way to poke Patroni to run sooner, or allow PostgreSQL to pick one without the possibility of lost transactions.
On graceful shutdown, standbys will disable themselves by setting a nosync tag and waiting for the master to notice and pick another standby. This adds a new mechanism for Ha to publish dynamic tags to the DCS.
When the synchronous standby goes away or disconnects, a new one is picked and Patroni switches the master's synchronous standby over to the new one. If no synchronous standby exists, Patroni disables synchronous replication (synchronous_standby_names=''), but not synchronous_mode. In this case, only the node that was previously the master is allowed to acquire the leader lock.
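A sketch of that strict ordering (method names are illustrative; the real flow lives in Ha and the DCS layer):

```python
def enable_synchronous_replication(self, standby_name):
    # 1. Make the master actually wait for this standby *before* it is
    #    advertised anywhere.
    self.postgresql.set_synchronous_standby(standby_name)
    # 2. Only then publish it to the DCS as the failover-safe candidate.
    #    A standby that finds its own name here is guaranteed to have
    #    been synchronous while the master was still alive.
    self.dcs.write_sync_state(leader=self.name, sync_standby=standby_name)
```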
Added acceptance tests and documentation.
Implementation by @ants with extensive review by @CyberDem0n.
When Patroni was "joining" an already running postgres it was not calling
callbacks, which in some cases caused issues (a callback could be used
to change routing/load-balancer configuration or to assign/remove a
floating (service) IP).
In addition to that, we should `start` postgres instead of `restart`-ing
it when doing recovery, because in this case the 'on_start' callback
should be called instead of 'on_restart'.
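For reference, Patroni callbacks are user-provided scripts invoked with the action, the role, and the cluster name, roughly like this (simplified):

```python
import subprocess

def call_callback(script, action, role, cluster_name):
    # action is e.g. 'on_start', 'on_restart', 'on_role_change'
    return subprocess.Popen([script, action, role, cluster_name])
```

Recovering via `start` rather than `restart` is what makes the script receive 'on_start' here.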
* Make different kazoo timeouts dependent on loop_wait
ping timeout ~ 1/2 * loop_wait
connect_timeout ~ 1/2 * loop_wait
Originally these values were calculated from the negotiated session
timeout and didn't work very well, because it could take a significant
amount of time (up to the session timeout) to figure out that the
connection was dead and to reconnect, leaving us no time to retry.
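The relationship itself is trivial; hooking these values into kazoo's connection handling is the non-trivial part and is not shown here:

```python
def kazoo_timeouts(loop_wait):
    ping_timeout = loop_wait / 2.0     # notice a dead connection quickly
    connect_timeout = loop_wait / 2.0  # and still leave time for a retry
    return ping_timeout, connect_timeout
```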
* Address the code review
Fix return value in the should_run_scheduled_action and the comments.
Correct the json composition in the scheduled_restart test.
Fix the delete in case there is no scheduled restart.
Fix the usage of format in the logger output.
Fix the indentation in the evaluate_scheduled_restart.
Fix the condition related to the body_is_optional in the do_POST_restart.
Fix a few typos in the error messages.
Fix the _read_json_content
Make the scheduled restart unit-tests a bit less ugly
Make sure the scheduled restart flag is cleared when the
postmaster_start_time changes after the restart was scheduled.
Additionally, separate the logic of checking the restart conditions
into its own function in order to support conditions for normal
restarts as well.
The scheduled restart data structures are currently independent of
those used by normal restarts; this will be fixed in subsequent
commits.
Add the behave tests that cover POST /restart (but not DELETE).
The scheduled restart API extends the already existing restart
endpoint by processing the parameters in the request body. Only one
scheduled restart at a time is supported. The DELETE method on the
/restart endpoint is used to remove an existing scheduled restart.
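A hypothetical invocation of the endpoint; the 'schedule' field carrying an ISO 8601 timestamp is an assumption about the body schema, based on the description above:

```python
import json
try:
    from urllib.request import Request, urlopen   # python3
except ImportError:
    from urllib2 import Request, urlopen          # python2

body = json.dumps({'schedule': '2016-08-20T09:30:00+00:00'}).encode('utf-8')
req = Request('http://127.0.0.1:8008/restart', data=body,
              headers={'Content-Type': 'application/json'})
urlopen(req)  # schedules the restart

# removing the pending restart uses DELETE on the same endpoint
req = Request('http://127.0.0.1:8008/restart')
req.get_method = lambda: 'DELETE'
urlopen(req)
```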
This also removes some other tricks with overriding the
handle_one_request and finish methods of the parent class, which were
necessary only to make the OPTIONS request from haproxy work with
python2, but in fact still did not work with python3. Instead of doing
all that magic we should simply give haproxy what it wants: an HTTP
response code and nothing more.
Mostly this tag is necessary to give a hint to a load-balancer
auto-configuration tool that the node should not be included in the
LB configuration.
In addition to that, Patroni should not return status_code=200
for a health check if the tag is present and its value is not `False`.
By default, haproxy sends an OPTIONS request, which we didn't
handle until now. In addition, all haproxy requests that don't
examine the request body close the connection as soon as the status
code is obtained. Such behavior breaks BaseHTTPRequestHandler,
namely handle_one_request, which doesn't check for connection reset
by peer and throws this error at a higher level; but since we don't
call this function directly, there is no place in our code to catch
it, therefore we have to patch this function in the base class.
In addition, patch the StreamRequestHandler finish() function in
order to handle the connection reset error.
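A simplified sketch of those two patches; the real handler does more, this only shows the connection-reset handling and the OPTIONS shortcut (do_GET here is a stub so the sketch runs standalone):

```python
import socket
try:
    from http.server import BaseHTTPRequestHandler       # python3
except ImportError:
    from BaseHTTPServer import BaseHTTPRequestHandler    # python2

class RestApiHandler(BaseHTTPRequestHandler):

    def handle_one_request(self):
        try:
            BaseHTTPRequestHandler.handle_one_request(self)
        except socket.error:   # connection reset by peer: haproxy is gone
            self.close_connection = 1

    def finish(self):
        try:
            BaseHTTPRequestHandler.finish(self)
        except socket.error:   # the same race during the final flush
            pass

    def do_OPTIONS(self):
        # haproxy's default health check; answer it like a GET
        self.do_GET()

    def do_GET(self):
        self.send_response(200)
        self.end_headers()
```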
Re-read the cluster from DCS right after the failover to supply
the correct new values to the API thread. Fix a typo.