patroni

mirror of https://github.com/outbackdingo/patroni.git synced 2026-01-27 18:20:05 +00:00

Author	SHA1	Message	Date
Ants Aasma	70d718a058	Simplify watchdog code (#452) * Only activate watchdog while master and not paused We don't really need the protections while we are not master. This way we only need to tickle the watchdog when we are updating leader key or while demotion is happening. As implemented we might fail to notice to shut down the watchdog if someone demotes postgres and removes leader key behind Patroni's back. There are probably other similar cases. Basically if the administrator is being actively stupid they might get unexpected restarts. That seems fine. * Add configuration change support. Change MODE_REQUIRED to disable leader eligibility instead of closing Patroni. Changes watchdog timeout during the next keepalive when ttl is changed. Watchdog driver and requirement can also be switched online. When watchdog mode is `required` and watchdog setup does not work then the effect is similar to nofailover. Add watchdog_failed to status API to signify this. This is True only when watchdog does not work AND it is required. * Reset implementation when config changed while active. * Add watchdog safety margin configuration Defaults to 5 seconds. Basically this is the maximum amount of time that can pass between the calls to `dcs.update_leader()` and `watchdog.keepalive()`, which are called right after each other. Should be safe for pretty much any sane scenario and allows the default settings to not trigger watchdog when DCS is not responding. * Cancel bootstrap if watchdog activation fails The system would have demoted itself anyway the next HA loop. Doing it in bootstrap gives at least some other node chance to try bootstrapping in the hope that it is configured correctly. If all nodes are unable to activate they will continue to try until the disk is filled with moved datadirs. Perhaps not ideal behavior, but as the situation is unlikely to resolve itself without administrator intervention it doesn't seem too bad.	2017-07-27 12:16:11 +02:00
Alexander Kukushkin	d5b3d94377	Custom bootstrap (#454) The task of restoring a cluster from backup or cloning an existing cluster into a new one was floating around for some time. It was kind of possible to achieve it, but only by doing a lot of manual and very error-prone actions. So I came up with the idea of making the way we bootstrap a new cluster configurable. In short: we want to run a custom script instead of running initdb.	2017-07-18 15:12:58 +02:00
Alexander Kukushkin	acc6d7c2c2	Watchdog unit-tests, bugfixes and questions (#449) Implement missing unit-tests for and drop unused code	2017-07-11 10:00:30 +02:00
Alexander Kukushkin	681b6b507b	Support unix sockets when connecting to a local postgres cluster (#457) For backward compatibility this feature is not enabled by default. To enable it you have to set `postgresql.use_unix_socket: true`. If the feature is enabled, and `unix_socket_directories` is defined and non-empty, Patroni will use the first suitable value from it to connect to the local postgres cluster. If `unix_socket_directories` is not defined, Patroni will assume that the default value should be used and will not pass `host` to command line arguments and will omit it from the connection url. Solves: https://github.com/zalando/patroni/issues/61 In addition to the above, this commit fixes a couple of bugs: * manual failover with pg_rewind in a pause state was broken * psycopg2 (or libpq, I am not really sure what exactly) doesn't mark the cursor's connection as closed when we use a unix socket and an `OperationalError` occurs. We will close such a connection on our own.	2017-06-22 11:47:57 +02:00
Ants Aasma	a70b46ef13	Add watchdog support on Linux (#343) Ensures that system gets rebooted before TTL runs out. Initial version. Open questions: Do we want to disable watchdog while we are not master?	2017-06-01 16:53:46 +02:00
Alexander Kukushkin	37c1552c0a	Smart pg_rewind (#417) Previously we were running pg_rewind only in a limited number of cases: * when we knew postgres was a master (no recovery.conf in data dir) * when we were doing a manual switchover to a specific node (no guarantee that this node is the most up-to-date) * when a given node has the nofailover tag (it could be ahead of the new master) This approach was kind of working in most cases, but sometimes we were executing pg_rewind when it was not necessary and in some other cases we were not executing it although it was needed. The main idea of this PR is to first figure out whether we really need to run pg_rewind by analyzing timelineid, LSN and history file on master and replica, and to run it only if it's needed.	2017-05-19 16:32:06 +02:00
Alexander Kukushkin	39f5f7982c	Scheduled failovers in 1 second don't work reliably with loop_wait=2	2017-01-13 11:25:07 +01:00
Alexander Kukushkin	1f829a4b34	Switch to trusty and run acceptance tests with postgres 9.6	2017-01-13 09:32:38 +01:00
Alexander Kukushkin	d138a8db17	AT for master_start_timeout + minor fixes (#361)	2016-12-09 12:02:41 +01:00
Alexander Kukushkin	37b020e7a3	Various bugfixes and improvements: (#346) * Replace pytz.UTC with dateutil.tz.tzutc, it helps to reduce memory by more than 4Mb... * fix check of python version: 0x0300000 => 0x3000000 * Update leader key before restart and demote	2016-11-04 18:42:56 +02:00
Ants Aasma	7e53a604d4	Add synchronous replication support. (#314) Adds a new configuration variable synchronous_mode. When enabled Patroni will manage synchronous_standby_names to enable synchronous replication whenever there are healthy standbys available. With synchronous mode enabled Patroni will automatically fail over only to a standby that was synchronously replicating at the time of master failure. This effectively means zero lost user visible transactions. To enforce the synchronous failover guarantee Patroni stores current synchronous replication state in the DCS, using strict ordering, first enable synchronous replication, then publish the information. Standby can use this to verify that it was indeed a synchronous standby before master failed and is allowed to fail over. We can't enable multiple standbys as synchronous, allowing PostgreSQL to pick one because we can't know which one was actually set to be synchronous on the master when it failed. This means that on standby failure commits will be blocked on the master until next run_cycle iteration. TODO: figure out a way to poke Patroni to run sooner or allow for PostgreSQL to pick one without the possibility of lost transactions. On graceful shutdown standbys will disable themselves by setting a nosync tag for themselves and waiting for the master to notice and pick another standby. This adds a new mechanism for Ha to publish dynamic tags to the DCS. When the synchronous standby goes away or disconnects a new one is picked and Patroni switches master over to the new one. If no synchronous standby exists Patroni disables synchronous replication (synchronous_standby_names=''), but not synchronous_mode. In this case, only the node that was previously master is allowed to acquire the leader lock. Added acceptance tests and documentation. Implementation by @ants with extensive review by @CyberDem0n.	2016-10-19 16:12:51 +02:00
Alexander Kukushkin	1e573aec8f	Do session/renew call to Consul when update_leader is called (#336)	2016-10-10 10:05:55 +02:00
Alexander Kukushkin	4594bc98da	Increase timeouts when running AT on travis (#324) * Increase timeouts two times when running AT on travis * Make up to 3 attempts to download DCS * Get rid of hard-coded names	2016-09-28 15:13:09 +02:00
Alexander Kukushkin	10c7fa41f3	Exclude unhealthy nodes when choosing where to clone from (#313) Node MUST have tag clonefrom: true, be in the 'running' state and also we should not try to clone from itself.	2016-09-21 09:42:48 +02:00
Alexander Kukushkin	0b1bfeca5b	Make sure that we are running and testing latest versions of everything (#303)	2016-09-19 13:32:53 +02:00
Alexander Kukushkin	33ff372ef6	Always try to rewind on manual failover	2016-09-01 11:08:26 +02:00
Alexander Kukushkin	1dcdd6eaa0	Acceptance tests for pause mode	2016-08-30 16:50:07 +02:00
Alexander Kukushkin	366ed9cc52	fix pep8 formatting and implement missing tests	2016-08-29 15:39:24 +02:00
Murat Kabilov	a47a2bceff	Manage scheduled restarts using patronictl (#248) Manage scheduled restarts using patronictl	2016-08-09 12:54:48 +02:00
Oleksii Kliukin	ffd27b5705	Rename with_pending_restart to restart_pending.	2016-07-13 11:07:37 +02:00
Oleksii Kliukin	bf95b75489	Use the parameter that really sets the pending_restart flag.	2016-07-11 18:20:15 +02:00
Oleksii Kliukin	c91eda8d78	Merge branch 'master' into feature/scheduled_restarts	2016-07-11 12:56:24 +02:00
Alexander Kukushkin	ae88e7c96e	Document that every single zookeeper host:port MUST be quoted otherwise yaml library can not parse the list. And make visible yaml exception when trying to parse this list.	2016-06-29 14:25:50 +02:00
Oleksii Kliukin	7a1e2e0c72	Fix the assert message.	2016-06-28 17:11:13 +02:00
Oleksii Kliukin	d2832ee43b	Address the code review. Fix return value in the should_run_scheduled_action and the comments. Correct the json composition in the scheduled_restart test. Fix the delete in case there is no scheduled restart. Fix the usage of format in the logger output. Fix the indentation in the evaluate_scheduled_restart. Fix the condition related to the body_is_optional in the do_POST_restart. Fix a few typos in the error messages. Fix the _read_json_content. Make the scheduled restart unit-tests a bit less ugly.	2016-06-28 16:54:20 +02:00
Oleksii Kliukin	29845dd383	Restart the node according to the schedule. The scheduled restart data structures are now independent of those used by the normal restarts. This would be fixed in subsequent commits. Add the behave tests, that cover the POST /restart (but not DELETE).	2016-06-23 10:43:54 +02:00
Alexander Kukushkin	27bdc65e46	Fix acceptance tests with python3	2016-06-16 15:27:41 +02:00
Alexander Kukushkin	fcde17583c	Acceptance tests for patronictl Call patronictl.py when it's possible instead of doing REST API calls.	2016-06-16 15:06:18 +02:00
Alexander Kukushkin	5f4e582660	Merge branch 'master' of github.com:zalando/patroni into feature/dynamic-configuration	2016-06-09 11:04:28 +02:00
Alexander Kukushkin	50d118c3aa	Split ZooKeeper and Exhibitor Originally Exhibitor was supported in the ZooKeeper class and configuration for Exhibitor was also taken from the `zookeeper` section in the yaml config file. In fact, Exhibitor just extends ZooKeeper, and now this is reflected in the code and Exhibitor also got its own section in the config.yaml file. It will make it easier to configure Exhibitor hosts and port via environment variables when PR#211 is merged.	2016-06-08 19:21:18 +02:00
Alexander Kukushkin	24822bd9ac	Returning 304 for POST, PATCH, PUT is not a good idea	2016-06-06 10:50:42 +02:00
Alexander Kukushkin	ebb9e252d8	Rename restart_pending to pending_restart for compatibility	2016-06-02 09:31:30 +02:00
Alexander Kukushkin	1c30948ef9	Implement PUT /config and enhance some checks	2016-06-01 17:06:31 +02:00
Alexander Kukushkin	b7359e7b0d	Rollback all changes to basic_replication.feature since I moved all functionality to patroni_api.feature	2016-05-30 12:40:52 +02:00
Alexander Kukushkin	f7912991a8	Reshuffle acceptance tests one more time	2016-05-30 12:37:14 +02:00
Alexander Kukushkin	e085c866dc	Reshuffle acceptance tests Move dynamic config tests from basic_replication to patroni_api	2016-05-30 11:30:41 +02:00
Alexander Kukushkin	073ef3784f	Implement PATCH /config	2016-05-27 16:29:33 +02:00
Alexander Kukushkin	6700cd0aa6	Implement reload of config.yml with REST API call and acceptance tests for that	2016-05-26 17:09:40 +02:00
Alexander Kukushkin	45cbc8ca70	Implement acceptance test for dynamic configuration functionality and fix some bugs revealed by acceptance tests	2016-05-26 10:16:24 +02:00
Alexander Kukushkin	ceace03646	Address codacy and travis issues	2016-05-25 14:49:33 +02:00
Alexander Kukushkin	7827951c8c	Dynamic configuration	2016-05-25 14:17:05 +02:00
Alexander Kukushkin	eabfd82a5d	Implement Consul support	2016-04-27 10:59:01 +02:00
Alexander Kukushkin	fd4f12aac8	Do not assume that connection user is postgres, but take it from config.yml	2016-04-21 13:56:09 +02:00
Alexander Kukushkin	7006a4ee14	Sometimes replica can't attach to the master after pg_rewind The reason for that is: it takes up to 10 seconds to create replication slot + up to 5 seconds to start streaming and recover.	2016-04-13 14:28:00 +02:00
Alexander Kukushkin	d57310bbc0	Fix one more corner-case It could take up to 10 seconds to create replication slot. In addition to that, when a replica fails to connect to the master via streaming replication it doesn't retry immediately, but with some timeout (5 seconds). 10 + 5 == 15, which causes the replication check scenarios to fail.	2016-04-13 14:09:45 +02:00
Alexander Kukushkin	01da5266a0	Give time for running health-checks when promoting replica	2016-04-13 13:32:39 +02:00
Alexander Kukushkin	b4e86f0809	Make it possible to schedule failover in less than 10 seconds But only when API request was posted to the leader	2016-04-13 13:32:39 +02:00
Alexander Kukushkin	15d30a2d35	Try to stabilize acceptance tests	2016-04-13 13:32:39 +02:00
Alexander Kukushkin	f8bf1bb0ab	Disable sudo, reshuffle travis tasks and introduce caching Without sudo travis is executing build tasks using docker and waiting time in this case is really small, usually not longer than 10 seconds. postgresql-9.5 is installed via addons.apt.packages (without sudo) But ports 5432 and 5433 are busy. So I had to adjust environment.py to assign a port from a higher range. And a few words about build tasks: The first task is used for executing unit tests for all different python versions The second one is used for executing acceptance tests against etcd The third one is used for executing acceptance tests against zookeeper acceptance tests are executed with python2.7 and python3.5 In addition to that I've introduced caching of the python virtual environment. It really helps to reduce the time needed to install python modules.	2016-04-13 13:32:39 +02:00
Alexander Kukushkin	24a2ea6cef	Refactor acceptance tests to make them work against ZooKeeper and make it easier to implement controllers for new DCS, i.e. consul	2016-04-10 10:37:43 +02:00
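
The clone-source rule stated in commit 10c7fa41f3 (#313) has three conditions: the node has the tag `clonefrom: true`, it is in the 'running' state, and it is not the node doing the cloning. A minimal sketch of that filter follows. The `Member` class and the `eligible_clone_sources` function are hypothetical names made up for this illustration and are not taken from Patroni's code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Member:
    # Hypothetical stand-in for a cluster member record (not Patroni's own class).
    name: str
    state: str
    tags: Dict[str, object] = field(default_factory=dict)


def eligible_clone_sources(members: List[Member], my_name: str) -> List[Member]:
    """Keep only members that meet all three conditions from the commit message."""
    return [
        m for m in members
        if m.tags.get("clonefrom") is True   # must carry the tag clonefrom: true
        and m.state == "running"             # must be in the 'running' state
        and m.name != my_name                # never clone from ourselves
    ]


# Illustrative data only.
members = [
    Member("node1", "running", {"clonefrom": True}),
    Member("node2", "running", {"clonefrom": True}),
    Member("node3", "stopped", {"clonefrom": True}),
    Member("node4", "running"),
]

print([m.name for m in eligible_clone_sources(members, "node1")])
```

With this sample data only `node2` passes when `node1` asks: `node1` is excluded as itself, `node3` is not running, and `node4` lacks the tag.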

84 Commits