If an Etcd node is partitioned from the rest of the cluster, it is still
possible to read from it (although it may return stale information),
but it is not possible to write to it.
Previously Patroni was trying to fetch a fresh cluster view from DCS in
order to figure out whether it was still the leader, and the partitioned
Etcd node kept returning stale information in which the node still owned
the leader key, but with a negative TTL.
This weird bug clearly shows how dangerous premature optimization is.
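The stale read can at least be recognised, because the TTL it reports has already expired. A minimal illustration of that check (the dict layout and helper name are hypothetical, not the Patroni internals):

```python
def effective_leader(leader_key):
    """Return the leader name only if the lease is still alive.

    `leader_key` is assumed to be a dict built from the etcd response,
    e.g. {'value': 'node1', 'ttl': -23}.  A non-positive TTL means the
    lease already expired on the majority side and the read is stale.
    """
    if not leader_key or leader_key.get('ttl', 0) <= 0:
        return None
    return leader_key['value']
```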
We make a number of attempts when trying to initialize a replica using
different methods. Any of these attempts may create files in the data
directory, which causes the subsequent attempts to fail.
In addition to that, improve logging when creating a replica.
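A minimal sketch of the cleanup step this implies (hypothetical helper, not the exact Patroni code): wipe whatever a failed attempt left behind before trying the next replica creation method.

```python
import os
import shutil

def clean_data_directory(data_dir):
    """Remove everything a failed replica-creation attempt may have left."""
    for name in os.listdir(data_dir):
        path = os.path.join(data_dir, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.unlink(path)
```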
* reap children before and after running HA loop
When Patroni is running in a Docker container with pid=1, it is also
responsible for reaping all dead processes. We can't call os.waitpid
immediately after receiving SIGCHLD because it breaks the subprocess
module: it simply stops receiving the exit codes of the processes it
executes, because they have already been reaped. That's why we only
register the fact that SIGCHLD was received and reap children after the
HA loop has been executed.
If the postmaster died for some reason, Patroni was able to detect this
fact only on the next iteration of the HA loop, because the zombie
process was still there and it was still possible to send signal 0 to it.
To avoid this situation we should also reap all dead processes before
executing the HA loop.
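A self-contained sketch of the approach described above (the names are illustrative): the signal handler only records that SIGCHLD arrived, and zombies are collected with a non-blocking waitpid right before and right after the HA loop body.

```python
import os
import signal

_sigchld_received = False

def sigchld_handler(signo, frame):
    # Only record the fact; do not call os.waitpid here, so the subprocess
    # module can still collect the exit codes of its own children.
    global _sigchld_received
    _sigchld_received = True

def reap_children():
    # Called immediately before and after the HA loop body.
    global _sigchld_received
    if not _sigchld_received:
        return
    _sigchld_received = False
    try:
        while os.waitpid(-1, os.WNOHANG)[0] != 0:
            pass  # keep collecting until no more zombies are left
    except OSError:
        pass  # no child processes

signal.signal(signal.SIGCHLD, sigchld_handler)
```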
* Don't rely on _cursor_holder when closing connection
It can happen that the connection has been opened but the cursor has not...
* Don't "retry" when fetching current xlog location and it fails
On every iteration of the HA loop we update the member key in DCS, and
among other data the value contains the current xlog location.
If postgres has died for some reason it is not possible to fetch the
xlog position, and we just waste retry_timeout/2 = 5 seconds there.
If this information is missing from DCS for the duration of one HA loop,
nothing should break. Patroni does not rely on this information anyway:
when it is doing a manual or automatic failover it always communicates
with the other nodes directly to get the freshest information.
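A hedged sketch of the resulting behaviour (the query uses the pre-10 xlog naming, error handling is simplified): fetch the position once, and if postgres cannot answer, just leave the field out of the member key for this cycle.

```python
def current_xlog_location(cursor):
    """Fetch the xlog position without retrying; return None on failure."""
    try:
        cursor.execute("SELECT pg_current_xlog_location()")
        return cursor.fetchone()[0]
    except Exception:
        return None  # the member key simply omits the position this cycle
```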
* Don't try to update leader optime when postgres is not 100% healthy
The `update_lock` method not only updates the leader lock but also
writes the most recent xlog position into the optime/leader key. If we
know that postgres may not be 100% healthy, because it is in the process
of a restart or recovery, we should not try to fetch the current xlog
position and update 'optime/leader'. Previously we were using the
`AsyncExecutor.busy` property to avoid such an action, but I think
we should be more explicit and do the update only if we know that
postgres is 100% healthy.
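A sketch of the resulting logic, assuming hypothetical `is_healthy()`, `last_operation()` and DCS helper names rather than the real Patroni API: the optime/leader key is written only when postgres is known to be fully up.

```python
def update_lock(dcs, state_handler):
    """Renew the leader lock; publish optime only for a healthy postgres."""
    ret = dcs.update_leader()
    if ret and state_handler.is_healthy():  # skip during restart/recovery
        dcs.write_leader_optime(state_handler.last_operation())
    return ret
```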
* Avoid retries when syncing replication slots.
Do not retry the postgres queries that fetch, create and drop slots at the end of
the HA cycle. The complete run_cycle routine executes while holding the
async_executor lock. This lock is also used when scheduling operations like
reinit or restart from different threads. It looks like CPython's threading lock
has fairness issues when multiple threads try to acquire the same lock and one
of them executes long-running actions while holding it: the others have little
chance of acquiring the lock in order. To get around this issue, the long action
(i.e. retrying the query) is removed.
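A small standalone demonstration of that fairness problem (plain `threading.Lock`, nothing Patroni-specific): a thread that re-acquires the lock in a tight loop while doing long work under it can keep another waiter out for many iterations, because releasing a lock does not hand it over to the waiting thread.

```python
import threading
import time

lock = threading.Lock()

def long_holder():
    # Simulates run_cycle doing long (retried) work while holding the lock.
    for _ in range(20):
        with lock:
            time.sleep(0.5)

def impatient():
    # Simulates a reinit/restart request coming from another thread.
    start = time.time()
    with lock:
        print('acquired after %.1fs' % (time.time() - start))

threading.Thread(target=long_holder).start()
time.sleep(0.1)
impatient()  # may wait for several 0.5s slices, not just one
```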
Investigation by Ants Aasma and Alexander Kukushkin.
This error is sent by etcd when Patroni is doing a "watch" on the leader key,
which is never updated after creation, while the etcd cluster receives a lot of
updates, which clears the history of events.
Instead of watching on modifiedIndex + 1 we will watch on X-Etcd-Index,
which is probably still available...
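A sketch of the idea using the python-etcd client (the wrapper itself is hypothetical, and the exact index offset handling is simplified): watch from the cluster-wide index reported in the X-Etcd-Index header rather than from the key's own modifiedIndex.

```python
import etcd

def watch_leader(client, key, last_result):
    # last_result.etcd_index is populated from the X-Etcd-Index response
    # header; it tracks the cluster-wide index, so watching from it avoids
    # the "event index cleared" error that modifiedIndex + 1 can trigger
    # once the key's last change falls out of etcd's bounded event history.
    return client.watch(key, index=last_result.etcd_index)
```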
Instead of emptying the stale failover key as a master and bailing
out, continue with the healthiest-node evaluation. This should make
the actual master acquire the leader key faster. Emit a warning
message as well and add unit tests.
PostgreSQL replication slot names only allow characters from [a-z0-9_].
Invalid characters cause replication slot creation and standby startup to fail.
This change substitutes the invalid characters with underscores or unicode
code points. In case multiple member names map to identical replication slot
names, the master log will contain a corresponding error message.
Motivated by wanting to use hostnames as member names. Hostnames often
contain periods and dashes.
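A sketch of the substitution described above (the exact replacement rules in Patroni may differ slightly): lowercase the member name and map every character outside [a-z0-9_] either to an underscore (for periods and dashes) or to its unicode code point.

```python
import re

def slot_name_from_member_name(member_name):
    """Translate a member name into a valid replication slot name."""
    def replace_char(match):
        c = match.group(0)
        return '_' if c in '-.' else 'u{:04x}'.format(ord(c))
    return re.sub('[^a-z0-9_]', replace_char, member_name.lower())[:63]
```

For example, a member named `db-node1.example.com` would map to the slot `db_node1_example_com`.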
Any node of the cluster will maintain its member key as long as Patroni is
running there.
The master node will also maintain the leader key as long as postgres is
running as a master. If postgres is not running or it is running
'in_recovery', Patroni will release the leader lock.
Bootstrap of a new cluster will work (it is possible to specify
paused: true in the `bootstrap.dcs`). Replicas will also be able to join
the cluster if the leader lock exists.
If postgres is not running on the node, Patroni will not try to bring it
up. Pause mode also disables reinitialize and all kinds of scheduled
actions, i.e. scheduled restart and scheduled failover.
In case DCS stops being reachable, Patroni will not "demote" the master
if automatic failover was disabled.
Patroni will not stop postgres on exit.
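A rough sketch of the paused branch of the HA loop implied by these rules (all names are illustrative, not the real Patroni internals):

```python
def run_paused_cycle(ha):
    ha.touch_member()                # the member key is always maintained
    if ha.has_lock():
        if ha.state_handler.is_leader():
            return ha.update_lock()  # keep the leader key while postgres is a master
        ha.release_leader_key()      # postgres is gone or running in recovery
    # no automatic start of postgres, no reinitialize,
    # and no scheduled restarts/failovers while paused
    return 'PAUSE: no action'
```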
When Patroni was "joining" already running postgres it was not calling
callbacks, what in some cases causing issues (callback could be used to
change routing/load-balancer or assign/remove floating (service) ip.
In addition to that we should `start` postgres instead of `restart`-ing
it when doing recovery, because in this case 'on_start' callback should
be called, instead of 'on_restart'
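An illustrative fragment of the intended behaviour (the method names here are hypothetical, not the Patroni API): fire the callback when attaching to an already running postgres, and prefer `start` over `restart` during recovery so that 'on_start' is the callback that runs.

```python
def attach_or_recover(postgresql):
    if postgresql.is_running():
        # joining an already running postgres: callbacks must still fire so
        # routing/load-balancer state and floating IPs get updated
        postgresql.call_callback('on_start')
    else:
        # recovery path: start (not restart), so 'on_start' is invoked
        postgresql.start()
```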
In the original code we were parsing and deparsing url-style connection
strings back and forth. That was not really resource hungry but rather
annoying. Also it was not really obvious how to switch all local
connections to unix sockets (preferably).
This commit isolates the different use-cases of working with connection
strings and minimizes the amount of code parsing and deparsing them. It also
introduces one new helper method in the `Member` object - `conn_kwargs`.
This method can accept as a parameter a dict object with credentials
(username and password). As a result it returns a dict object which can
be used by `psycopg2.connect` or for building connection urls for
pg_rewind, pg_basebackup or some other replica creation methods.
Params for the local connection are built in the `_local_connect_kwargs`
method and can easily be changed to a unix socket later.
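A minimal self-contained sketch of the idea behind `conn_kwargs` (not the exact Patroni implementation): turn a member's url-style connection string into keyword arguments for `psycopg2.connect`, merging in the supplied credentials.

```python
from urllib.parse import urlparse

def conn_kwargs(conn_url, auth=None):
    r = urlparse(conn_url)
    ret = {'host': r.hostname, 'port': r.port or 5432, 'database': r.path[1:]}
    if auth:
        if auth.get('username'):
            ret['user'] = auth['username']
        if auth.get('password'):
            ret['password'] = auth['password']
    return ret

# e.g. psycopg2.connect(**conn_kwargs('postgres://10.0.0.1:5432/postgres',
#                                     {'username': 'replicator', 'password': 'secret'}))
```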
There is no point in trying to update the topology until the original
request has been performed. Also, for us it is more important to execute
the original request than to keep the topology of the etcd cluster in sync.
In addition to that, implement the same retry-timeout logic in the
`machines` property that is already used in the `api_execute` method.
* Make different kazoo timeouts dependent on loop_wait
ping timeout ~ 1/2 * loop_wait
connect_timeout ~ 1/2 * loop_wait
Originally these values were calculated from the negotiated session timeout
and didn't work very well, because it took significant time (up to the
session timeout) to figure out that the connection was dead and to
reconnect, leaving us no time to retry.
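The resulting relationship, written out as a trivial sketch (the exact ratios in the code may differ):

```python
def kazoo_timeouts(loop_wait):
    # both timeouts track the HA loop interval instead of the negotiated
    # session timeout, so a dead connection is noticed within one cycle
    ping_timeout = loop_wait / 2.0
    connect_timeout = loop_wait / 2.0
    return ping_timeout, connect_timeout
```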
* Address the code review
- We don't want to export the RestApi object, since it initializes the
socket and listens on it.
- Change get_dcs, so that the explicit scope passed to it will take
priority over the one in the configuration file.
In particular, replace the fixed dates for future actions
in the unit tests with dates that depend on the current date,
avoiding the "timebomb" effect.