* Only activate watchdog while master and not paused
We don't really need the protections while we are not master. This way we only need to tickle the watchdog when we are updating the leader key or while a demotion is happening.
As implemented, we might fail to notice that the watchdog should be shut down if someone demotes postgres and removes the leader key behind Patroni's back. There are probably other similar cases. Basically, if the administrator is being actively stupid they might get unexpected restarts. That seems fine.
* Add configuration change support. Change MODE_REQUIRED to disable leader eligibility instead of closing Patroni.
The watchdog timeout is changed during the next keepalive when ttl is changed. The watchdog driver and the requirement can also be switched online.
When watchdog mode is `required` and the watchdog setup does not work, the effect is similar to nofailover. Add watchdog_failed to the status API to signify this. It is True only when the watchdog does not work **AND** it is required.
* Reset implementation when config changed while active.
* Add watchdog safety margin configuration
Defaults to 5 seconds. Basically this is the maximum amount of time that can pass between the calls to `dcs.update_leader()` and `watchdog.keepalive()`, which are called right after each other. This should be safe for pretty much any sane scenario and allows the default settings to avoid triggering the watchdog when the DCS is not responding. (See the sketch after this list.)
* Cancel bootstrap if watchdog activation fails
The system would have demoted itself anyway on the next HA loop. Doing it in bootstrap at least gives some other node a chance to try bootstrapping, in the hope that it is configured correctly.
If all nodes are unable to activate the watchdog they will continue to try until the disk is filled with moved datadirs. Perhaps not ideal behavior, but as the situation is unlikely to resolve itself without administrator intervention it doesn't seem too bad.
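To make the timing concrete, here is a minimal sketch of the intended behaviour, assuming invented class and function names (this is not Patroni's actual implementation): the watchdog is armed and tickled only while we hold the leader lock and are not paused, and its timeout is sized so that it fires no later than the leader key expiry even if the keepalive is delayed by up to the safety margin.
```python
TTL = 30            # leader key time-to-live in the DCS (seconds)
SAFETY_MARGIN = 5   # max delay tolerated between update_leader() and keepalive()


class FakeDCS:
    """Stand-in for the real DCS client (illustrative only)."""

    def update_leader(self):
        print('leader key refreshed, expires in {}s'.format(TTL))


class FakeWatchdog:
    """Stand-in for a real watchdog driver (illustrative only)."""

    def __init__(self):
        self.active = False

    def activate(self, timeout):
        # A real driver would open /dev/watchdog here and set its timeout.
        self.active = True
        print('watchdog armed, timeout={}s'.format(timeout))

    def keepalive(self):
        if self.active:
            print('watchdog tickled')

    def disable(self):
        if self.active:
            self.active = False
            print('watchdog disarmed')


def ha_cycle(dcs, watchdog, is_leader, is_paused):
    """One iteration of a simplified HA loop."""
    if is_leader and not is_paused:
        if not watchdog.active:
            # Even if keepalive() runs up to SAFETY_MARGIN seconds after the
            # leader key update, the watchdog still fires no later than the
            # moment the leader key expires.
            watchdog.activate(TTL - SAFETY_MARGIN)
        dcs.update_leader()
        watchdog.keepalive()   # called right after the leader key update
    else:
        # Not the leader (or paused): no protection needed, disarm.
        watchdog.disable()


if __name__ == '__main__':
    dcs, watchdog = FakeDCS(), FakeWatchdog()
    ha_cycle(dcs, watchdog, is_leader=True, is_paused=False)   # arm + tickle
    ha_cycle(dcs, watchdog, is_leader=False, is_paused=False)  # disarm
```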
It wasn't a big issue when on_start was called during a normal bootstrap with initdb, because usually such a process is very fast. But the situation changes when we run a custom bootstrap, because a long time might pass between the cluster becoming connectable and the end of recovery and promote.
Actually the situation was even worse than that: on_start was called with the `replica` argument and on_role_change was never called later, because the promote wasn't performed by Patroni.
As a solution to this problem we will block any callbacks during bootstrap and explicitly call on_start after the leader lock has been taken.
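A rough sketch of that approach, with invented names rather than Patroni's real classes: callbacks are silently dropped while bootstrap is in progress, and on_start is fired explicitly once the leader lock has been taken.
```python
import subprocess


class CallbackRunner:
    """Toy callback dispatcher; names and structure are illustrative only."""

    def __init__(self, callbacks):
        self.callbacks = callbacks          # e.g. {'on_start': '/path/to/script'}
        self.bootstrapping = True

    def call(self, action, role, cluster_name):
        if self.bootstrapping:
            # During bootstrap the role reported to callbacks would be wrong
            # (e.g. 'replica' during a custom bootstrap), so skip them.
            return
        script = self.callbacks.get(action)
        if script:
            subprocess.call([script, action, role, cluster_name])

    def bootstrap_finished_and_leader_lock_taken(self, cluster_name):
        # Bootstrap is over and we hold the leader lock: now it is safe to
        # report the real role with an explicit on_start call.
        self.bootstrapping = False
        self.call('on_start', 'master', cluster_name)
```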
The task of restoring a cluster from a backup or cloning an existing cluster into a new one had been floating around for some time. It was kind of possible to achieve by doing a lot of manual actions, and it was very error prone. So I came up with the idea of making the way we bootstrap a new cluster configurable.
In short - we want to run a custom script instead of running initdb.
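For illustration, a hedged sketch of how such a configurable bootstrap could be dispatched; the configuration keys and the `{data_dir}` substitution below are simplified assumptions, not necessarily the exact interface Patroni ended up with.
```python
import shlex
import subprocess


def bootstrap(config, data_dir):
    """Run either initdb or a user-provided bootstrap command.

    `config` mimics a parsed bootstrap section, e.g.:
        {'method': 'clone_from_backup',
         'clone_from_backup': {'command': 'restore.sh --target {data_dir}'}}
    """
    method = config.get('method', 'initdb')
    command = (config.get(method) or {}).get('command')

    if command:
        # Custom bootstrap: run the user's script; recovery and promote may
        # take a long time after the cluster becomes connectable.
        cmd = shlex.split(command.format(data_dir=data_dir))
    else:
        # Default bootstrap: plain initdb.
        cmd = ['initdb', '-D', data_dir]

    return subprocess.call(cmd) == 0
```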
* Use more readable 'listen' section
* Explicitly defined check options
* Enable stats dashboard
* Close HAProxy connections to a backend server when it's marked as down.
As described here, https://github.com/postgres/postgres/blob/master/src/backend/replication/slot.c#L164-L210, the slot name should be truncated to 63 chars; instead we are getting an error related to the slot_name:
```
FATAL: replication slot name "c_formationid4_main_m_1_c_formationid4_main_m_nicksaccount_svc_c" is too long
```
the leader has the following slot names
```
postgres=# select * from pg_replication_slots;
                            slot_name                            | plugin | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
------------------------------------------------------------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
 c_formationid4_main_m_1_c_formationid4_main_m_nicksaccount_svc_  |        | physical  |        |          | f      |            |      |              |             |
 c_formationid4_main_m_2_c_formationid4_main_m_nicksaccount_svc_  |        | physical  |        |          | f      |            |      |              |             |
(2 rows)
```
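One possible client-side workaround, shown here only as a sketch (not code from any of the projects involved): truncate generated slot names to postgres' 63-character limit (NAMEDATALEN - 1) and append a short hash so that long names stay unique after truncation.
```python
import hashlib

MAX_SLOT_NAME_LEN = 63  # NAMEDATALEN - 1 in postgres


def make_slot_name(raw_name):
    """Return a replication slot name that fits into postgres' length limit."""
    if len(raw_name) <= MAX_SLOT_NAME_LEN:
        return raw_name
    # Keep the name unique after truncation by appending a short digest.
    digest = hashlib.md5(raw_name.encode('utf-8')).hexdigest()[:8]
    return raw_name[:MAX_SLOT_NAME_LEN - len(digest) - 1] + '_' + digest


print(make_slot_name('c_formationid4_main_m_1_c_formationid4_main_m_nicksaccount_svc_c'))
```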
On debian, the configuration files (postgresql.conf, pg_hba.conf, etc.) are not stored in the data directory. It would be great to be able to configure the location of this separate directory. Patroni could then override the existing configuration files in the place where they used to be.
The default is to store configuration files in the data directory. This setting targets custom installations like debian and any others that move configuration files out of the data directory.
Fixes #465
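A sketch of the resulting lookup; the `config_dir` parameter name is an assumption used for illustration, not necessarily the final option name.
```python
import os


def config_file_paths(data_dir, config_dir=None):
    """Return the locations where Patroni-managed config files would live.

    If config_dir is not set, fall back to the data directory (the default);
    on debian-style installations config_dir would point to something like
    /etc/postgresql/<version>/<cluster>.
    """
    base = config_dir or data_dir
    return {name: os.path.join(base, name)
            for name in ('postgresql.conf', 'pg_hba.conf', 'pg_ident.conf')}


print(config_file_paths('/var/lib/postgresql/9.6/main',
                        '/etc/postgresql/9.6/main'))
```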
This patch adds the sleep function needed in the connection_retry and command_retry parameters. The KazooClient tries to compare them with the ones included in the handler at instantiation.
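For illustration, a sketch of constructing a KazooClient whose retry objects carry the handler's sleep function; the retry parameters shown are arbitrary examples.
```python
from kazoo.client import KazooClient
from kazoo.handlers.threading import SequentialThreadingHandler
from kazoo.retry import KazooRetry

handler = SequentialThreadingHandler()

# KazooClient checks that the retry objects use the same sleep function as
# the handler, so pass handler.sleep_func explicitly when building them.
connection_retry = KazooRetry(max_delay=1, max_tries=-1,
                              sleep_func=handler.sleep_func)
command_retry = KazooRetry(deadline=10, sleep_func=handler.sleep_func)

client = KazooClient(hosts='localhost:2181', handler=handler,
                     connection_retry=connection_retry,
                     command_retry=command_retry)
```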
So far Patroni was populating pg_hba.conf only when running the bootstrap code, and after that it was not very handy to manage its content, because it was necessary to log in to every node, change pg_hba.conf manually and run pg_ctl reload.
This commit intends to fix that and give Patroni control over pg_hba.conf. It is possible to define the pg_hba.conf content via `postgresql.pg_hba` in the patroni configuration file or in the `DCS/config` (dynamic configuration).
If `hba_file` is defined in `postgresql.parameters`, Patroni will ignore `postgresql.pg_hba`.
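A sketch of the intended precedence with simplified inputs (not the actual Patroni code): the `postgresql.pg_hba` list is only written out when no custom `hba_file` is configured.
```python
import os


def maybe_write_pg_hba(data_dir, postgresql_config):
    """Write pg_hba.conf from the config unless the user manages hba_file themselves."""
    if 'hba_file' in postgresql_config.get('parameters', {}):
        # The user points postgres at their own hba file; leave it alone.
        return False

    lines = postgresql_config.get('pg_hba', [])
    if not lines:
        return False

    with open(os.path.join(data_dir, 'pg_hba.conf'), 'w') as f:
        f.write('\n'.join(lines) + '\n')
    # A reload (pg_ctl reload / SIGHUP) would be triggered afterwards.
    return True
```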
For backward compatibility this feature is not enabled by default. To enable it you have to set `postgresql.use_unix_socket: true`.
If the feature is enabled and `unix_socket_directories` is defined and non-empty, Patroni will use the first suitable value from it to connect to the local postgres cluster.
If `unix_socket_directories` is not defined, Patroni will assume that the default value should be used, will not pass `host` in the command line arguments and will omit it from the connection url. (A sketch of this selection logic follows below.)
Solves: https://github.com/zalando/patroni/issues/61
In addition to what is mentioned above, this commit solves a couple of bugs:
* manual failover with pg_rewind in a pause state was broken
* psycopg2 (or libpq, I am not really sure which exactly) doesn't mark a cursor's connection as closed when we use a unix socket and an `OperationalError` occurs. We will close such a connection on our own.
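A sketch of the unix-socket selection logic mentioned above (simplified, with an invented helper name):
```python
def local_connect_kwargs(use_unix_socket, unix_socket_directories, port=5432):
    """Build psycopg2 connection kwargs for the local postgres cluster."""
    kwargs = {'port': port, 'dbname': 'postgres'}
    if use_unix_socket:
        if unix_socket_directories:
            # Take the first suitable directory from the comma-separated GUC.
            first = unix_socket_directories.split(',')[0].strip()
            if first:
                kwargs['host'] = first
        # If unix_socket_directories is not set, omit 'host' entirely and let
        # libpq use its compiled-in default socket directory.
    else:
        kwargs['host'] = '127.0.0.1'
    return kwargs


print(local_connect_kwargs(True, '/var/run/postgresql, /tmp'))
```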
because `boto.exception` is not an exception, but a python module.
+ increase retry timeout to 5 minutes
+ refactor unit-tests to cover the case with retries.
pg_controldata output depends on the postgres major version, and for old postgres versions some of the parameters are prefixed with 'Current '.
The bug was introduced by commit 37c1552.
Fixes https://github.com/zalando/patroni/issues/455
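A sketch of version-tolerant parsing (illustrative only, not the actual fix):
```python
def parse_controldata(output):
    """Parse pg_controldata output into a dict, normalizing key names."""
    result = {}
    for line in output.splitlines():
        if ':' not in line:
            continue
        key, _, value = line.partition(':')
        key = key.strip()
        # Old postgres versions prefix some keys with 'Current '; strip it so
        # callers can look every parameter up under a single name.
        if key.startswith('Current '):
            key = key[len('Current '):]
        result[key] = value.strip()
    return result


print(parse_controldata('Current some setting:               42\n'
                        'Database cluster state:             in production\n'))
```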
Previously we were running pg_rewind only in a limited number of cases:
* when we knew postgres was a master (no recovery.conf in the data dir)
* when we were doing a manual switchover to a specific node (no guarantee that this node is the most up-to-date)
* when a given node has the nofailover tag (it could be ahead of the new master)
This approach was kind of working in most cases, but sometimes we were executing pg_rewind when it was not necessary, and in some other cases we were not executing it although it was needed.
The main idea of this PR is to first figure out whether we really need to run pg_rewind, by analyzing the timeline id, LSN and history file on the master and the replica, and to run it only when it is needed.
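Very roughly, the check boils down to comparing the local timeline and LSN with the switch points recorded in the master's timeline history; the following is a simplified sketch with invented inputs, not the actual implementation.
```python
def rewind_needed(local_timeline, local_lsn, master_history):
    """Decide whether the local data directory has diverged from the master.

    `master_history` is a list of (timeline, switchpoint_lsn) entries parsed
    from the master's timeline history file.
    """
    for timeline, switchpoint in master_history:
        if timeline == local_timeline:
            # The master switched away from our timeline: only if our WAL
            # position went past the switch point did we produce diverging
            # WAL that pg_rewind has to remove.
            return local_lsn > switchpoint
    # Our timeline is not among the master's old timelines: either we are on
    # the same (latest) timeline as the master and no rewind is needed, or
    # something is badly inconsistent and must be handled separately.
    return False


# Example: we are on timeline 2 and wrote WAL past the point where the new
# master switched to timeline 3, so a rewind would be required.
print(rewind_needed(2, 0x3000100, [(1, 0x1000000), (2, 0x3000000)]))
```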
The current UI to change the cluster configuration is somewhat unfriendly: it involves a curl command, knowing the REST API endpoint, knowing the specific syntax to call it with, and writing a JSON document. I added two commands in this branch to make this a bit easier, `show-config` and `edit-config` (names are merely placeholders, any opinions on better ones?).
* `patronictl show-config clustername` fetches the config from DCS, formats it as YAML and outputs it.
* `patronictl edit-config clustername` fetches the config, formats it as YAML, invokes $EDITOR on it, then shows user the diff and after confirmation applies the changed config to DCS, guarding for concurrent modifications.
* `patronictl edit-config clustername --set synchronous_mode=true --set postgresql.use_slots=true` will set the specific key-value pairs.
There are also some UI capabilities I'm less sure of, but I included them here as I had already implemented them.
* If output is a tty then the diffs are colored. I'm not sure if this feature is cool enough to pull the weight of adding a dependency on cdiff. Or maybe someone knows of another, more task-focused diff coloring library?
* `patronictl edit-config clustername --pg work_mem=100MB` - Shorthand for `--set postgresql.parameters.work_mem=100MB`
* `patronictl edit-config clustername --apply changes.yaml` - apply changes from a yaml file.
* `patronictl edit-config clustername --replace new-config.yaml` - replace config with new version.
This was causing patroni to exit with the following exception:
2017-04-27 08:54:25,000 CRITICAL: system ID mismatch, node foo-0 belongs to a different cluster: 6347967071236960319 !=
2017-04-27 08:54:25,221 ERROR:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/patroni/__init__.py", line 134, in patroni_main
patroni.run()
File "/usr/local/lib/python3.5/dist-packages/patroni/__init__.py", line 110, in run
logger.info(self.ha.run_cycle())
File "/usr/local/lib/python3.5/dist-packages/patroni/ha.py", line 946, in run_cycle
info = self._run_cycle()
File "/usr/local/lib/python3.5/dist-packages/patroni/ha.py", line 903, in _run_cycle
sys.exit(1)
SystemExit: 1
Avoid "postmaster became multithreaded during startup" error on OS X built with --enable-nls (default for petere/homebrew).
The issue is that on OS X the libintl that replaces setlocale() spawns a thread when it needs to detect the current locale. Setting the LC_* and LANG variables prevents this, but we don't propagate them to the fork.
See: https://www.postgresql.org/message-id/flat/20140902013458.GB906981%40tornado.leadboat.com#20140902013458.GB906981@tornado.leadboat.com for an additional explanation of the original Postgres behavior.
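A sketch of the workaround (illustrative; the exact variable list and mechanism in Patroni may differ): copy the locale-related environment variables into the environment used to start postgres so they survive the fork/exec.
```python
import os
import subprocess


def start_postmaster(postgres_bin, data_dir):
    """Start postgres with locale variables propagated to the child process."""
    env = {'PATH': os.environ.get('PATH', '/usr/bin:/bin')}
    # Propagate LC_* and LANG so that libintl on OS X does not need to spawn
    # a thread to detect the current locale, which would make the postmaster
    # "become multithreaded during startup".
    for name, value in os.environ.items():
        if name.startswith('LC_') or name in ('LANG', 'LANGUAGE'):
            env[name] = value
    return subprocess.Popen([postgres_bin, '-D', data_dir], env=env)
```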
wal-e outputs in CSV format using the 'excel-tab' dialect: 3164de6852/wal_e/operator/backup.py (L63)
The ISO date may be written with a space instead of 'T' as the delimiter between date and time, which causes the old parsing to fail.
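For illustration, a hedged sketch of parsing such output with the csv module while accepting both delimiters; the column names in the sample are assumptions.
```python
import csv
import io
from datetime import datetime


def parse_backup_list(output):
    """Parse tab-separated ('excel-tab' dialect) wal-e backup-list output."""
    reader = csv.DictReader(io.StringIO(output), dialect='excel-tab')
    backups = []
    for row in reader:
        # The ISO timestamp may use either 'T' or a space between date and
        # time, so normalize it before parsing.
        modified = row['last_modified'].replace(' ', 'T')
        row['last_modified'] = datetime.strptime(modified[:19], '%Y-%m-%dT%H:%M:%S')
        backups.append(row)
    return backups


sample = ('name\tlast_modified\n'
          'base_000000010000000000000002_00000040\t2017-05-25 08:43:22.154176+00:00\n')
print(parse_backup_list(sample))
```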
* Update the startup script shutdown / startup timeout to allow the current sync node time to release the role in synchronous_mode.
Without this change Patroni will be killed while waiting, and will leave
PostgreSQL up rather than stopping it.
* Updates to support strict sync mode.
If synchronous_mode_strict==true then 'patroni_dummy_host' will be written as synchronous_standby_names when the last replication host dies (see the sketch after this list).
* Oops, mixed two pull requests - backing out one
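Returning to the strict sync mode item above, a sketch of how the value could be chosen (simplified, with an invented function name):
```python
def pick_synchronous_standby_names(candidates, strict=False):
    """Return the value to write into synchronous_standby_names.

    `candidates` are the names of streaming replicas eligible to be sync.
    """
    if candidates:
        return candidates[0]
    if strict:
        # Strict mode: never let the primary silently fall back to
        # asynchronous replication; a non-existent standby name blocks
        # commits until a real sync standby shows up again.
        return 'patroni_dummy_host'
    # Non-strict mode: disable synchronous replication entirely.
    return ''
```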
When all etcd servers refuse connections during a watch, the call fails with an exception and is immediately retried. This creates a huge amount of log spam, potentially creating additional issues on top of losing the DCS. This patch takes note when etcd failures are repeating and, starting from the second failure, sleeps for a second before retrying. It additionally omits the stack trace after the first failure in a streak of failures.
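A sketch of the backoff and log-suppression pattern described (not the actual Patroni code):
```python
import logging
import time

logger = logging.getLogger(__name__)


def watch_with_backoff(do_watch):
    """Call do_watch() in a loop, damping retries while etcd is unreachable."""
    failures = 0
    while True:
        try:
            do_watch()
            failures = 0
        except Exception:
            failures += 1
            if failures == 1:
                # First failure in a streak: log the full stack trace once.
                logger.exception('watch failed')
            else:
                # Subsequent failures: no stack trace, and sleep a second so a
                # dead DCS does not turn into a busy loop of log spam.
                logger.error('watch failed (attempt %d)', failures)
                time.sleep(1)
```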
The default value of wal_sender_timeout is 60 seconds, while we are trying to remove the replication slot after 30 seconds (ttl=30). That means postgres might think that the slot is still active and do nothing, while Patroni at the same time thinks it was removed successfully.
If the drop replication slot query didn't return a single row, we must fetch the list of existing physical replication slots from postgres on the next iteration of the HA loop.
Fixes: issue #425
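A sketch of the check (query and names simplified; the real code may differ): drop the slot only if it is inactive, and treat "no row returned" as a signal to re-read pg_replication_slots later.
```python
def drop_replication_slot(cursor, name):
    """Try to drop a physical replication slot; report whether it was dropped.

    Returns True only when exactly one inactive slot was found and dropped.
    If it returns False the caller should refresh the list of existing slots
    from pg_replication_slots on the next HA loop iteration, because the slot
    may still be considered active (e.g. wal_sender_timeout > ttl).
    """
    cursor.execute("""WITH slots AS (
                          SELECT slot_name
                            FROM pg_replication_slots
                           WHERE slot_name = %s AND NOT active)
                      SELECT pg_drop_replication_slot(slot_name) FROM slots""",
                   (name,))
    return cursor.rowcount == 1
```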
The `Postgresql.connection` method could be called from different threads at the same time, resulting in more than one connection being opened but only one being used afterwards.
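A sketch of the straightforward fix, guarding connection creation with a lock (simplified; uses psycopg2):
```python
import threading

import psycopg2


class Postgresql(object):
    """Minimal illustration of serializing connection creation with a lock."""

    def __init__(self, connect_kwargs):
        self._connect_kwargs = connect_kwargs
        self._connection = None
        self._connection_lock = threading.Lock()

    def connection(self):
        # Without the lock two threads could both see self._connection as None
        # and open two connections, leaking one of them.
        with self._connection_lock:
            if not self._connection or self._connection.closed != 0:
                self._connection = psycopg2.connect(**self._connect_kwargs)
                self._connection.autocommit = True
            return self._connection
```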