When Patroni was "joining" already running postgres it was not calling
callbacks, what in some cases causing issues (callback could be used to
change routing/load-balancer or assign/remove floating (service) ip.
In addition to that we should `start` postgres instead of `restart`-ing
it when doing recovery, because in this case 'on_start' callback should
be called, instead of 'on_restart'
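A minimal sketch of what firing the callback on join could look like (the script path, role and cluster name below are made up; Patroni passes the action, the current role and the cluster name to the configured callback script):

    import subprocess

    def call_callback(script, action, role, scope):
        # callbacks receive the action, the current role and the cluster name
        subprocess.Popen([script, action, role, scope])

    # after joining an already running postgres, fire 'on_start', not 'on_restart'
    call_callback('/etc/patroni/on_change.sh', 'on_start', 'replica', 'batman')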
In the original code we were parsing and deparsing url-style connection
strings back and forth. That was not particularly resource hungry, but it was
annoying, and it was also not obvious how to switch all local connections to
unix sockets (which are preferable).
This commit isolates the different use-cases of working with connection
strings and minimizes the amount of code parsing and deparsing them. It also
introduces one new helper method in the `Member` object - `conn_kwargs`.
This method accepts a dict with credentials (username and password) as a
parameter and returns a dict which can be used by `psycopg2.connect` or for
building connection urls for pg_rewind, pg_basebackup or other replica
creation methods.
Params for the local connection are built in the `_local_connect_kwargs`
method and can easily be switched to a unix socket later.
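A rough sketch of how the returned dict could be used (the contents of the dict are made up for illustration):

    import psycopg2

    # something like what Member.conn_kwargs({'user': ..., 'password': ...}) could return
    conn_kwargs = {'host': '10.0.0.2', 'port': '5432', 'database': 'postgres',
                   'user': 'replicator', 'password': 'secret'}

    # usable directly with psycopg2 ...
    conn = psycopg2.connect(**conn_kwargs)

    # ... or turned into a libpq connection string for pg_rewind/pg_basebackup
    conn_str = ' '.join('{0}={1}'.format(k, v) for k, v in sorted(conn_kwargs.items()))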
There is no point in trying to update the topology until the original request
has been performed. It is also more important for us to execute the original
request than to keep the topology of the etcd cluster in sync.
In addition to that, implement the same retry-timeout logic in the
`machines` property that is already used in the `api_execute` method.
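A simplified sketch of the deadline-based retry over the known etcd machines (function and variable names here are illustrative, not the actual Client code):

    import random
    import time
    import requests

    def get_machines(base_url):
        # etcd v2 exposes the list of cluster members at /v2/machines
        response = requests.get(base_url + '/v2/machines', timeout=2)
        response.raise_for_status()
        return [m.strip() for m in response.text.split(',')]

    def machines_with_retry(machines, retry_timeout=10):
        # try every known node in random order and keep retrying until the deadline
        deadline = time.time() + retry_timeout
        while True:
            for base_url in random.sample(machines, len(machines)):
                try:
                    return get_machines(base_url)
                except requests.RequestException:
                    continue
            if time.time() > deadline:
                raise RuntimeError('could not refresh the list of etcd machines')
            time.sleep(1)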
* Make different kazoo timeouts dependent on loop_wait
  ping timeout ~ 1/2 * loop_wait
  connect_timeout ~ 1/2 * loop_wait
Originally these values were calculated from the negotiated session timeout
and didn't work very well: it took a significant amount of time to figure out
that the connection was dead and to reconnect (up to the session timeout),
leaving us no time to retry.
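A minimal sketch of the intended split, assuming loop_wait comes from the configuration (applying the ping timeout requires hooking into kazoo's connection handling, which is omitted here):

    from kazoo.client import KazooClient

    loop_wait = 10  # seconds, from the configuration

    # leave roughly half of loop_wait for detecting a dead connection and half
    # for reconnecting, so there is still time in the cycle to retry
    ping_timeout = loop_wait / 2.0
    connect_timeout = loop_wait / 2.0

    client = KazooClient(hosts='127.0.0.1:2181')
    client.start(timeout=connect_timeout)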
* Address the code review
- We don't want to export the RestApi object, since it initializes the
socket and listens on it.
- Change get_dcs, so that an explicit scope passed to it takes priority
over the one in the configuration file.
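A minimal illustration of the intended precedence (not the actual get_dcs code):

    def resolve_scope(explicit_scope, config):
        # an explicitly passed scope wins over the one from the configuration file
        return explicit_scope if explicit_scope else config.get('scope')

    assert resolve_scope('batman', {'scope': 'from-config'}) == 'batman'
    assert resolve_scope(None, {'scope': 'from-config'}) == 'from-config'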
In particular, replace the fixed dates for the future actions
in the unit tests with dates that depend on the current date,
avoiding the "timebomb" effect.
Although such a situation should not happen in reality (the follow method is
not supposed to be called when the node is holding the leader lock and
postgres is running), it is better to be on the safe side and implement as
many checks as possible, because this method could potentially remove the
data directory.
The Client class takes care of retrying when the connection to an etcd node
fails. It calculates the number of retries and the timeout depending on the
etcd cluster size.
The Etcd class should not retry when the EtcdConnectionFailed exception is
raised (this case is already handled in the Client).
Besides that, adjust the retry timeouts in the Client class.
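The idea, roughly (names are illustrative): spread the overall retry budget across the nodes of the etcd cluster so that every machine gets at least one attempt before giving up:

    def per_node_timeout(retry_timeout, machines_count):
        # split the overall budget between the known etcd machines
        return max(1.0, float(retry_timeout) / max(machines_count, 1))

    def max_retries(machines_count):
        # one attempt per known etcd machine
        return max(machines_count, 1)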
Rewind is not possible when:
1) trying to rewind from itself
2) the leader is not reachable
3) the leader is_in_recovery
All these cases were leading to removal of the data directory...
In all cases except 1) it should "retry" once the leader becomes available
and is no longer is_in_recovery.
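A simplified sketch of these checks (the Leader structure and attribute names are illustrative, not Patroni's exact API):

    import collections

    Leader = collections.namedtuple('Leader', 'name conn_url in_recovery')

    def can_rewind_from(leader, my_name):
        if leader is None or not leader.conn_url:  # leader not known/reachable
            return False
        if leader.name == my_name:                 # never rewind from ourselves
            return False
        if leader.in_recovery:                     # leader must not be in recovery;
            return False                           # retry later instead of wiping data
        return True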
Fix return value in the should_run_scheduled_action and the comments.
Correct the json composition in the scheduled_restart test.
Fix the delete in case there is no scheduled restart.
Fix the usage of format in the logger output.
Fix the indentation in the evaluate_scheduled_restart.
Fix the condition related to the body_is_optional in the do_POST_restart.
Fix a few typos in the error messages.
Fix the _read_json_content.
Make the scheduled restart unit-tests a bit less ugly.
In addition to that, make the pg_ctl --timeout option configurable.
If the stop or start does not succeed within the given timeout when demoting
the master, the role will be forcibly changed to 'unknown' and all the
necessary callbacks will be executed.
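A minimal sketch of passing a configurable timeout to pg_ctl (the configuration option name and the surrounding code are not shown):

    import subprocess

    def pg_ctl_stop(data_dir, pg_ctl_timeout=60):
        # -t limits how long pg_ctl waits for the shutdown to complete
        return subprocess.call(['pg_ctl', 'stop', '-D', data_dir, '-m', 'fast',
                                '-t', str(pg_ctl_timeout), '-w'])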
Make sure the scheduled restart flag is cleared when the
postmaster_start_time changes after the time the restart was scheduled.
Additionally, separate the logic of checking the restart conditions
into its own function in order to support conditions for normal
restarts as well.
The scheduled restart data structures are currently independent of those
used by the normal restarts. This will be fixed in subsequent
commits.
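A sketch of the invalidation rule (dictionary keys and timestamps are illustrative): if postgres was started or restarted after the restart was scheduled, the pending scheduled restart is dropped:

    def scheduled_restart_is_stale(scheduled, current_postmaster_start_time):
        expected = scheduled.get('postmaster_start_time')
        return expected is not None and expected != current_postmaster_start_time

    pending = {'schedule': '2016-08-01T12:00:00+00:00',
               'postmaster_start_time': '2016-07-31 10:00:00.000 UTC'}
    if scheduled_restart_is_stale(pending, '2016-07-31 11:30:00.000 UTC'):
        pending = {}  # the restart was scheduled for a previous postmaster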
Add behave tests that cover POST /restart (but not DELETE).
The scheduled restart API extends the already existing restart
endpoint by processing the parameters in the request body.
Only one scheduled restart at a time is supported. The DELETE method
on the /restart endpoint is used to remove an existing scheduled restart.
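A hedged example of driving the endpoint (the request body format may differ from what Patroni actually accepts):

    import requests

    # schedule a restart for a point in the future
    requests.post('http://127.0.0.1:8008/restart',
                  json={'schedule': '2016-08-01T12:00:00+00:00'})

    # remove the pending scheduled restart
    requests.delete('http://127.0.0.1:8008/restart')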
It was causing patroni to fail to stop after receiving SIGTERM.
The acceptance tests were killing it with SIGKILL, which caused further tests to fail because postgres was still running:
2016-06-16 14:36:24,444 INFO: no action. i am the leader with the lock
2016-06-16 14:36:25,448 INFO: Lock owner: postgres0; I am postgres0
2016-06-16 14:36:25,452 ERROR: Failed to update /service/batman/optime/leader
Traceback (most recent call last):
File "/home/akukushkin/git/patroni/patroni/dcs/zookeeper.py", line 208, in write_leader_optime
self._client.retry(self._client.set, path, last_operation)
File "/home/akukushkin/git/patroni/py2/local/lib/python2.7/site-packages/kazoo/client.py", line 273, in _retry
return self._retry.copy()(*args, **kwargs)
File "/home/akukushkin/git/patroni/py2/local/lib/python2.7/site-packages/kazoo/retry.py", line 123, in __call__
return func(*args, **kwargs)
File "/home/akukushkin/git/patroni/py2/local/lib/python2.7/site-packages/kazoo/client.py", line 1219, in set
return self.set_async(path, value, version).get()
File "/home/akukushkin/git/patroni/py2/local/lib/python2.7/site-packages/kazoo/handlers/utils.py", line 74, in get
self._condition.wait(timeout)
File "/usr/lib/python2.7/threading.py", line 340, in wait
waiter.acquire()
File "/home/akukushkin/git/patroni/patroni/utils.py", line 219, in sigterm_handler
sys.exit()
SystemExit
2016-06-16 14:36:25,453 INFO: no action. i am the leader with the lock
2016-06-16 14:36:26,443 INFO: Lock owner: postgres0; I am postgres0
2016-06-16 14:36:26,444 INFO: no action. i am the leader with the lock
These parameters usually must be the same across all cluster nodes and
therefore must be set only via the global configuration and always passed as
a list of postgres arguments (via pg_ctl), making it impossible to
accidentally change them with 'ALTER SYSTEM'.
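A sketch of what passing such parameters on the postmaster command line could look like (parameter names and values here are just examples); options set this way take precedence over postgresql.auto.conf, so ALTER SYSTEM cannot override them:

    import subprocess

    critical_params = {'max_connections': 100, 'wal_level': 'hot_standby',
                       'max_wal_senders': 5}

    options = ' '.join('--{0}={1}'.format(k, v)
                       for k, v in sorted(critical_params.items()))
    subprocess.call(['pg_ctl', 'start', '-D', '/var/lib/postgresql/data',
                     '-o', options, '-w'])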