735 Commits

Author SHA1 Message Date
Feike Steenbergen
dd5bc1bc9b Merge branch 'master' into feature/replica-info 2016-08-24 11:55:33 +02:00
Alexander Kukushkin
96da6340a9 Calculate future restart time dynamically (#268)
`do_POST_restart` was randomly showing less than 100% coverage after 2016-08-20 due to hardcoded timestamps.
2016-08-24 09:46:56 +02:00
Feike Steenbergen
1fc8b43b36 Return replication information on the api
To enable better monitoring, it is useful to have replication statistics.
Addresses issue #261
2016-08-24 09:31:49 +02:00
Alexander Kukushkin
fa7aa71092 Always call on_start callback when starting Patroni (#262)
When Patroni was "joining" an already running postgres, it was not
calling callbacks, which in some cases caused issues (a callback could be
used to change routing/load-balancer or assign/remove a floating
(service) IP).

In addition to that we should `start` postgres instead of `restart`-ing
it when doing recovery, because in this case 'on_start' callback should
be called, instead of 'on_restart'
2016-08-18 09:35:13 +02:00
Oleksii Kliukin
179131893e Merge branch 'master' into feature/ctl_scaffolding 2016-08-10 11:49:08 +02:00
Alexander Kukushkin
8ef7178ddf Refactor code dealing with database connection string/params (#255)
In the original code we were parsing/deparsing url-style connection
strings back and forth. That was not really resource greedy but rather
annoying. Also it was not really obvious how to switch all local
connections to unix-sockets (preferably).

This commit isolates the different use cases of working with connection
strings and minimizes the amount of code parsing and deparsing them. It
also introduces one new helper method in the `Member` object,
`conn_kwargs`. This method can accept a dict with credentials (username
and password) as a parameter. It returns a dict which can be used by
`psycopg2.connect` or for building connection URLs for pg_rewind,
pg_basebackup or some other replica creation methods.

Params for local connections are built in the `_local_connect_kwargs`
method and could easily be changed to unix-socket later.
2016-08-10 10:19:52 +02:00
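A minimal sketch of the idea behind `conn_kwargs`, not the code from this commit: the `Member` class below is a hypothetical stand-in that only stores a connection URL, and the default port and database name are assumptions. It shows how a single method can turn a stored URL plus caller-supplied credentials into keyword arguments.

```python
from urllib.parse import urlparse


class Member:
    """Hypothetical stand-in for a cluster member; only conn_url is modelled."""

    def __init__(self, name, conn_url):
        self.name = name
        self.conn_url = conn_url

    def conn_kwargs(self, auth=None):
        """Return keyword arguments suitable for a libpq-style connect call."""
        parsed = urlparse(self.conn_url)
        kwargs = {
            'host': parsed.hostname,
            'port': parsed.port or 5432,  # assumed default PostgreSQL port
            'database': parsed.path.lstrip('/') or 'postgres',  # assumed default
        }
        if auth:
            # Credentials come from the caller, not from the stored URL.
            kwargs['user'] = auth.get('username')
            kwargs['password'] = auth.get('password')
        return kwargs


member = Member('postgres0', 'postgres://10.0.0.1:5433/postgres')
params = member.conn_kwargs({'username': 'replicator', 'password': 'secret'})
```

The resulting dict could be passed as `psycopg2.connect(**params)` or formatted into a connection URL for an external tool.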
Alexander Kukushkin
413a84836b Update etcd topology only after original request succeed (#254)
There is no point in trying to update the topology until the original
request has been performed. Also, for us it is more important to execute
the original request than to keep the topology of the etcd cluster in sync.

In addition to that implement the same retry-timeout logic in the
`machines` property which already is used in `api_execute` method.
2016-08-10 10:17:37 +02:00
Alexander Kukushkin
5fe74bec3b Make different kazoo timeouts depend on loop_wait (#243)
* Make different kazoo timeouts dependent on loop_wait

ping timeout ~ 1/2 * loop_wait
connect_timeout ~ 1/2 * loop_wait

Originally these values were calculated from the negotiated session
timeout and didn't work very well, because it took significant time to
figure out that the connection was dead and to reconnect (up to the
session timeout), which left us no time to retry.

* Address the code review
2016-08-10 10:15:09 +02:00
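The arithmetic described in this commit can be sketched as follows. The function name is hypothetical and the rounding behaviour of the real code is not shown in the message, so this simply halves `loop_wait`.

```python
def kazoo_timeouts(loop_wait):
    """Derive ping and connect timeouts (seconds) as roughly half of loop_wait."""
    half = loop_wait / 2.0
    return {'ping_timeout': half, 'connect_timeout': half}


timeouts = kazoo_timeouts(10)
```

With a `loop_wait` of 10 seconds, a dead connection would be noticed after about 5 seconds, leaving the rest of the cycle for a reconnect attempt.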
Murat Kabilov
a47a2bceff Manage scheduled restarts using patronictl (#248)
Manage scheduled restarts using patronictl
2016-08-09 12:54:48 +02:00
Oleksii Kliukin
ac7abfdd74 Minor fixes, address final rounds of code review. 2016-08-09 10:00:46 +02:00
Oleksii Kliukin
9fd01f6af4 Remove unused imports. 2016-08-08 16:48:14 +02:00
Oleksii Kliukin
d9102d2703 Remove the necessity of creating a RESTAPI object.
- We don't want to export RestApi object, since it initializes the
  socket and listens on it.
- Change get_dcs, so that the explicit scope passed to it will take
  priority over the one in the configuration file.
2016-08-08 16:15:57 +02:00
Oleksii Kliukin
53f991df0f More code-review related fixes
- Add missing delete_cluster.
- Simplify parts of the code by removing exception handlers where
  they are not needed.
- Fix typos.
2016-08-08 15:30:33 +02:00
Oleksii Kliukin
eeb8f1b694 Further address code reviews.
- Fix the issue in ctl that would result in setting the listen_address to True.
- Minor stylistic issues.
- Add unit-tests.
2016-08-08 12:21:01 +02:00
Oleksii Kliukin
13b4306f40 Remove one more occurrence of the time bomb 2016-07-14 16:53:02 +02:00
Oleksii Kliukin
6c9ffa4d3c Address the code review
In particular, replace the fixed dates for the future actions
in the unit tests with those that depend on the current date,
avoiding the "timebomb" effect.
2016-07-14 16:39:35 +02:00
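A sketch of how a test fixture can avoid the "timebomb" effect, assuming the schedule is exchanged as an ISO 8601 string (the helper names are hypothetical): the timestamp is derived from the current time instead of being a literal date that will eventually lie in the past.

```python
from datetime import datetime, timedelta, timezone


def future_restart_time(days_ahead=1):
    """Return an ISO 8601 timestamp that is in the future whenever it is called."""
    return (datetime.now(timezone.utc) + timedelta(days=days_ahead)).isoformat()


def is_in_future(timestamp):
    """Check a timezone-aware ISO 8601 timestamp against the current time."""
    return datetime.fromisoformat(timestamp) > datetime.now(timezone.utc)


schedule = future_restart_time()
```

A hardcoded value such as `2016-08-20T00:00:00+00:00` passes the same check only until that date arrives, which is the kind of failure these commits remove.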
Oleksii Kliukin
3181c4e59f Code review, asynchronous restarts.
- Make the restart initiated by the schedule asynchronous
- Fix the placeholders in logs.
- Fix the regexp to detect the PostgreSQL version.
2016-07-12 20:25:01 +02:00
Oleksii Kliukin
b17483b7dd Fix the PG version regex. 2016-07-11 15:21:31 +02:00
Oleksii Kliukin
c91eda8d78 Merge branch 'master' into feature/scheduled_restarts 2016-07-11 12:56:24 +02:00
Oleksii Kliukin
6da2eecb90 Increase the test coverage. 2016-07-11 11:51:07 +03:00
Alexander Kukushkin
659f7617f5 New option: remove_data_directory_on_rewind_failure
One more try to fix pg_rewind
2016-07-05 12:11:15 +02:00
Oleksii Kliukin
8834f929aa Improve the unit tests/coverage. 2016-07-05 10:07:29 +02:00
Alexander Kukushkin
a19dbfaddf Merge pull request #232 from zalando/bugfix/pg_rewind
Start readonly when holding leader lock
2016-07-04 13:11:35 +02:00
Alexander Kukushkin
b84e22c4ea Implement more checks in the follow method
Such a situation should not happen in reality (the follow method is not
supposed to be called when the node is holding the leader lock and
postgres is running), but to be on the safe side it is better to
implement as many checks as possible, because this method could
potentially remove the data directory.
2016-07-04 10:56:37 +02:00
Alexander Kukushkin
f9298d30ca Merge pull request #231 from zalando/bugfix/etcd-retry
Fix retry logic in etcd.py
2016-07-04 10:37:55 +02:00
Alexander Kukushkin
f7c6bd4eab Implement different connect strategy for zookeeper
Originally it was trying to connect for the duration of session_timeout.
Such a strategy doesn't work well during short network hiccups...
2016-07-01 12:31:29 +02:00
Alexander Kukushkin
ee529669d2 Start readonly when holding leader lock
Not starting postgres was causing a situation where there was no
master running...
2016-07-01 12:28:02 +02:00
Alexander Kukushkin
dc27a30800 Merge pull request #230 from zalando/bugfix/pg_rewind
Try to cover as much as possible pg_rewind corner-cases
2016-06-30 12:09:10 +02:00
Alexander Kukushkin
aa10f42913 checkpoint method returns string status message 2016-06-30 10:45:54 +02:00
Alexander Kukushkin
876cfdfb2d Fix retry logic in etcd.py
Client class takes care of retrying when connection to the etcd node
fails. It calculates the number of retries and the timeout depending on
the etcd cluster size.

Etcd class should not retry when EtcdConnectionFailed exception is
raised (this case is already handled in the Client).

Besides that adjust retry timeouts in the Client class.
2016-06-29 15:30:54 +02:00
Alexander Kukushkin
4b67008488 Try to cover as much as possible pg_rewind corner-cases
rewind is not possible when:
1) trying to rewind from itself
2) leader is not reachable
3) leader is_in_recovery

All these cases were leading to removal of the data directory...
In all cases except 1) it should "retry" when the leader becomes
available and is not is_in_recovery.
2016-06-29 14:29:31 +02:00
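The three cases listed in this commit can be sketched as a small decision function. Everything here is hypothetical (the `LeaderState` type, the function name, and the `'skip'` outcome for case 1, which the message does not spell out); the point is only that none of the cases ends in removing the data directory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaderState:
    """Hypothetical snapshot of what a node knows about the current leader."""
    name: str
    reachable: bool
    in_recovery: bool


def rewind_decision(my_name: str, leader: Optional[LeaderState]) -> str:
    """Return 'rewind', 'retry' or 'skip' for the given leader state."""
    if leader is None or not leader.reachable:
        return 'retry'   # case 2: wait until the leader becomes available
    if leader.name == my_name:
        return 'skip'    # case 1: a node cannot rewind from itself (assumed outcome)
    if leader.in_recovery:
        return 'retry'   # case 3: wait until the leader has left recovery
    return 'rewind'


decision = rewind_decision('postgres1', LeaderState('postgres0', True, False))
```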
Alexander Kukushkin
ae88e7c96e Document that every single zookeeper host:port MUST be quoted
otherwise the yaml library cannot parse the list.
Also make the yaml exception visible when trying to parse this list.
2016-06-29 14:25:50 +02:00
Oleksii Kliukin
d2832ee43b Address the code review.
Fix return value in the should_run_scheduled_action and the comments.
Correct the json composition in the scheduled_restart test.
Fix the delete in case there is no scheduled restart.
Fix the usage of format in the logger output.
Fix the indentation in the evaluate_scheduled_restart.
Fix the condition related to the body_is_optional in the do_POST_restart.
Fix a few typos in the error messages.
Fix the _read_json_content
Make the scheduled restart unit-tests a bit less ugly
2016-06-28 16:54:20 +02:00
Alexander Kukushkin
0318749b56 bugfix: api must report role=master during pg_ctl stop
In addition to that, make the pg_ctl --timeout option configurable.
If the stop or start doesn't succeed within the given timeout when
demoting the master, the role will be forcibly changed to 'unknown' and
all needed callbacks executed.
2016-06-28 14:14:42 +02:00
Oleksii Kliukin
568eb730bc Clear the scheduled restart after the normal one.
Make sure the scheduled restart flag is cleared when the
postmaster_start_time changes since the time restart was scheduled.

Additionally, separate the logic of checking the restart conditions
into the function in order to support conditions for the normal
restart as well.
2016-06-24 17:39:04 +02:00
Oleksii Kliukin
29845dd383 Restart the node according to the schedule.
The scheduled restart data structures are now independent of those
used by the normal restarts. This would be fixed in subsequent
commits.
Add the behave tests, that cover the POST /restart (but not DELETE).
2016-06-23 10:43:54 +02:00
Oleksii Kliukin
c2490d4831 Merge branch 'master' into feature/scheduled_restarts 2016-06-20 15:38:20 +02:00
Oleksii Kliukin
318ca6be38 Implement scheduling and deleting a restart.
The scheduled restart API extends the already existing restart
endpoint by processing the parameters in the request body.

Only one scheduled restart at a time is supported. The DELETE method
on the /restart endpoint is used to remove an existing scheduled restart.
2016-06-20 15:16:22 +02:00
Alexander Kukushkin
bd1e658080 Bugfix: obviously sys.hexversion was one symbol shorter
plus remove some unneeded code
2016-06-17 12:18:41 +02:00
Alexander Kukushkin
bd5440a102 Fix a typo and call sys.exit on sigterm
otherwise it will wait up to `loop_wait` seconds before exiting...
2016-06-16 15:19:21 +02:00
Alexander Kukushkin
69099b060e SystemExit exception was swallowed in a thread
It was causing patroni to fail to stop after receiving SIGTERM.
Acceptance tests were killing it with SIGKILL, which was causing further tests to fail because postgres was still running:
2016-06-16 14:36:24,444 INFO: no action.  i am the leader with the lock
2016-06-16 14:36:25,448 INFO: Lock owner: postgres0; I am postgres0
2016-06-16 14:36:25,452 ERROR: Failed to update /service/batman/optime/leader
Traceback (most recent call last):
  File "/home/akukushkin/git/patroni/patroni/dcs/zookeeper.py", line 208, in write_leader_optime
    self._client.retry(self._client.set, path, last_operation)
  File "/home/akukushkin/git/patroni/py2/local/lib/python2.7/site-packages/kazoo/client.py", line 273, in _retry
    return self._retry.copy()(*args, **kwargs)
  File "/home/akukushkin/git/patroni/py2/local/lib/python2.7/site-packages/kazoo/retry.py", line 123, in __call__
    return func(*args, **kwargs)
  File "/home/akukushkin/git/patroni/py2/local/lib/python2.7/site-packages/kazoo/client.py", line 1219, in set
    return self.set_async(path, value, version).get()
  File "/home/akukushkin/git/patroni/py2/local/lib/python2.7/site-packages/kazoo/handlers/utils.py", line 74, in get
    self._condition.wait(timeout)
  File "/usr/lib/python2.7/threading.py", line 340, in wait
    waiter.acquire()
  File "/home/akukushkin/git/patroni/patroni/utils.py", line 219, in sigterm_handler
    sys.exit()
SystemExit
2016-06-16 14:36:25,453 INFO: no action.  i am the leader with the lock
2016-06-16 14:36:26,443 INFO: Lock owner: postgres0; I am postgres0
2016-06-16 14:36:26,444 INFO: no action.  i am the leader with the lock
2016-06-16 14:59:13 +02:00
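One common way for `SystemExit` to get swallowed is an overly broad exception handler. The sketch below illustrates that general Python behaviour and does not reproduce Patroni's code: `sys.exit()` stands in for the SIGTERM handler firing in the middle of a DCS call, and both function names are hypothetical.

```python
import sys


def swallowing_step():
    """A bare except catches SystemExit, so the process keeps running."""
    try:
        sys.exit()  # stands in for the SIGTERM handler firing here
    except:  # noqa: E722 -- bare except also catches BaseException subclasses
        return 'swallowed'


def propagating_step():
    """SystemExit derives from BaseException, so 'except Exception' lets it through."""
    try:
        sys.exit()
    except Exception:
        return 'swallowed'


result = swallowing_step()

try:
    propagating_step()
    exited = False
except SystemExit:
    exited = True
```

Catching `Exception` rather than everything keeps ordinary error logging in place while still allowing the exit request to unwind the stack.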
Alexander Kukushkin
17f317665f Merge pull request #221 from zalando/feature/patronictl-auth
patronictl will send authorization header if it is configured
2016-06-16 12:57:14 +02:00
Alexander Kukushkin
010a2961cb Merge pull request #220 from zalando/feature/patronictl-newconf
Feature/patronictl newconf
2016-06-16 12:56:47 +02:00
Alexander Kukushkin
9f5276dd2b patronictl will send authorization header if it is configured
username:password can be configured in the 'restapi' section of config
file or via environment
2016-06-16 12:16:16 +02:00
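A sketch of building such a header from a `username:password` pair, assuming the HTTP Basic scheme (the commit message only says "authorization header"); the function name and the credentials are hypothetical.

```python
import base64


def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header from a username/password pair."""
    credentials = '{0}:{1}'.format(username, password).encode('utf-8')
    token = base64.b64encode(credentials).decode('ascii')
    return {'Authorization': 'Basic ' + token}


headers = basic_auth_header('admin', 'secret')
```

The returned dict can be merged into the headers of any request sent to the REST API.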
Alexander Kukushkin
bd6070e2b0 Make patronictl use config.py for loading config_file
config.py is not only loading config_file but also can build
configuration from environment variables.
2016-06-16 08:50:44 +02:00
Alexander Kukushkin
57807ff337 Don't expose replication user/passwd in DCS 2016-06-15 09:34:04 +02:00
Alexander Kukushkin
f2980b13fb Merge pull request #211 from zalando/feature/environment-configuration
Implement possibility to configure Patroni via environment
2016-06-14 10:10:09 +02:00
Alexander Kukushkin
c64170ef33 Extend list of postgres parameters controlled by Patroni
These parameters usually must be the same across all cluster nodes and
therefore must be set only via global configuration and always passed as
a list of postgres arguments (via pg_ctl) to make it impossible to
accidentally change them with 'ALTER SYSTEM'.
2016-06-13 10:33:14 +02:00
Alexander Kukushkin
9ecff0f64d Bugfixes
* GET /config was returning the latest "correct" version of the dynamic
  configuration.
* PATCH /config was breaking when trying to patch a non-dict value with a dict
2016-06-10 12:35:04 +02:00
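The second bugfix concerns patching a value that is not a dict with a dict. A minimal sketch of a recursive patch that tolerates this case follows; `patch_config` is a hypothetical name and this is not the code from the commit.

```python
def patch_config(config, patch):
    """Recursively apply patch to config, modifying config in place."""
    for key, value in patch.items():
        if isinstance(value, dict):
            current = config.get(key)
            if not isinstance(current, dict):
                # Target is missing or not a dict: replace it instead of recursing into it.
                current = config[key] = {}
            patch_config(current, value)
        else:
            config[key] = value
    return config


config = {'loop_wait': 10, 'postgresql': 'unset'}
patch_config(config, {'postgresql': {'parameters': {'max_connections': 100}}})
```

Without the `isinstance` check, recursing into the string `'unset'` would fail because a string has no `get` method.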
Alexander Kukushkin
49efb371f9 Make it possible to work without config.yml
Most of the basic configuration could be done via ENV
2016-06-09 14:44:29 +02:00