Commit Graph

48 Commits

Author SHA1 Message Date
Alexander Kukushkin
680444ae13 Reduce lock time taken by dcs.get_cluster() (#989)
`dcs.cluster` and `dcs.get_cluster()` were using the same lock, therefore when a `get_cluster()` call was slow due to DCS slowness it also blocked the `dcs.cluster` call, which in turn made health-check requests slow.
2019-03-12 22:37:11 +01:00
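A minimal sketch of the idea behind this commit (attribute and method names are illustrative, not Patroni's actual implementation): the cached cluster is served under a short lock, while the potentially slow DCS fetch holds a separate one.

```python
import threading

class DCS:
    def __init__(self):
        self._cluster_lock = threading.Lock()  # guards only the cached reference
        self._fetch_lock = threading.Lock()    # guards the (possibly slow) DCS call
        self._cached_cluster = None

    @property
    def cluster(self):
        # health checks read the cached value and are never blocked by a slow fetch
        with self._cluster_lock:
            return self._cached_cluster

    def get_cluster(self):
        with self._fetch_lock:
            cluster = self._load_cluster()      # may take a while when DCS is slow
            with self._cluster_lock:
                self._cached_cluster = cluster
            return cluster

    def _load_cluster(self):
        raise NotImplementedError
```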
Alexander Kukushkin
9bf074acfb Compatibility with python3 (#883)
A change of `loop_wait` was causing Patroni to disconnect from zookeeper and never reconnect. The error was happening only with python3 due to a difference in the implementation of the `select.select` function.
2018-11-30 11:40:34 +01:00
Alexander Kukushkin
fb01aaebc5 Compatibility with kazoo-2.6.0 (#872)
Recently kazoo 2.6.0 was released, which changes the way the create_connection method is called. Before, it was passed two positional arguments; in the new version all arguments are passed explicitly by name.
2018-11-19 14:26:20 +01:00
Alexander Kukushkin
4ca8a6e506 Make retries of calls to DCS consistent across implementations (#805)
In addition to that, do a small refactoring of zookeeper and consul and try to improve the stability of the acceptance tests (AT)
2018-09-06 08:37:26 +02:00
Alexander Kukushkin
03c2a85d23 Expose current timeline in DCS and via API (#591)
It is very easy to get the current timeline on the master by executing:
```sql
SELECT ('x' || SUBSTR(pg_walfile_name(pg_current_wal_lsn()), 1, 8))::bit(32)::int
```

Unfortunately the same method doesn't work when postgres is in recovery. Therefore we will use a replication connection for that on the replicas. In order to avoid opening and closing the replication connection on every HA loop, we will cache the result as long as its value matches the timeline of the master.

Also this PR introduces a new key in DCS: `/history`. It will contain a json-serialized object with the timeline history in a format similar to the usual history files. The differences are:
* The second column is the absolute wal position in bytes, instead of an LSN
* Optionally there might be a fourth column: a timestamp (the mtime of the history file)
2018-01-05 15:25:56 +01:00
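As an illustration of the `/history` format described above (the values and reason strings below are made up, not real data):

```python
import json

# Illustrative only: a JSON array resembling a timeline history file, but with the
# WAL position as absolute bytes and an optional timestamp from the history file's mtime.
history = [
    [1, 25099056, "no recovery target specified", "2018-01-03T14:12:07+01:00"],
    [2, 50331648, "no recovery target specified", "2018-01-04T09:30:11+01:00"],
]
print(json.dumps(history))
```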
Alexander Kukushkin
4328c15010 Make Patroni Kubernetes native (#500)
* Use ConfigMaps or Endpoints for leader election and to keep cluster state
* Label pods with a postgres role
* Change the behavior of pip install. From now on it will not install all dependencies; you have to explicitly specify the DCS you want to use Patroni with: `pip install patroni[etcd,zookeeper,kubernetes]`
2017-12-08 16:55:00 +01:00
Alexander Kukushkin
038b5aed72 Improve leader watch functionality (#356)
Previously replicas were always watching the leader key (even if postgres
was not running there). It was not a big issue, but it was not possible to
interrupt such a watch when postgres started up or stopped successfully.
It was also delaying the update_member call, so we had somewhat stale
information in DCS for up to `loop_wait` seconds. This commit changes that
behavior. If the async_executor is busy starting, stopping or restarting
postgres, we do not watch the leader key but wait for an event from the
async_executor for up to `loop_wait` seconds. The async executor fires such
an event only if the function it was calling returned something that
evaluates to boolean True.

Such functionality is really needed to change the way we decide whether
pg_rewind is necessary. That decision requires a local postgres to be
running, and it is really important for us to get such a notification as
soon as possible.
2016-11-22 16:22:30 +01:00
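A minimal sketch of the completion-event mechanism described above (class and function names are illustrative, not Patroni's actual code):

```python
import threading

class AsyncExecutor:
    def __init__(self):
        self.event = threading.Event()

    def run(self, func, args=()):
        result = func(*args)   # e.g. start / stop / restart postgres
        if result:             # wake the HA loop only on a "truthy" outcome
            self.event.set()
        return result


def ha_loop_iteration(executor, loop_wait):
    # While the executor is busy we do not watch the leader key; instead we
    # wait for its completion event for at most loop_wait seconds.
    executor.event.clear()
    executor.event.wait(timeout=loop_wait)
```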
Ants Aasma
7e53a604d4 Add synchronous replication support. (#314)
Adds a new configuration variable, synchronous_mode. When enabled, Patroni will manage synchronous_standby_names to enable synchronous replication whenever there are healthy standbys available. With synchronous mode enabled, Patroni will automatically fail over only to a standby that was synchronously replicating at the time of master failure. This effectively means zero lost user-visible transactions.

To enforce the synchronous failover guarantee, Patroni stores the current synchronous replication state in the DCS using strict ordering: first enable synchronous replication, then publish the information. A standby can use this to verify that it was indeed a synchronous standby before the master failed and is therefore allowed to fail over.

We can't enable multiple standbys as synchronous and let PostgreSQL pick one, because we can't know which one was actually set to be synchronous on the master when it failed. This means that on standby failure, commits will be blocked on the master until the next run_cycle iteration. TODO: figure out a way to poke Patroni to run sooner, or allow PostgreSQL to pick one without the possibility of lost transactions.

On graceful shutdown, standbys will disable themselves by setting a nosync tag and waiting for the master to notice and pick another standby. This adds a new mechanism for Ha to publish dynamic tags to the DCS.

When the synchronous standby goes away or disconnects, a new one is picked and Patroni switches the master over to it. If no synchronous standby exists, Patroni disables synchronous replication (synchronous_standby_names=''), but not synchronous_mode. In this case, only the node that was previously master is allowed to acquire the leader lock.

Added acceptance tests and documentation.

Implementation by @ants with extensive review by @CyberDem0n.
2016-10-19 16:12:51 +02:00
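A heavily simplified sketch of the strict ordering described above (the function and attribute names are illustrative, not Patroni's real API):

```python
def enable_sync_replication(postgresql, dcs, standby_name):
    # 1. Make the chosen standby synchronous on the master first.
    postgresql.set_synchronous_standby_names(standby_name)
    # 2. Only then publish the fact to the DCS. A standby that finds itself
    #    listed there can be sure it really was synchronous before the master failed.
    dcs.write_sync_state(leader=postgresql.name, sync_standby=standby_name)
```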
Alexander Kukushkin
5fe74bec3b Make different kazoo timeouts depend on loop_wait (#243)
* Make different kazoo timeouts dependent on loop_wait

ping timeout ~ 1/2 * loop_wait
connect_timeout ~ 1/2 * loop_wait

Originally these values were calculated from the negotiated session timeout
and didn't work very well, because it was taking significant time (up to the
session timeout) to figure out that the connection was dead and reconnect,
leaving us no time to retry.

* Address the code review
2016-08-10 10:15:09 +02:00
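For illustration, the relation between loop_wait and the kazoo timeouts described above could be expressed as follows (a sketch, not Patroni's exact formula):

```python
def kazoo_timeouts(loop_wait):
    # ping timeout and connect timeout are both roughly half of loop_wait,
    # so a dead connection can be detected and retried within one HA cycle
    ping_timeout = loop_wait / 2.0
    connect_timeout = loop_wait / 2.0
    return ping_timeout, connect_timeout

# kazoo_timeouts(10) -> (5.0, 5.0)
```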
Alexander Kukushkin
f7c6bd4eab Implement different connect strategy for zookeeper
Originally it was trying to connect for the whole session_timeout.
Such a strategy doesn't work well during short network hiccups...
2016-07-01 12:31:29 +02:00
Alexander Kukushkin
5f4e582660 Merge branch 'master' of github.com:zalando/patroni into feature/dynamic-configuration 2016-06-09 11:04:28 +02:00
Alexander Kukushkin
50d118c3aa Split ZooKeeper and Exhibitor
Originally Exhibitor was supported inside the ZooKeeper class, and the
configuration for Exhibitor was also taken from the `zookeeper` section of
the yaml config file. In fact, Exhibitor just extends ZooKeeper; this is now
reflected in the code, and Exhibitor got its own section in the config.yaml
file. This will make it easier to configure Exhibitor hosts and port via
environment variables once PR#211 is merged.
2016-06-08 19:21:18 +02:00
Alexander Kukushkin
b3ada161cf Implement possibility to configure retry_timeout globally
Previously it was hardcoded all over the place.
2016-05-31 10:30:53 +02:00
Alexander Kukushkin
7827951c8c Dynamic configuration 2016-05-25 14:17:05 +02:00
Alexander Kukushkin
6104d688d9 Merge branch 'master' of github.com:zalando/patroni into feature/sighup 2016-05-19 14:27:04 +02:00
Alexander Kukushkin
0c2aad98a3 Move dcs implementations into dcs package 2016-05-19 10:57:18 +02:00
Alexander Kukushkin
1741fa7e0f Minimize number of references to dcs implementations from tests
where they are not necessary (test_ha, test_ctl, etc...).
It will simplify further refactoring and make it possible to install
implementations of AbstractDCS independently of each other.
2016-05-19 10:00:32 +02:00
Alexander Kukushkin
d422e16aad Implement reload of config.yaml on SIGHUP
If some changes require a restart of postgres, patroni will expose a
`restart_pending` flag in DCS and via the REST API
2016-05-13 13:31:21 +02:00
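A minimal sketch of reacting to SIGHUP (names such as `patroni.config.reload` and `set_restart_pending` are illustrative, not the actual Patroni API):

```python
import signal

def setup_sighup_reload(patroni):
    def reload_config(signo, frame):
        patroni.config.reload()                 # re-read config.yaml
        if patroni.config.restart_required:     # some settings need a postgres restart
            patroni.set_restart_pending(True)   # surfaced in DCS and via the REST API

    signal.signal(signal.SIGHUP, reload_config)
```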
Alexander Kukushkin
0d3dca56ff In some cases Ha.cluster can be None after calling get_cluster
Such a situation was causing patroni to crash. Usually it was happening during
manual failover, after the former master had demoted itself and the
`reset_cluster` method had been called. In this case `fetch_cluster` was
`False` and the `_load_cluster` method was returning the value from
`self._cluster`, which was `None`.
2016-03-24 12:06:39 +01:00
Alexander Kukushkin
3a7d2c3874 Remove unused code from unit tests 2016-03-21 20:48:17 +01:00
Alexander Kukushkin
54055c1ff8 Rename ambiguous Failover.member to candidate
But! 'member' is still accepted by the REST API, and the name 'member' is also
used to store/read this value to/from DCS (for backward compatibility)
2016-03-18 15:59:47 +01:00
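One way to picture the backward-compatible rename (a sketch only; Patroni's actual Failover object may be implemented differently):

```python
class Failover:
    def __init__(self, leader, candidate):
        self.leader = leader
        self.candidate = candidate      # the unambiguous new name

    @property
    def member(self):
        # old name kept as a read-only alias for REST API / DCS compatibility
        return self.candidate
```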
Alexander Kukushkin
0e0c8ed8d7 Implement delete_cluster interface for all available dcs
In addition to that, rename the confusing `Etcd.client` and
`ZooKeeper.client` to `_client`. This attribute is available from
AbstractDCS and people had the wrong impression that it provides the same
interface for different DCS implementations, which is obviously not the
case: for Etcd it has type etcd.Client and for ZooKeeper, KazooClient.
2016-03-15 16:25:48 +01:00
Alexander Kukushkin
df9b8fed2e Improve quality of code by resolving issues found by quantifiedcode and codacy 2016-02-12 12:23:49 +01:00
Alexander Kukushkin
a6603e8b48 bugfix in zookeeper module:
when the master node was being attached to patroni/zookeeper (no cluster in
zookeeper yet), patroni never tried to "refetch" the cluster from DCS.
It was leading to a demote...
2015-10-08 13:07:38 +02:00
Alexander Kukushkin
8a844285ff Set fetch_cluster flag to False when _inner_load_cluster called
Set the same flag to True if the cluster does not yet exist in
ZooKeeper
2015-10-07 16:48:39 +02:00
Alexander Kukushkin
d8f4b09478 use Event.wait instead of sleep
it makes it possible to break the "sleep", for example from the API

plus small bugfix: catch ValueError exception from json.loads
2015-10-02 10:26:48 +02:00
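A small sketch of the Event.wait idea (illustrative, not Patroni's exact code):

```python
import threading

wakeup = threading.Event()

def interruptible_sleep(seconds):
    # Unlike time.sleep(), this wait can be cut short from another thread
    # (e.g. a REST API handler calling wakeup.set()).
    wakeup.wait(timeout=seconds)
    wakeup.clear()
```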
Alexander Kukushkin
d09875a056 refactoring:
1. run touch_member from the main loop
2. move the code which takes care of long-running tasks into a separate class
3. change the format of data stored in DCS: use json instead of a url
4. change the Member class: from now on it deserializes everything into the data property
5. rework the API: from now on it takes into account the state of the current node in DCS
2015-10-01 17:06:42 +02:00
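To illustrate items 3 and 4 above (the JSON document shown is made up, not the exact schema):

```python
import json

class Member:
    def __init__(self, name, value):
        self.name = name
        try:
            # the value stored in DCS is now a JSON document rather than a bare URL
            self.data = json.loads(value)
        except ValueError:
            self.data = {}

m = Member('pg-node-1', '{"conn_url": "postgres://127.0.0.1:5432/postgres", "state": "running"}')
print(m.data['state'])
```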
Alexander Kukushkin
c218054d05 Implement manual failover
The implementation is done on top of the feature/is-healthiest-via-api and
feature/api branches.
In order to trigger a manual failover one has to create a 'failover' key in
the configuration store with a value in the following format:
'leader_name:member_name'
leader_name can be empty or should match the name of the current leader
member_name can be empty or should match the name of one of the cluster
nodes
The leader always checks that either the desired member (if specified) or one
of the members is accessible and healthy before demoting.
After the leader has demoted itself, the other nodes check that the desired
node is healthy. If it is not, they participate in a leader race. In some
cases (when there accidentally are no healthy nodes) the former leader can
also participate in the leader race.

The current implementation does not provide a REST API endpoint for a manual
failover.
2015-09-28 17:00:42 +02:00
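A small sketch of parsing the 'failover' key value described above (the helper function itself is illustrative):

```python
def parse_failover_value(value):
    # value has the form 'leader_name:member_name'; either side may be empty
    leader, _, candidate = value.partition(':')
    return leader or None, candidate or None

# parse_failover_value('pg-node-1:pg-node-2') -> ('pg-node-1', 'pg-node-2')
# parse_failover_value(':pg-node-2')          -> (None, 'pg-node-2')
```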
Alexander Kukushkin
6e9cb60fd5 Restart and reinitialize via api
POST /restart -- will restart postgres.
If you are restarting the leader node, the lock will be maintained during the
restart.

POST /reinitialize -- will reinitialize the node from the leader.
It's not possible to reinitialize the current leader.
The command will fail when the leader is unknown.
2015-09-24 14:52:03 +02:00
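For illustration, the two endpoints could be exercised like this (the port 8008 is an assumption, not stated in the commit):

```python
import requests

# restart postgres on this node (the leader lock is kept if it is the leader)
requests.post('http://127.0.0.1:8008/restart')       # port 8008 is an assumption

# re-initialize this node from the leader (fails on the leader itself)
requests.post('http://127.0.0.1:8008/reinitialize')
```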
Alexander Kukushkin
d8982e1e5a Refactor Postgresql.query method to use common retry mechanism
The query method in api.py also needs a retry in some cases (for example
when we are running the is_healthiest_node check).
In all cases we should retry only when the connection is closed or broken.
BUT, the connection status must be checked via cursor.connection (the old
implementation was using the shared connection object for that). For
multi-threaded applications this is not appropriate, because some other
thread might have restored the connection.

In addition to that, I've changed most of the unit tests to use `Mock` and
`patch` where possible.
2015-09-20 13:54:30 +02:00
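A sketch of the retry rule described above, assuming psycopg2-style cursors (the helper and its signature are illustrative):

```python
import psycopg2

def query_with_retry(get_cursor, sql, retries=1):
    for attempt in range(retries + 1):
        cursor = get_cursor()
        try:
            cursor.execute(sql)
            return cursor.fetchall()
        except psycopg2.Error:
            # check the connection this cursor actually belongs to; another
            # thread may already have replaced the shared connection object
            if not cursor.connection.closed or attempt == retries:
                raise
```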
Alexander Kukushkin
7f8e95b334 Next run of ha cycle is rescheduled depending on return value of watch
The current etcd implementation does not yet support the timeout option when
`wait=true`: https://github.com/coreos/etcd/issues/2468

Originally I implemented the `watch` method for the `Etcd` class in the
following manner: if the leader key was updated just because the master
needed to update the ttl and the watch timeout had not yet expired, I was
recalculating the timeout and starting the `watch` call once again.
Usually after such a "restart" we were getting urllib3.exceptions.TimeoutError.
The only possible way to recover from such an exception is to close the socket
and establish a new connection. With pure http that is relatively cheap, but
with https and some kind of authorization on the etcd side it becomes rather
expensive and should be avoided.
2015-09-16 10:38:34 +02:00
Alexander Kukushkin
90cfcf0c14 Make zookeeper module compatible with python3 2015-09-14 17:14:39 +02:00
Alexander Kukushkin
209c985420 get_node and get_children should catch only NoNodeError exception.
All other exceptions must propagate in order for the retry functionality to
work correctly.
2015-09-14 11:45:00 +02:00
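A minimal sketch of that rule using kazoo (the wrapper function itself is illustrative):

```python
from kazoo.exceptions import NoNodeError

def get_node(client, path):
    try:
        return client.get(path)
    except NoNodeError:
        return None  # a missing node is a normal, expected case
    # any other exception propagates so kazoo's retry machinery can handle it
```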
Alexander Kukushkin
f494d2ce64 Build Cluster object for ZooKeeper the same way as for Etcd
The previous implementation was always setting Cluster.initialize to True.
Also it was throwing a ZooKeeperError when there were no members in the
cluster.

Plus a BUGFIX of a bug introduced with
https://github.com/zalando/patroni/pull/34 in the `load_members` method.
- data = self.get_node(self.member_path)
+ data = self.get_node(self.members_path + member)
It was always fetching the same node for all cluster members.
Fortunately Etcd doesn't have this problem because we fetch the
whole cluster directory with one recursive API call.
2015-09-14 11:19:46 +02:00
Oleksii Kliukin
2377c417e4 Fix etcd and zookeeper interactions with the initialize key.
Fix unittests as well.
2015-09-10 16:05:10 +02:00
Oleksii Kliukin
938b946e55 Merge branch 'master' into feature/cleanup_on_failed_initialization 2015-09-10 15:43:31 +02:00
Alexander Kukushkin
36cbd34ffc Fix zookeeper test coverage 2015-09-09 15:59:02 +02:00
Oleksii Kliukin
ff499604f0 Act on removal of initialization flag.
If initializer node suddenly dies before the initialization is complete,
other nodes should try to take over.

Fix some unittests for etcd and zookeeper and add a couple of new ones.
2015-09-08 16:04:54 +02:00
Oleksii Kliukin
92647b7aad Merge branch 'master' of https://github.com/zalando/patroni into feature/cleanup_on_failed_initialization 2015-09-08 14:54:52 +02:00
Oleksii Kliukin
b842ed478b Make sure initialize flag is reset on failure.
Clean up the initialize flag if the initializing node fails
to bootstrap its PostgreSQL database.

Rename dcs.race to initialize, since we only call it for the
initialize flag. Factored out PostgreSQL bootstrapping code
into a separate function.
2015-09-08 12:03:34 +02:00
Alexander Kukushkin
1774d6e31a Merge branches watch-leader-key and package-refactoring 2015-09-05 16:09:28 +02:00
Alexander Kukushkin
650e244904 Refactor directory structure in preparation for building pypi-package 2015-09-04 16:06:44 +02:00
Alexander Kukushkin
10c95a23e4 Rename sleep to watch in AbstractDCS
This method is supposed to watch for changes of the leader key if the current
node is not the leader, and it could also watch for changes in the members
list if the current node is the leader.
2015-09-01 09:59:37 +02:00
Alexander Kukushkin
3b1efff53e Refactor Cluster object
`Cluster.leader` is no longer a reference to a `Member`, but to a `Leader`.
The `Leader` class contains a field `index` (update index). This field is very
useful for watching for events that change the leader key. Also `Leader`
contains a `member` field, which should reference the real member.
2015-08-27 10:53:22 +02:00
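One way to picture the refactored objects (a sketch with only the fields described above; Patroni's real classes may carry more):

```python
from collections import namedtuple

Leader = namedtuple('Leader', ['index', 'member'])   # index = update index of the leader key
Member = namedtuple('Member', ['name', 'data'])

leader = Leader(index=42, member=Member(name='pg-node-1', data={}))
```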
Alexander Kukushkin
befd33555d Refactor helpers/etcd.py
Work with etcd cluster via high-level python-etcd module.
Plus change all unit tests accordingly.
2015-08-24 16:58:08 +02:00
Alexander Kukushkin
dcad7a3229 Add exhibitor support
The list of ZooKeeper nodes can be periodically updated from Exhibitor.
Since we know that each Exhibitor accompanies one ZooKeeper node, a list of
Exhibitor nodes is also maintained. Exhibitor assumes that all ZooKeeper
nodes are using the same client port, 2181. The same assumption is valid
for Exhibitor itself: it should always listen on the same port on all nodes.

The original list of Exhibitor nodes is cached and used as a fallback when
querying information via the maintained list fails.
2015-07-10 10:46:33 +02:00
Alexander Kukushkin
c49580d6a7 Rename governor into patroni 2015-07-08 10:37:35 +02:00
Alexander Kukushkin
43b12af3a7 Implement possibility to work against ZooKeeper
This implementation uses the same interface (AbstractDCS) as the Etcd
class. It means there should be no problem implementing another
plugin to work against Consul, for example.
2015-07-07 12:45:14 +02:00
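To illustrate the plugin idea (the abstract method names below are examples, not the exact AbstractDCS contract):

```python
import abc

class AbstractDCS(abc.ABC):
    @abc.abstractmethod
    def get_cluster(self):
        """Return the current cluster state stored in the DCS."""

    @abc.abstractmethod
    def attempt_to_acquire_leader(self):
        """Try to take the leader key; return True on success."""


class ZooKeeper(AbstractDCS):
    # a ZooKeeper-backed implementation fills in the same interface,
    # so a Consul plugin could be added in exactly the same way
    def get_cluster(self):
        raise NotImplementedError

    def attempt_to_acquire_leader(self):
        raise NotImplementedError
```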