Commit Graph

38 Commits

Author SHA1 Message Date
Alexander Kukushkin
f1f2389146 A couple of small improvements in acceptance tests (#1057)
* Keep basebackup and wal_archive next to PGDATA in the data directory
* Test bootstrap of standby cluster nodes with custom scripts
2019-05-13 16:33:19 +02:00
Alexander Kukushkin
e38fe78b56 Fix callbacks behavior (mostly for standby cluster) (#998)
First of all, this patch changes the behavior of `on_start`/`on_restart` callbacks, they will be called only when postgres is started or restarted without role changes. In case if the member is promoted or demoted only the `on_role_change` callback will be executed. `on_role_change` was never called for standby leader, only `on_start`/`on_restart` and with a wrong role argument.
Before that `on_role_change` was never called for standby leader, only `on_start`/`on_restart` and with a wrong role argument.

In addition to that, the REST API will return standby_leader role for the leader of the standby cluster.

Closes https://github.com/zalando/patroni/issues/988
2019-03-29 10:28:07 +01:00
Michael Banck
073074f83e Run coverage as python -m coverage (#968)
Depending on the platform the coverage binary might not always be available under the standard name.
2019-02-13 16:02:12 +01:00
Michael Banck
345e6d3131 Copy away output directories of failed acceptance tests. (#967)
And dump logs on travis from only failed features
2019-02-13 16:00:15 +01:00
Michael Banck
d01a9bdcd5 Change base port for acceptance tests from 5440 to 5360 (#966) 2019-02-13 15:59:13 +01:00
Alexander Kukushkin
381a5b80d2 Release 1.5.4 (#931)
* Bump version
* Update release notes
* Make it possible to configure registration of Service in Consul via env variables
2019-01-15 12:14:19 +01:00
Alexander Kukushkin
1a0876e5ca Refactor acceptance tests to improve stability (#884)
Hope it will crash less often when executed on travis against k8s
2018-11-30 12:40:56 +01:00
Alexander Kukushkin
2efd97baab Permanent replication slots (#819)
Permanent replication slots are preserved on failover/switchover, that is Patroni on the new primary will create configured replication slots right after doing promote.

Slots could be configured with the help of `patronictl edit-config`.
The initial configuration could be also done in the `bootstrap.dcs`

```yaml
slots:
  permanent_physical_1:
    type: physical
  permanent_logical_1:
    type: logical
    database: foo
    plugin: pgoutput
```

It is the responsibility of the operator to make sure that there are no clashes in names between replication slots automatically created by Patroni for members and permanent replication slots.

Closes https://github.com/zalando/patroni/issues/656
2018-10-31 11:37:42 +01:00
Alexander Kukushkin
e939304001 Take and apply some parameters from controldata when starting as replica (#703)
* Take and apply some parameters from controldata when starting as replica

https://www.postgresql.org/docs/10/static/hot-standby.html#HOT-STANDBY-ADMIN
There is set of parameters which value on the replica must be not smaller than on the primary, otherwise replica will refuse to start:
* max_connections
* max_prepared_transactions
* max_locks_per_transaction
* max_worker_processes

It might happen that values of these parameters in the global configuration are not set high enough, what makes impossible to start a replica without human intervention. Usually it happens when we bootstrap a new cluster from the basebackup.

As a solution to this problem we will take values of above parameters from the pg_controldata output and in case if the values in the global configuration are not high enough, apply values taken from pg_controldata and set `pending_restart` flag.
2018-06-12 14:04:32 +02:00
Alexander Kukushkin
4328c15010 Make Patroni Kubernetes native (#500)
* Use ConfigMaps or Endpoins for leader elections and to keep cluster state
* Label pods with a postgres role
* change behavior of pip install. From now on it will not install all dependencies, you have to specify explicitly DCS you want to use Patroni with: `pip install patroni[etcd,zookeeper,kubernetes]`
2017-12-08 16:55:00 +01:00
Alexander Kukushkin
8e3511ca6b Different minor fixes (#551)
* Use unix line endings
* Make flake8 happy
2017-11-02 16:24:17 +01:00
Ants Aasma
70d718a058 Simplify watchdog code (#452)
* Only activate watchdog while master and not paused

We don't really need the protections while we are not master. This way
we only need to tickle the watchdog when we are updating leader key or
while demotion is happening.

As implemented we might fail to notice to shut down the watchdog if
someone demotes postgres and removes leader key behind Patroni's back.
There are probably other similar cases. Basically if the administrator
if being actively stupid they might get unexpected restarts. That seems
fine.

* Add configuration change support. Change MODE_REQUIRED to disable leader eligibility instead of closing Patroni.

Changes watchdog timeout during the next keepalive when ttl is changed. Watchdog driver and requirement can also be switched online.

When watchdog mode is `required` and watchdog setup does not work then the effect is similar to nofailover. Add watchdog_failed to status API to signify this. This is True only when watchdog does not work **AND** it is required.

* Reset implementation when config changed while active.

* Add watchdog safety margin configuration

Defaults to 5 seconds. Basically this is the maximum amount of time
that can pass between the calls to odcs.update_leader()` and
`watchdog.keepalive()`, which are called right after each other. Should
be safe for pretty much any sane scenario and allows the default
settings to not trigger watchdog when DCS is not responding.

* Cancel bootstrap if watchdog activation fails

The system would have demoted itself anyway the next HA loop. Doing it
in bootstrap gives at least some other node chance to try bootstrapping
in the hope that it is configured correctly.

If all nodes are unable to activate they will continue to try until the
disk is filled with moved datadirs. Perhaps not ideal behavior, but as
the situation is unlikely to resolve itself without administrator
intervention it doesn't seem too bad.
2017-07-27 12:16:11 +02:00
Alexander Kukushkin
d5b3d94377 Custom bootstrap (#454)
Task of restoring a cluster from backup or cloning existing cluster into a new one was floating around for some time. It was kind of possible to achieve it by doing a lot of manual actions and very error prone. So I come up with the idea of making the way how we bootstrap a new cluster configurable.

In short - we want to run a custom script instead of running initdb.
2017-07-18 15:12:58 +02:00
Alexander Kukushkin
acc6d7c2c2 Watchdog unit-tests, bugfixes and questions (#449)
Implement missing unit-tests for and drop unused code
2017-07-11 10:00:30 +02:00
Alexander Kukushkin
681b6b507b Support unix sockets when connecting to a local postgres cluster (#457)
For backward compatibility this feature is not enabled by default. To enable it you have to set `postgresql.use_unix_socket: true`.
If feature is enable, and `unix_socket_directories` is defined and non empty, Patroni will use the first suitable value from it to connect to the local postgres cluster.
If the `unix_socket_directories` is not defined, Patroni will assume that default value should be used and will not pass `host` to command line arguments and omit it from connection url.

Solves: https://github.com/zalando/patroni/issues/61

In addition to mentioned above, this commit solves couple of bugs:
* manual failover with pg_rewind in a pause state was broken
* psycopg2 (or libpq, I am not really sure what exactly) doesn't mark cursors connection as closed when we use unix socket and there is an `OperationalError` occurs. We will close such connection on our own.
2017-06-22 11:47:57 +02:00
Ants Aasma
a70b46ef13 Add watchdog support on Linux (#343)
Ensures that system gets rebooted before TTL runs out.

Initial version. Open questions:

    Do we want to disable watchdog while we are not master?
2017-06-01 16:53:46 +02:00
Alexander Kukushkin
37c1552c0a Smart pg_rewind (#417)
Previously we were running pg_rewind only in limited amount of cases:
 * when we knew postgres was a master (no recovery.conf in data dir)
 * when we were doing a manual switchover to a specific node (no
   guaranty that this node is the most up-to-date)
 * when a given node has nofailover tag (it could be ahead of new master)

This approach was kind of working in most of the cases, but sometimes we
were executing pg_rewind when it was not necessary and in some other
cases we were not executing it although it was needed.

The main idea of this PR is first try to figure out that we really need
to run pg_rewind by analyzing timelineid, LSN and history file on master
and replica and run it only if it's needed.
2017-05-19 16:32:06 +02:00
Alexander Kukushkin
d138a8db17 AT for master_start_timeout + minor fixes (#361) 2016-12-09 12:02:41 +01:00
Alexander Kukushkin
1e573aec8f Do session/renew call to Consul when update_leader is called (#336) 2016-10-10 10:05:55 +02:00
Alexander Kukushkin
4594bc98da Increase timeouts when running AT on travis (#324)
* Increase timeouts two times when running AT on travis
* Make up to 3 attempts to download DCS
* Get rid from hard-coded names
2016-09-28 15:13:09 +02:00
Alexander Kukushkin
0b1bfeca5b Make sure that we are running and testing latest versions of everything (#303) 2016-09-19 13:32:53 +02:00
Alexander Kukushkin
ae88e7c96e Document that every single zookeeper host:port MUST be quoted
otherwise yaml library can not parse the list.
And make visible yaml exception when trying to parse this list.
2016-06-29 14:25:50 +02:00
Alexander Kukushkin
fcde17583c Acceptance tests for patronictl
Call patronictl.py when it's possible instead of doing REST API calls.
2016-06-16 15:06:18 +02:00
Alexander Kukushkin
5f4e582660 Merge branch 'master' of github.com:zalando/patroni into feature/dynamic-configuration 2016-06-09 11:04:28 +02:00
Alexander Kukushkin
50d118c3aa Split ZooKeeper and Exhibitor
Originally Exhibitor was supported in the ZooKeeper class and
configuration for Exhibitor was taken also from `zookeeper` section in
the yaml config file. In fact, Exhibitor just extends ZooKeeper and now
it is reflected in the code and also Exhibitor got it's own section in
the config.yaml file. It will make it easier to configure Exhibitor
hosts and port via environment variables when PR#211 will be merged.
2016-06-08 19:21:18 +02:00
Alexander Kukushkin
6700cd0aa6 Implement reload of config.yml with REST API call
and acceptance tests for that
2016-05-26 17:09:40 +02:00
Alexander Kukushkin
45cbc8ca70 Implement acceptance test for dynamic configuration functionality
and fix some bugs revealed by acceptance tests
2016-05-26 10:16:24 +02:00
Alexander Kukushkin
ceace03646 Address codacy and travis issues 2016-05-25 14:49:33 +02:00
Alexander Kukushkin
7827951c8c Dynamic configuration 2016-05-25 14:17:05 +02:00
Alexander Kukushkin
eabfd82a5d Implement Consul support 2016-04-27 10:59:01 +02:00
Alexander Kukushkin
fd4f12aac8 Do not assume that connection user is postgres, but take it from config.yml 2016-04-21 13:56:09 +02:00
Alexander Kukushkin
f8bf1bb0ab Disable sudo, reshuffle travis tasks and introduce caching
Without sudo travis is executing build tasks using docker and waiting
time in this case is really small, usually not longer then 10 seconds.

postgresql-9.5 is installed via addons.apt.packages (without sudo)
But ports 5432 and 5433 are busy. So I had to ajust environment.py to
assign port from higher diapason.

And a few words about build tasks:
First task is used for executing unit tests for all different python versions
The second one is used for executing acceptance tests against etcd
The third one is used for executing acceptance tests against zookeeper
acceptance tests are executed with python2.7 and python3.5

In addition that I've introduced caching of python virtual environment.
It really helps to reduce time needed to install python modules.
2016-04-13 13:32:39 +02:00
Alexander Kukushkin
24a2ea6cef Refactor acceptance tests to make them work against ZooKeeper
and make it easier to implement controllers for new DCS, i.e. consul
2016-04-10 10:37:43 +02:00
Alexander Kukushkin
79f4d9a13b Attempt to export acceptance tests coverage results to coveralls 2016-03-13 09:42:02 +01:00
Alexander Kukushkin
62f11ab747 Attempt to export acceptance tests coverage results to coveralls 2016-03-13 09:09:31 +01:00
Alexander Kukushkin
ba444adb67 make codacy and quantifiedcode happier 2016-03-11 15:32:16 +01:00
Alexander Kukushkin
5f6beae22f Enforce data-type checks for step matcher
and increase default timeout for patroni start
2016-03-11 14:46:14 +01:00
Alexander Kukushkin
30d3982d25 Acceptance tests with behave 2016-03-11 12:56:29 +01:00