1475 Commits

Author SHA1 Message Date
Oleksii Kliukin
84d804e579 Release notes 1.4 (#597)
Document Kubernetes parameters and environment variables. Describe how Patroni uses Kubernetes.
v1.4
2018-01-10 11:17:08 +01:00
Alexander Kukushkin
d1312a7ce4 Do not try to load history file when timeline=1 (#596)
00000001.history doesn't exist
2018-01-09 12:01:14 +01:00
Oleksii Kliukin
d14d9f669a Document pip-related installation options. (#595)
* Remove redundant requirements of Mac OS.

* Clarify how to run the example in getting started.
2018-01-08 13:59:31 +01:00
Alexander Kukushkin
5668367181 Implement /sync and /async endpoints (#578)
Each responds with HTTP status code 200 only when the node is running as a synchronous or asynchronous replica, respectively.

Fixes https://github.com/zalando/patroni/issues/189
Fixes https://github.com/zalando/patroni/issues/415
2018-01-05 15:28:40 +01:00
Alexander Kukushkin
03c2a85d23 Expose current timeline in DCS and via API (#591)
It is very easy to get current timeline on the master by executing
```sql
SELECT ('x' || SUBSTR(pg_walfile_name(pg_current_wal_lsn()), 1, 8))::bit(32)::int
```

Unfortunately, the same method doesn't work when postgres is in recovery. Therefore, on the replicas we use a replication connection for that. To avoid opening and closing a replication connection on every HA loop, we cache the result as long as its value matches the timeline of the master.

Also this PR introduces a new key in DCS: `/history`. It will contain a json serialized object with timeline history in a format similar to the usual history files. The differences are:
* Second column is the absolute wal position in bytes, instead of LSN
* Optionally, there might be a fourth column: a timestamp (mtime of the history file)
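
As an illustration of the arithmetic described above, here is a minimal Python sketch. The helper names `timeline_from_walfile_name` and `lsn_to_bytes` are hypothetical and are not taken from Patroni's code. The first mirrors the SQL expression (the first 8 hex digits of the WAL file name are the timeline); the second shows how a textual LSN maps to the absolute position in bytes that the second column of `/history` is described as using.

```python
def timeline_from_walfile_name(walfile_name):
    """Return the timeline encoded in the first 8 hex digits of a WAL file name."""
    return int(walfile_name[:8], 16)


def lsn_to_bytes(lsn):
    """Convert a textual LSN such as '29/1A000140' to an absolute byte position."""
    high, low = lsn.split('/')
    # The part before the slash is the upper 32 bits, the part after it the lower 32 bits.
    return (int(high, 16) << 32) + int(low, 16)
```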
2018-01-05 15:25:56 +01:00
Alexander Kukushkin
18786464a1 Rename failover to switchover and make new failover work without leader (#588)
In addition to that implement /switchover endpoint as an alias to /failover endpoint and implement more checks like:
* candidate must be provided for a failover
* switchover can't be scheduled in a pause state
* and so on

Fixes https://github.com/zalando/patroni/issues/585
Fixes https://github.com/zalando/patroni/issues/520
2018-01-05 15:17:56 +01:00
Alexander Kukushkin
3a96ffa718 Expose pause state of every member to DCS and via REST (#592)
and implement patronictl pause|resume --wait on top of that

Fixes https://github.com/zalando/patroni/issues/349
2018-01-05 15:16:45 +01:00
Alexander Kukushkin
6b01d2787f More improvements in patronictl (#590)
Make specifying cluster_name optional for some more commands.
If it is not specified, its value is taken from the config file.
2018-01-04 12:26:13 +01:00
Alexander Kukushkin
2b8618b027 Minimize amount of SELECTS issued by Patroni on every loop (#584)
On every iteration of the HA loop, Patroni needs to call pg_is_in_recovery() and calculate the absolute wal_position. It was doing two separate SELECT statements for that. On the master it was even doing three queries (wal_position twice).
We will issue one SELECT per HA loop and cache the results.
2018-01-04 11:17:43 +01:00
Ants Aasma
15d1767402 Some improvements to patronictl (#571)
* Use scope from config file when listing members

* Add version command to patronictl

* Only delete leader on shutdown when we have the lock to avoid exceptions when leader key does not exist

* Add a timestamp option to list command.

* YAML format for patronictl output

* Fix API request to get version
2018-01-04 10:35:22 +01:00
Alexander Kukushkin
0e01bb33bb Improve patronictl reinit (#576)
Make it possible to cancel a running task if you want to reinitialize replica.
There are two possible ways to trigger it:
1. patronictl will ask whether you want to cancel already running task if an attempt to trigger reinitialize has failed
2. if you are using `--force` argument with `patronictl reinit`
2018-01-04 10:31:44 +01:00
Alexander Kukushkin
b6425cab85 Allow to specify multiple hosts for etcd (#589)
This list will be used for initial discovery of etcd cluster members.
If for some reason this list of hosts is exhausted during work, Patroni will return to the initial list.

In addition to that improve ipv6 compatibility by using a special function for splitting host and port.

Fixes https://github.com/zalando/patroni/issues/523
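
A minimal Python sketch of splitting host and port in an IPv6-aware way is shown below. It is an illustration only, not Patroni's actual function: the names `split_host_port` and `parse_hosts`, the comma-separated input format, and the rule that IPv6 literals must be bracketed are all assumptions made for this example.

```python
def split_host_port(value, default_port):
    """Split 'host[:port]' into (host, port); IPv6 literals are expected in brackets."""
    if value.startswith('['):
        # Bracketed IPv6 literal: '[::1]' or '[::1]:2379'
        host, _, rest = value[1:].partition(']')
        port = rest[1:] if rest.startswith(':') else ''
    else:
        host, _, port = value.partition(':')
    return host, (int(port) if port else default_port)


def parse_hosts(hosts, default_port):
    """Turn a comma-separated string of hosts into a list of (host, port) tuples."""
    return [split_host_port(h.strip(), default_port) for h in hosts.split(',') if h.strip()]
```

A naive `value.split(':')` would cut an address such as `::1` into pieces, which is why the bracketed form is handled separately here.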
2018-01-04 10:25:06 +01:00
Alexander Kukushkin
84de53603f Update travis settings (#581)
* Add master branch and release tags to safelist
* Update build matrix: don't install python3.5 if running acceptance tests
2017-12-20 16:28:09 +01:00
Alexander Kukushkin
062c55f99c Update readthedocs config (#580)
* Get Patroni version from patroni/version.py
* Update copyright to match with the LICENSE file

Fixes https://github.com/zalando/patroni/issues/519
2017-12-20 14:28:12 +01:00
Alexander Kukushkin
fa5769468a Update python versions list (#577)
new travis image has 2.7 and 3.6 preinstalled by default
2017-12-19 15:35:13 +01:00
Alexander Kukushkin
7e72d1a75f Bump zookeeper version (#573)
3.4.9 can't be downloaded anymore and acceptance test with zookeeper/exhibitor fails
2017-12-08 18:40:11 +01:00
Alexander Kukushkin
4328c15010 Make Patroni Kubernetes native (#500)
* Use ConfigMaps or Endpoints for leader elections and to keep cluster state
* Label pods with a postgres role
* Change behavior of pip install. From now on it will not install all dependencies; you have to explicitly specify the DCS you want to use Patroni with: `pip install patroni[etcd,zookeeper,kubernetes]`
2017-12-08 16:55:00 +01:00
Alexander Kukushkin
bd847fd2cc Patronictl extended info (#567)
* Show information about scheduled failover and maintenance mode when showing list of cluster members. Fixes https://github.com/zalando/patroni/issues/557

* Fix postgres version check functions (postgres 10 and above compatibility) and apply pep8 formatting to the tests.
* Bump some configuration parameters to match with postgres 10 defaults.
* Fix name of contributor in release notes.
2017-11-28 12:10:05 +01:00
Ants Aasma
5da0e12353 Factor out postmaster process (#561)
Introduces a PostmasterProcess object that identifies a running process via pid and start time.
When pid file is parsed and the correct process identified this object is passed around.
When the process goes away we try to find a new one in case somebody restarted postgres behind our back.
2017-11-23 14:36:23 +01:00
Alexander Kukushkin
a89a902f4a Bump version and write release notes (#560)
and implement missing unit-tests
v1.3.6
2017-11-10 11:48:50 +01:00
Alexander Kukushkin
2e86fe5991 Consul dc (#559)
Make it possible to specify dc for consul as PATRONI_CONSUL_DC environment variable and update documentation accordingly.
2017-11-10 11:21:47 +01:00
Ants Aasma
7367b7c74a Verify process start time when checking if postgres is running. (#549)
After a crash that doesn't clean up postmaster.pid there could be a new process with the same pid resulting in a false positive for is_running(), which will lead to all kinds of bad behavior.

Fixes #548
2017-11-09 15:36:05 +01:00
ainlolcat
cfa957eb96 shutdown postgresql before bootstrap when we lost data directory (#553)
Tries to kill postgresql before bootstrap to prevent old process from interfering.
Fixes https://github.com/zalando/patroni/issues/542
2017-11-09 15:20:51 +01:00
V Aitvaras
ad7a1b8a16 Make it possible to provide datacenter configuration for Consul (#558)
```yaml
consul:
  url: http://consul.host:8500
  token: long-token-here
  dc: dev1-d1
```
2017-11-06 16:44:30 +01:00
Alexander Kukushkin
4daaf2beb0 Perform crash recovery in a single user mode if postgres died as master (#554)
But do it only if pg_rewind is enabled or there is no master at the moment.
Such "crash recovery" procedure was advised by Heikki Linnakangas
2017-11-03 16:22:39 +01:00
Alexander Kukushkin
8d926cbc86 Always send token in X-Consul-Token http header (#555)
Fixes https://github.com/zalando/patroni/issues/552
2017-11-03 16:22:07 +01:00
Alexander Kukushkin
823a4d6b8e Adjust session ttl if supplied value is smaller than minimum possible (#556)
It could happen that ttl provided in Patroni configuration is smaller
than minimum supported by Consul. In such case Consul agent fails to
create a new session and responds with 500 Internal Server Error and
http body contains something like: "Invalid Session TTL '3000000000',
must be between [10s=24h0m0s]". Without session Patroni is not able to
create member and leader keys in the Consul KV store and it means that
cluster becomes completely unhealthy.

As a workaround we will handle such exception, adjust ttl to the minimum
possible and retry session creation.
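
The workaround can be sketched in Python roughly as follows. This is a simplified illustration, not the code from this commit: `InvalidSessionTTL`, `min_ttl_from_error`, `create_session_with_retry` and the `create_session` callable are hypothetical names, and the only thing assumed about Consul is the error text quoted above.

```python
import re


class InvalidSessionTTL(Exception):
    """Hypothetical stand-in for the error raised when Consul rejects a TTL."""


def min_ttl_from_error(message):
    """Extract the lower bound in seconds from a 'must be between [10s=24h0m0s]' message."""
    match = re.search(r'must be between \[(\d+)s=', message)
    return int(match.group(1)) if match else None


def create_session_with_retry(create_session, ttl):
    """Call create_session(ttl); if the TTL is rejected as too small, retry once with the minimum."""
    try:
        return create_session(ttl)
    except InvalidSessionTTL as e:
        min_ttl = min_ttl_from_error(str(e))
        # Re-raise when the message has no lower bound or the TTL was not too small.
        if min_ttl is None or ttl >= min_ttl:
            raise
        return create_session(min_ttl)
```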

In addition to that make it possible to define custom log format via environment variable `PATRONI_LOGFORMAT`
2017-11-03 16:21:53 +01:00
Alexander Kukushkin
8e3511ca6b Different minor fixes (#551)
* Use unix line endings
* Make flake8 happy
2017-11-02 16:24:17 +01:00
Alexander Kukushkin
7c000f1519 Update releases.rst
v1.3.5
2017-10-12 15:03:13 +02:00
Alexander Kukushkin
1e856e4ec6 Update release notes
2017-10-12 15:03:13 +02:00
Alexander Kukushkin
ae1a8f8942 Update release notes
2017-10-12 15:03:13 +02:00
Alexander Kukushkin
31d4d7878e Bump version to 1.3.5
2017-10-12 15:03:13 +02:00
Alexander Kukushkin
34db670331 Improve test coverage
2017-10-12 15:03:13 +02:00
Alexander Kukushkin
94c52991e0 Set role to uninitialized if data directory was removed in runtime
Fixes https://github.com/zalando/patroni/issues/542
2017-10-12 15:03:13 +02:00
Alexander Kukushkin
8e9c62d002 Make it possible to change Consul session checks (#543)
If the list of checks is not specified, Consul will use "serfHealth" in addition to the TTL-based check created by Patroni.
In some cases people want to sacrifice fast detection of network partitioning in favor of the ability to tolerate network lags.

Fixes https://github.com/zalando/patroni/issues/522
2017-10-12 15:01:31 +02:00
Alexander Kukushkin
cfdda23e27 Fix pg_rewind behaviour (#524)
When Patroni calculates whether it should run pg_rewind or not, it relies on pg_controldata output or gets the necessary information from a replication connection.
In some cases (for example, when postgres running as a master was killed), we can't use the pg_controldata output immediately and have to try starting postgres first. Such a start could fail with the following error:
```
LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,""
LOG,00000,"database system was interrupted; last known up at 2017-09-16 22:35:22 UTC",,,,,,,,,""
LOG,00000,"restored log file ""00000006.history"" from archive",,,,,,,,,""
LOG,00000,"entering standby mode",,,,,,,,,""
2017-09-18 08:00:39.433 UTC,,,57,,59bf7d26.39,4,,2017-09-18 08:00:38 UTC,,0,LOG,00000,"restored log file ""00000006.history"" from archive",,,,,,,,,""
FATAL,XX000,"requested timeline 6 is not a child of this server's history","Latest checkpoint is at 29/1A000178 on timeline 5, but in the history of the requested timeline, the server forked off from that timeline at 29/1A000140.",,,,,,,,""
LOG,00000,"startup process (PID 57) exited with exit code 1",,,,,,,,,""
LOG,00000,"aborting startup due to startup process failure",,,,,,,,,""
LOG,00000,"database system is shut down",,,,,,,,,""
```
In this case controldata will still have `Database cluster state: in production`.
All further attempts to start postgres will fail. This situation can be fixed only by starting postgres not in recovery. For safety, we will do it in single-user mode.

The second problem is: if postgres was running as master, but was later started and stopped, then pg_controldata will report:
```
Database cluster state:               shut down in recovery
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
```

And this info can't be used for calculations. In this case we should use
`Latest checkpoint location` and `Latest checkpoint's TimeLineID`
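
The fallback described above can be sketched in Python as follows. This is a hedged illustration rather than the code from this commit: `parse_controldata` and `rewind_reference_point` are hypothetical helper names, and the only assumption about pg_controldata is the `key: value` line layout shown in the output above.

```python
def parse_controldata(output):
    """Parse 'key: value' lines of pg_controldata output into a dict."""
    data = {}
    for line in output.splitlines():
        # Split on the first colon only, so values containing colons stay intact.
        key, sep, value = line.partition(':')
        if sep:
            data[key.strip()] = value.strip()
    return data


def rewind_reference_point(data):
    """Return (location, timeline) as strings, falling back to the latest checkpoint."""
    location = data.get('Minimum recovery ending location')
    timeline = data.get("Min recovery ending loc's timeline")
    if location in (None, '0/0') or timeline in (None, '0'):
        # The minimum recovery fields carry no information, so use the checkpoint fields.
        location = data.get('Latest checkpoint location')
        timeline = data.get("Latest checkpoint's TimeLineID")
    return location, timeline
```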
2017-09-29 14:21:19 +02:00
Ants Aasma
32b0768631 Fix watchdog on Python 3 (#531)
A misunderstanding of the ioctl() call interface. If mutable=False then fcntl.ioctl() actually returns the arg buffer back.
This accidentally worked on Python2 because int and str comparison did not return an error.
Error reporting is actually done by raising IOError on Python2 and OSError on Python3.

* Properly handle errors in set_timeout(), have them result in only a warning if watchdog support is not required.

* Improve watchdog device driver name display on Python3

* Eliminate race condition in watchdog feature tests.
  The pinged/closed states were not getting reset properly if the checks ran too quickly.
  Add explicit reset points in feature test so the check is unambiguous.
2017-09-29 10:27:10 +02:00
Alexander Kukushkin
8a584f7a61 Set pgpass explicitly to /tmp/pgpass0 when running unit-tests (#518)
If $HOME is set to a non-existing directory (which would e.g. be the case on an official Debian package autobuilder) some tests were failing
2017-09-12 16:07:20 +02:00
Alexander Kukushkin
3919b322f4 Release 1.3.4 (#515)
Fix documentation and update release notes
v1.3.4
2017-09-08 10:56:09 +02:00
Andrew Colin Kissa
53715e689a Pass the consul token as a header (#513)
Headers are now the preferred way to pass the token to the consul API - https://www.consul.io/api/index.html#authentication
2017-09-07 16:59:49 +02:00
Alexander Kukushkin
5ef01cfdfa Advanced configuration for Consul (#506)
* possibility to specify client certs and cacert
* possibility to specify token
* compatibility with python-consul-0.7.1
2017-08-24 07:56:12 +02:00
Alexander Kukushkin
4f87ea96ca "Could not take out TTL lock" message was never logged (#502)
This is not a critical bug, because `attempt_to_acquire_leader` method was still returning False in this case.
2017-08-24 07:55:30 +02:00
Alexander Kukushkin
23152a7fc4 synchronous_standby_names must be quoted with quote_ident (#505)
in addition to that implement additional checks around manual failover and recover when synchronous_mode is enabled

* Comparison must be case insensitive
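
For illustration, identifier quoting and case-insensitive comparison can be sketched in Python like this. The functions `quote_ident` and `same_member_name` here are hypothetical helpers written for this example; the quoting rule they apply (wrap in double quotes, double any embedded double quote) is the standard PostgreSQL rule for quoted identifiers.

```python
def quote_ident(value):
    """Quote a name as a PostgreSQL identifier, doubling embedded double quotes."""
    return '"{0}"'.format(value.replace('"', '""'))


def same_member_name(a, b):
    """Compare two member names case-insensitively."""
    return a.lower() == b.lower()
```

A name such as `node-1` is not a valid unquoted identifier, so it has to be written as `"node-1"` inside `synchronous_standby_names`.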
2017-08-24 07:55:02 +02:00
Alexander Kukushkin
77aea03df9 Different bugfixes around pause state, mostly related to watchdog (#507)
* Do not send keepalives if watchdog is not active
* Avoid activating watchdog in a pause mode
* Set correct postgres state in pause mode
* Don't try to run queries from API if postgres is stopped
2017-08-24 07:53:32 +02:00
Alexander Kukushkin
4faec82380 Small bugfixes (#499)
* Shortly after promote, synchronous replication was disabled even if synchronous_mode_strict is set
* Create empty pg_ident.conf if it is missing after restoring from backup
* Bump version
v1.3.3
2017-08-04 10:56:33 +02:00
francobellagamba
d374882356 Fixes #494 - Custom Bootstrap Temp hba.conf (#496)
* Fixes #494
2017-08-01 13:56:40 +02:00
Alexander Kukushkin
25aa49b240 Run one manual failover test via rest API instead of patronictl
and bump Patroni version
v1.3.2
2017-07-31 11:18:01 +02:00
Alexander Kukushkin
322aa45e09 BUGFIX: patronictl edit-config didn't work with zookeeper (#492)
When updating the config key we should use `ClusterConfig.index` instead of
`ClusterConfig.modify_index`. The second one should be used by Patroni
internally to check that the key was really changed, because when a key is
deleted and recreated its version always starts from the same value: 0

In addition to that use patronictl instead of http PATCH in some of
acceptance tests to change cluster config.

Fixes https://github.com/zalando/patroni/issues/491
2017-07-31 11:07:00 +02:00
Oleksii Kliukin
9f9acb6a55 Fix a watchdog unit test on OS X.
v1.3.1
2017-07-28 16:45:29 +02:00
Alexander Kukushkin
f8b3703d6e Bugfix: failover via API didn't work due to change in _MemberStatus (#489)
Originally fetch_nodes_statuses was returning a tuple, later it was
wrapped into namedtuple _MemberStatus and recently _MemberStatus was
extended with a watchdog_failed field, but api.py was still relying on
a usual tuple and checking failover limitations on its own instead of
calling the `failover_limitation` method.
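
The following Python sketch shows one way this kind of breakage can arise; it is not the actual Patroni code. The field names other than `watchdog_failed`, and the names `OldStatus`, `NewStatus`, `unpack_positionally` and `read_by_name`, are hypothetical and chosen only for the example.

```python
from collections import namedtuple

# Hypothetical status record before and after a field is appended.
OldStatus = namedtuple('OldStatus', 'member,reachable,wal_position')
NewStatus = namedtuple('NewStatus', OldStatus._fields + ('watchdog_failed',))


def unpack_positionally(status):
    """Treat the status as a plain 3-tuple; fails once a fourth field exists."""
    member, reachable, wal_position = status
    return member, reachable, wal_position


def read_by_name(status):
    """Access fields by name; keeps working when new fields are appended."""
    return status.member, status.reachable, status.wal_position
```

Positional unpacking raises `ValueError` on the extended record, while attribute access keeps working, which is the reason to go through a method on the record instead of re-implementing checks on the raw tuple.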
2017-07-28 15:38:55 +02:00