1511 Commits

Author SHA1 Message Date
Alexander Kukushkin
1de7c78c04 Release 1.4.4 (#683)
bump version and update release notes
v1.4.4
2018-05-22 14:46:19 +02:00
Alexander Kukushkin
041015037e Sync replication slots when we noticed a new postmaster process (#677)
Fixes: https://github.com/zalando/patroni/issues/674
2018-05-18 16:32:06 +02:00
Alexander Kukushkin
856552bd61 Sync replication slots and verify sysid after coming out of pause (#678)
Fixes https://github.com/zalando/patroni/issues/568
and https://github.com/zalando/patroni/issues/674
2018-05-18 12:18:49 +02:00
Oleksii Kliukin
4ce539ba1b Allow options to the basebackup built-in method. (#604)
Options should be specified in the basebackup section, which is optional.
2018-05-18 12:18:35 +02:00
Oleksii Kliukin
1043376e6b Do not exit when encountering invalid system ID. (#669)
Do not exit when the cluster system ID is empty or fails the validation check. In that case, the cluster most likely needs a reinit; mention it in the result message.
Avoid terminating Patroni, as otherwise reinit cannot happen.
2018-05-18 11:48:15 +02:00
Alexander Kukushkin
ed479fe585 Don't demote master if failed to update leader key in pause (#668)
Fixes https://github.com/zalando/patroni/issues/659
2018-05-18 11:19:56 +02:00
Alexander Kukushkin
5ce18a8045 Improve protection of DCS being accidentally wiped (#680)
We already have a lot of logic in place to prevent failover in such a case and to restore all keys, but an accidental removal of the `/config` key effectively switched off pause mode for one cycle of the HA loop.
2018-05-18 11:18:58 +02:00
Alexander Kukushkin
5296336f4a BUGFIX: postmaster start can fail if pid from postmaster.pid is alive (#681)
On startup, the postmaster process performs various safety checks if there is a postmaster.pid file in the data directory. Although Patroni has already detected that the running process corresponding to the PID in postmaster.pid is not a postmaster, the new postmaster might fail to start because it considers postmaster.pid to be already locked.
Important: unlinking postmaster.pid is not an option in this case, because it introduces a lot of nasty race conditions.
Luckily, there is a workaround: we can pass the PID from postmaster.pid in the `PG_GRANDPARENT_PID` environment variable, and postmaster will ignore it.

You are more likely to hit this problem if you run Patroni and postgres in a Docker container.
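A minimal Python sketch of the workaround, assuming (as PostgreSQL does) that the first line of postmaster.pid holds the PID of the process that wrote it. The helper name `build_postmaster_env` is hypothetical and not part of Patroni; it only prepares the environment that would be passed to the postgres child process.

```python
import os


def build_postmaster_env(data_dir, base_env=None):
    """Hypothetical helper: environment for starting postgres over a stale pid file."""
    env = dict(os.environ if base_env is None else base_env)
    pid_file = os.path.join(data_dir, 'postmaster.pid')
    try:
        with open(pid_file) as f:
            first_line = f.readline().strip()  # first line: PID of the lock owner
    except OSError:
        return env  # no postmaster.pid: nothing to work around
    if first_line.isdigit():
        # postmaster treats this PID as its "grandparent" and does not
        # consider the lock file held by it
        env['PG_GRANDPARENT_PID'] = first_line
    return env
```

The result would then be passed as `env=` to something like `subprocess.Popen(['postgres', '-D', data_dir], ...)`, leaving postmaster.pid itself untouched.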
2018-05-18 11:18:27 +02:00
Cody Coons
3eeb4ed979 Added check for empty subsets (#670)
On Kubernetes 1.10.0 I experienced an issue where calls to `patch_or_create` were failing when bootstrapping a cluster. The call was failing because `self._leader_observed_subsets` was `None` instead of `[]`.
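The fix boils down to normalising `None` to an empty list before comparing. The sketch below uses a hypothetical helper, `subsets_changed`, to show the idea; it is not the code from the commit.

```python
def subsets_changed(observed_subsets, desired_subsets):
    """Hypothetical helper: True if the Endpoints subsets need to be patched.

    None (e.g. an Endpoints object observed without subsets) is treated
    exactly like an empty list, so comparing never fails on None.
    """
    observed = observed_subsets or []
    desired = desired_subsets or []
    if len(observed) != len(desired):
        return True
    return any(o != d for o, d in zip(observed, desired))
```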
2018-04-26 16:38:19 +02:00
Alexander Kukushkin
84f29caf92 Fix race condition in poll_failover_result (#658)
It didn't directly affect either failover or switchover, but in some rare cases it reported success too early, when the former leader released the lock: `Failed over to "None" instead of "desired-node"`

In addition to that this commit improves logs and status messages by differentiating between failover and switchover.
2018-04-16 17:45:05 +02:00
Alexander Kukushkin
d78790b194 Abort start if attaching to running postgres and cluster not initialized (#661)
Patroni can attach itself to an already running PostgreSQL instance. If that is the first instance "seen" in the given cluster, Patroni for that instance will create the initialize key, grab the leader key and, if the instance is running as a replica, promote it.

Because of this behavior, when Patroni is added to every node of an existing cluster with a master and one or more replicas, it is imperative to start Patroni on the master node before the replicas.

This commit changes this behavior: Patroni will abort start-up if there is no initialize key in DCS and postgres is running as a replica.
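The start-up rule can be sketched as a single predicate. `may_attach` and its parameters are hypothetical names used only for illustration, not Patroni's API.

```python
def may_attach(initialize_key, postgres_running, in_recovery):
    """Hypothetical check: False means Patroni should abort start-up.

    initialize_key is the value of the cluster's initialize key in DCS,
    or None if the key does not exist yet.
    """
    if initialize_key is None and postgres_running and in_recovery:
        # Attaching here would let this replica grab the leader key and promote.
        return False
    return True
```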

Closes https://github.com/zalando/patroni/issues/655
2018-04-16 17:32:26 +02:00
Kostiantyn Nemchenko
3110090154 Minor corrections to the documentation. (#654) 2018-04-16 15:46:46 +02:00
Reinhard Nägele
20138af37a Link to official Helm chart (#660)
Changes the link from my outdated fork to the official Helm chart which is now up to date.
2018-04-16 15:45:53 +02:00
Dave Cramer
38ad394308 Use the word primary in favour of master (#663)
Primary is a better alternative.
2018-04-16 01:29:51 +02:00
Alexander Kukushkin
e375fac273 Treat postgres settings parameter names as case insensitive (#650)
Because they are indeed case insensitive.
Most of the parameters have snake_case names, but there are three exceptions to this rule: DateStyle, IntervalStyle and TimeZone.
In fact, if you specify timezone = 'some/tzn' it still works, but Patroni wasn't able to find 'timezone' in pg_settings and stripped this parameter out.

We will use CaseInsensitiveDict to store postgresql.parameters. This change affects only the "final" configuration. That means if you put some "duplicates" (work_mem vs WORK_MEM) into the Patroni YAML or into the cluster config, they will be resolved only at the last stage; for example, you will still see both values if you use `patronictl edit-config`.
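A minimal sketch of a case-insensitive mapping built on `collections.abc.MutableMapping`. It illustrates the technique only; Patroni's own `CaseInsensitiveDict` (or the similar `requests.structures.CaseInsensitiveDict`) may differ in details such as which spelling of a duplicated key is kept.

```python
from collections.abc import MutableMapping


class CaseInsensitiveDict(MutableMapping):
    """Sketch: keys compare case-insensitively; the last-written spelling is kept."""

    def __init__(self, data=None):
        self._values = {}  # lower-cased key -> (original key, value)
        self.update(data or {})

    def __setitem__(self, key, value):
        self._values[key.lower()] = (key, value)

    def __getitem__(self, key):
        return self._values[key.lower()][1]

    def __delitem__(self, key):
        del self._values[key.lower()]

    def __iter__(self):
        return (original for original, _ in self._values.values())

    def __len__(self):
        return len(self._values)
```

With this, `TimeZone` and `timezone` resolve to the same entry, and duplicates such as `work_mem`/`WORK_MEM` collapse into one key, with the later value winning.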

Fixes https://github.com/zalando/patroni/issues/649
2018-04-04 14:23:53 +02:00
Alexander Kukushkin
8c795ff0cf Pass dict object to touch_member instead of json encoded string (#651)
The DCS implementation will take care of encoding it.
Fixes https://github.com/zalando/patroni/issues/642
2018-04-04 13:45:44 +02:00
Don Seiler
140618abd2 Missing a word (#647)
In re Issue #639
2018-04-04 13:40:46 +02:00
Josh Berkus
3c05e2e984 Added references to the Slack channel in Readme and in contributing.rst. (#653) 2018-04-04 13:39:43 +02:00
bradnicholson
ca679a93b8 Make deleting recovery.conf optional. (#638)
pgBackRest's restore command generates the appropriate recovery.conf based
on the parameters you provide to pgBackRest. When pgBackRest's restore command is called
via Patroni's custom bootstrap, Patroni deletes that recovery.conf. Specifying the recovery.conf
information in patroni.yml is less than ideal. It prevents leveraging pgBackRest's
work to ensure recovery.conf files are properly generated. It can also lead to transient
config data in patroni.yml in certain restore cases, such as a PITR restore
of Cluster B to Cluster A, where the restore_command in A needs to reference B.

The parameter is optional. The default behavior is to delete the recovery.conf.

Fixes https://github.com/zalando/patroni/issues/637
2018-03-09 15:35:29 +01:00
Alexander Kukushkin
f500dbb0ff Release 1.4.3 (#635)
Bump version and update release notes
v1.4.3
2018-03-05 10:10:17 +01:00
Andy Newton
f748de3b29 Make log level configurable from environment variables (#622)
* `PATRONI_LOGLEVEL` - sets the general logging level
* `PATRONI_REQUESTS_LOGLEVEL` - sets the logging level for all HTTP requests e.g. Kubernetes API calls
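A sketch of how these two variables could be applied with the standard `logging` module. The defaults (`INFO`, `WARNING`) and the use of the `urllib3` logger for HTTP requests are assumptions made for illustration; Patroni's actual defaults and logger names may differ.

```python
import logging
import os


def configure_logging(environ=None):
    """Sketch: apply PATRONI_LOGLEVEL and PATRONI_REQUESTS_LOGLEVEL.

    Assumes HTTP request logging goes through the 'urllib3' logger
    (requests is built on urllib3); defaults here are assumptions.
    """
    env = os.environ if environ is None else environ
    level = env.get('PATRONI_LOGLEVEL', 'INFO').upper()
    requests_level = env.get('PATRONI_REQUESTS_LOGLEVEL', 'WARNING').upper()
    logging.getLogger().setLevel(level)  # root logger: general level
    logging.getLogger('urllib3').setLevel(requests_level)
    return level, requests_level
```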
2018-03-05 09:50:45 +01:00
Alexander Kukushkin
3afd26101b Single user mode was waiting for user input and never finished (#634)
Regression was introduced in https://github.com/zalando/patroni/pull/576
2018-03-02 22:22:43 +01:00
Alexander Kukushkin
c04e7a1798 Write bootstrap.pg_hba into a pg_hba.conf after custom bootstrap (#632)
Fixes https://github.com/zalando/patroni/issues/631
2018-02-26 18:48:56 +01:00
Alexander Kukushkin
89a11fed07 Don't rediscover etcd cluster topology when watch timed out (#630)
but switch to the next node if possible.

Fixes https://github.com/zalando/patroni/issues/628
2018-02-26 18:48:30 +01:00
Alexander Kukushkin
c95dd990cc Release 1.4.2 (#619)
* Bump version to 1.4.2
* Update release notes
v1.4.2
2018-01-30 16:44:42 +01:00
Alexander Kukushkin
dd1500b4dc Handle exceptions raised from psutil (#610)
Process.cmdline can raise `NoSuchProcess` or `AccessDenied`
Process.children can raise `AccessDenied`

Fixes https://github.com/zalando/patroni/issues/609
2018-01-30 16:26:39 +01:00
Alexander Kukushkin
0db788bb95 Make show-config work with cluster_name from config file (#618)
similar to edit-config, list and so on
2018-01-30 14:24:54 +01:00
Maciej Szulik
a4aaf53212 Add proper rbac to run patroni on k8s (#616)
Adds 3 resources that will properly setup the RBAC:

1. a service account, which is also assigned to the pods of the cluster, so that they use those particular permissions
2. a role, which holds only the permissions that Patroni members need to interact with the k8s cluster
3. a rolebinding, which binds the two former resources together.

The role and rolebinding were created using this tool https://github.com/liggitt/audit2rbac, which looks at [audit logs](https://kubernetes.io/docs/tasks/debug-application-cluster/audit/#advanced-audit) provided by the API server.
2018-01-30 12:00:49 +01:00
Maciej Szulik
3d293ac087 Change the pip definition in Dockerfile to use master now (#617) 2018-01-30 10:58:08 +01:00
Alexander Kukushkin
a0c8491abb Don't swallow silently all errors from k8s API (#611)
Output the exception trace to the logs when the HTTP status code is 403, because that means something is wrong with permissions.

When the HTTP status code is 409, the error can be ignored, because the object was probably created or updated by another process.

For all other HTTP status codes it will also produce stack traces.
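The status-code policy above can be sketched as follows. `handle_k8s_api_error` is a hypothetical helper, not Patroni's code; it only shows which errors are ignored and which are logged with a traceback.

```python
import logging

logger = logging.getLogger(__name__)


def handle_k8s_api_error(exc, status):
    """Hypothetical sketch: returns True if the error can be safely ignored.

    `status` is the HTTP status code of the failed Kubernetes API call.
    """
    if status == 409:
        # Conflict: the object was probably created/updated by another process.
        logger.debug('Ignoring conflict from the Kubernetes API: %s', exc)
        return True
    if status == 403:
        logger.error('Permission denied by the Kubernetes API, check RBAC', exc_info=exc)
    else:
        logger.error('Unexpected Kubernetes API error (status %s)', status, exc_info=exc)
    return False
```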

I hope it will help to debug issues similar to the https://github.com/zalando/patroni/issues/606
2018-01-26 09:57:17 +01:00
Alexander Kukushkin
f8b6b21297 Avoid calling pg_controldata during bootstrap (#612)
Fixes https://github.com/zalando/patroni/issues/608
2018-01-26 09:56:17 +01:00
Alexander Kukushkin
9825b8a584 Improve patronictl list UX (#613)
* rename scheduled failover to scheduled switchover
* show information about pending_restarts
2018-01-26 09:55:54 +01:00
Alexander Kukushkin
8da05ad785 Update haproxy.tmpl (#614)
The connection to postgres should be closed forcibly when the health check fails.
2018-01-25 15:31:02 +01:00
Alexander Kukushkin
a1e5c8e1cb A few improvements in patronictl (#601)
* make switchover work with an old patroni
* exclude leader from candidates when interactively running failover
v1.4.1
2018-01-17 15:33:08 +01:00
Oleksii Kliukin
4202ad853a Minor corrections to the documentation. (#599) 2018-01-10 16:10:12 +01:00
Alexander Kukushkin
93ac309b38 Fix link to the Kubernetes documentation (#598)
blog => blob
2018-01-10 13:19:23 +01:00
Oleksii Kliukin
84d804e579 Release notes 1.4 (#597)
Document Kubernetes parameters and environment variables. Describe how Patroni uses Kubernetes.
v1.4
2018-01-10 11:17:08 +01:00
Alexander Kukushkin
d1312a7ce4 Do not try to load history file when timeline=1 (#596)
00000001.history doesn't exist
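PostgreSQL names timeline history files after the timeline ID written as eight uppercase hexadecimal digits, and timeline 1 never has one. A small sketch (the helper name is hypothetical):

```python
def history_file_name(timeline):
    """Hypothetical helper: history file name for a timeline, or None for timeline 1."""
    if timeline <= 1:
        return None  # the first timeline has no predecessor, hence no history file
    return '%08X.history' % timeline
```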
2018-01-09 12:01:14 +01:00
Oleksii Kliukin
d14d9f669a Document pip-related installation options. (#595)
* Remove redundant requirements of Mac OS.

* Clarify how to run the example in getting started.
2018-01-08 13:59:31 +01:00
Alexander Kukushkin
5668367181 Implement `/sync` and `/async` endpoints (#578)
They will respond with http status code 200 only when the node is running as a synchronous or asynchronous replica.
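A sketch of the decision these endpoints make. The function name and the use of 503 for a failed check are assumptions for illustration; the commit only states that 200 is returned when the node matches.

```python
def replica_check_status(path, role, is_sync):
    """Hypothetical sketch of the /sync and /async checks.

    role is 'master' or 'replica'; is_sync tells whether this replica is
    currently synchronous. Failed checks return 503 here (an assumption).
    """
    if role != 'replica':
        return 503
    if path == '/sync':
        return 200 if is_sync else 503
    if path == '/async':
        return 503 if is_sync else 200
    raise ValueError('unsupported path: %s' % path)
```

A load balancer health check could, for example, probe `/sync` to route read-only traffic only to synchronous replicas.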

Fixes https://github.com/zalando/patroni/issues/189
Fixes https://github.com/zalando/patroni/issues/415
2018-01-05 15:28:40 +01:00
Alexander Kukushkin
03c2a85d23 Expose current timeline in DCS and via API (#591)
It is very easy to get current timeline on the master by executing
```sql
SELECT ('x' || SUBSTR(pg_walfile_name(pg_current_wal_lsn()), 1, 8))::bit(32)::int
```
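The SQL above relies on the fact that the first eight hex digits of a 24-character WAL segment file name are the timeline ID. The same extraction in Python, as a sketch with a hypothetical function name:

```python
def timeline_from_walfile(walfile_name):
    """Extract the timeline ID from a WAL segment file name (same idea as the SQL above)."""
    if len(walfile_name) != 24:
        raise ValueError('not a WAL segment file name: %r' % walfile_name)
    return int(walfile_name[:8], 16)  # first 8 hex digits = timeline
```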

Unfortunately, the same method doesn't work when postgres is in recovery. Therefore, on the replicas we will use a replication connection for that. To avoid opening and closing a replication connection on every HA loop, we will cache the result if its value matches the timeline of the master.

Also, this PR introduces a new key in DCS: `/history`. It will contain a JSON-serialized object with the timeline history in a format similar to the usual history files. The differences are:
* The second column is the absolute WAL position in bytes instead of an LSN
* Optionally, there might be a fourth column: a timestamp (the mtime of the history file)
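A sketch of converting history-file lines into rows of this shape. It assumes the usual tab-separated history file layout (`<timeline>\t<LSN>\t<reason>`) and converts the textual LSN `X/Y` into an absolute byte position as `(X << 32) + Y`. The optional timestamp column is omitted, and the function names are hypothetical.

```python
import json


def lsn_to_bytes(lsn):
    """Convert a textual LSN such as '0/3000060' to an absolute byte position."""
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) + int(lo, 16)


def history_entry(line):
    """One history file line -> [timeline, absolute position in bytes, reason]."""
    tli, lsn, reason = line.split('\t', 2)
    return [int(tli), lsn_to_bytes(lsn), reason.strip()]


def history_to_json(text):
    """Serialize the non-empty lines of a history file as a JSON list of rows."""
    return json.dumps([history_entry(line) for line in text.splitlines() if line.strip()])
```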
2018-01-05 15:25:56 +01:00
Alexander Kukushkin
18786464a1 Rename failover to switchover and make new failover work without leader (#588)
In addition, implement the /switchover endpoint as an alias to the /failover endpoint, and implement more checks like:
* candidate must be provided for a failover
* switchover can't be scheduled in a pause state
* and so on
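A sketch of the kind of validation described above. `validate_request` is hypothetical; the rule that a switchover needs a current leader is an assumption inferred from "make new failover work without leader", and the real endpoint performs more checks than shown here.

```python
def validate_request(action, leader, candidate, paused, scheduled_at=None):
    """Hypothetical sketch: return an error message, or None if the request looks valid."""
    if action == 'failover':
        if not candidate:
            return 'failover is not possible: no candidate specified'
    elif action == 'switchover':
        if not leader:
            # assumption: a switchover needs a current leader
            return 'switchover is not possible: cluster has no leader'
        if scheduled_at and paused:
            return "can't schedule switchover in the paused state"
    else:
        return 'unknown action: %s' % action
    return None
```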

Fixes https://github.com/zalando/patroni/issues/585
Fixes https://github.com/zalando/patroni/issues/520
2018-01-05 15:17:56 +01:00
Alexander Kukushkin
3a96ffa718 Expose pause state of every member to DCS and via REST (#592)
and implement patronictl pause|resume --wait on top of that

Fixes https://github.com/zalando/patroni/issues/349
2018-01-05 15:16:45 +01:00
Alexander Kukushkin
6b01d2787f More improvements in patronictl (#590)
Make specifying cluster_name optional for some more commands.
If it is not specified, its value will be taken from the config file.
2018-01-04 12:26:13 +01:00
Alexander Kukushkin
2b8618b027 Minimize amount of SELECTS issued by Patroni on every loop (#584)
On every iteration of the HA loop Patroni needs to call pg_is_in_recovery() and calculate the absolute wal_position. It was issuing two separate SELECT statements for that. On the master it was issuing even three queries (wal_position twice).
We will issue one SELECT per HA loop and cache the results.
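A sketch of the per-loop caching idea. `query` stands for any callable that runs the single combined SELECT and returns `(in_recovery, wal_position)`; both the class and that callable are assumptions for illustration, not Patroni's implementation.

```python
class LoopStateCache:
    """Hypothetical sketch: run one status query per HA loop and reuse the result."""

    def __init__(self, query):
        self._query = query  # callable returning (in_recovery, wal_position)
        self._result = None

    def reset(self):
        # Called at the start of every HA loop iteration.
        self._result = None

    def _get(self):
        if self._result is None:
            self._result = self._query()
        return self._result

    def is_in_recovery(self):
        return self._get()[0]

    def wal_position(self):
        return self._get()[1]
```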
2018-01-04 11:17:43 +01:00
Ants Aasma
15d1767402 Some improvements to patronictl (#571)
* Use scope from config file when listing members

* Add version command to patronictl

* Only delete leader on shutdown when we have the lock to avoid exceptions when leader key does not exist

* Add a timestamp option to list command.

* YAML format for patronictl output

* Fix API request to get version
2018-01-04 10:35:22 +01:00
Alexander Kukushkin
0e01bb33bb Improve patronictl reinit (#576)
Make it possible to cancel a running task if you want to reinitialize a replica.
There are two possible ways to trigger it:
1. patronictl will ask whether you want to cancel the already running task if an attempt to trigger reinitialization has failed
2. if you are using `--force` argument with `patronictl reinit`
2018-01-04 10:31:44 +01:00
Alexander Kukushkin
b6425cab85 Allow to specify multiple hosts for etcd (#589)
This list will be used for the initial discovery of etcd cluster members.
If this list of hosts is exhausted for some reason at runtime, Patroni will return to the initial list.

In addition, improve IPv6 compatibility by using a special function for splitting host and port.
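A sketch of splitting `host:port` while handling bracketed IPv6 literals such as `[::1]:2379`. The function name is hypothetical and may not match the helper Patroni added; an unbracketed value with more than one colon is treated as a bare IPv6 address without a port.

```python
def split_host_port(value, default_port):
    """Hypothetical helper: split 'host:port', supporting '[ipv6]:port'."""
    if value.startswith('['):
        host, _, rest = value[1:].partition(']')
        port = rest[1:] if rest.startswith(':') else ''
    elif value.count(':') == 1:
        host, port = value.split(':')
    else:
        # bare hostname, IPv4 address, or unbracketed IPv6 address without a port
        host, port = value, ''
    return host, int(port) if port else default_port
```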

Fixes https://github.com/zalando/patroni/issues/523
2018-01-04 10:25:06 +01:00
Alexander Kukushkin
84de53603f Update travis settings (#581)
* Add master branch and release tags to safelist
* Update build matrix: don't install python3.5 if running acceptance tests
2017-12-20 16:28:09 +01:00
Alexander Kukushkin
062c55f99c Update readthedocs config (#580)
* Get Patroni version from patroni/version.py
* Update copyright to match with the LICENSE file

Fixes https://github.com/zalando/patroni/issues/519
2017-12-20 14:28:12 +01:00