553 Commits

Author SHA1 Message Date
Julien Riou
663026c34c Use SSLContext to wrap REST API socket (#1039)
Using `ssl.wrap_socket` is deprecated and was still allowing soon-to-be-deprecated protocols like TLS 1.1.
Now using `SSLContext.create_default_context()` to produce a secure SSL context to wrap the REST API server's socket.
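
A minimal sketch of the approach, not the code from this commit. In the Python standard library the helper is the module-level `ssl.create_default_context()`; the function name `build_server_context` and the certificate paths are hypothetical. Which protocol versions end up disabled depends on the Python and OpenSSL versions in use.

```python
import ssl


def build_server_context(certfile=None, keyfile=None):
    """Return an SSLContext suitable for wrapping a server-side socket."""
    # Purpose.CLIENT_AUTH configures the context for the server side of a connection.
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    if certfile:
        # Load the server certificate and private key (paths are hypothetical).
        context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return context


# Typical use with an HTTP server object `httpd` (assumed to exist):
# context = build_server_context("server.crt", "server.key")
# httpd.socket = context.wrap_socket(httpd.socket, server_side=True)
```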
2019-04-23 11:23:22 +02:00
Alexander Kukushkin
51b085a76d Don't wait for the previous callback to finish if kill failed (#1036)
The wait was happening in the main thread and blocking the HA loop.
The executor thread was already doing exactly the same wait.
2019-04-15 15:49:06 +02:00
Alexander Kukushkin
7c0c9599fc Remove psycopg2 from requirements (#1023)
The recently released psycopg2 was split into two different packages, psycopg2 and psycopg2-binary, which can be installed at the same time into the same place on the filesystem. To reduce dependency-hell problems, we let the user choose how to install psycopg2. There are a few options available, and they are reflected in the documentation.

This PR also changes the following behavior:
* `pip install patroni` will fail if psycopg2 is not installed
* Patroni will check psycopg2 upon start and fail if it can't be found or is outdated.
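
A startup check along these lines could look roughly as follows. This is a sketch under assumptions: the minimum version `MIN_PSYCOPG2` and the function names are made up for illustration, and the real check in Patroni may differ.

```python
import sys

MIN_PSYCOPG2 = (2, 5, 4)  # assumed minimum, for illustration only


def parse_version(version_string):
    """Turn a string such as '2.8.6 (dt dec pq3 ext lo64)' into (2, 8, 6)."""
    numeric = version_string.split()[0]
    return tuple(int(part) for part in numeric.split(".") if part.isdigit())


def check_psycopg2(minimum=MIN_PSYCOPG2):
    """Exit with a readable message if psycopg2 is missing or too old."""
    try:
        import psycopg2
    except ImportError:
        sys.exit("psycopg2 is not installed; install psycopg2 or psycopg2-binary")
    if parse_version(psycopg2.__version__) < minimum:
        sys.exit("psycopg2 %s is too old" % psycopg2.__version__)
```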

Closes https://github.com/zalando/patroni/issues/1021
2019-04-15 14:30:16 +02:00
Pavlo Golub
b53a29c022 Fix unit-tests for Windows (#1014)
Closes #1013
2019-04-02 13:58:17 +02:00
Alexander Kukushkin
e38fe78b56 Fix callbacks behavior (mostly for standby cluster) (#998)
First of all, this patch changes the behavior of the `on_start`/`on_restart` callbacks: they will be called only when postgres is started or restarted without a role change. If the member is promoted or demoted, only the `on_role_change` callback will be executed.
Before that, `on_role_change` was never called for the standby leader; only `on_start`/`on_restart` were called, and with a wrong role argument.

In addition to that, the REST API will return standby_leader role for the leader of the standby cluster.
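
The callback rule described above can be sketched as a small decision function. The function name and the role strings are assumptions for illustration; this is not Patroni's implementation.

```python
def callbacks_to_run(action, old_role, new_role):
    """Pick the callbacks for a state change, following the rule described above."""
    if old_role != new_role:
        # Promotion or demotion: only the role-change callback fires.
        return ["on_role_change"]
    if action in ("start", "restart"):
        # Plain start/restart without a role change.
        return ["on_" + action]
    return []
```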

Closes https://github.com/zalando/patroni/issues/988
2019-03-29 10:28:07 +01:00
Alexander Kukushkin
680444ae13 Reduce lock time taken by dcs.get_cluster() (#989)
`dcs.cluster` and `dcs.get_cluster()` use the same lock, so when a `get_cluster()` call was slow because of a slow DCS, it also slowed down `dcs.cluster` calls, which in turn made health-check requests slow.
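
One common way to shorten such a lock is to perform the slow read outside the lock and hold the lock only while swapping the cached reference. The sketch below illustrates that general technique; the class `CachedCluster` and the `fetch` callable are hypothetical and do not claim to mirror the actual change.

```python
import threading


class CachedCluster:
    """Caches the last cluster view; `fetch` stands in for a slow DCS read."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._lock = threading.Lock()
        self._cluster = None

    @property
    def cluster(self):
        # Readers hold the lock only long enough to read a reference.
        with self._lock:
            return self._cluster

    def get_cluster(self):
        # The slow call runs outside the lock, so readers are not blocked by it.
        fresh = self._fetch()
        with self._lock:
            self._cluster = fresh
        return fresh
```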
2019-03-12 22:37:11 +01:00
Alexander Kukushkin
92720882aa Reset is_leader flag for every removal of leader key (#990)
This is a follow-up improvement to #777
2019-03-12 22:10:46 +01:00
Alexander Kukushkin
13c88e8b7a Replace self-execute with multiprocessing.Process (#994)
In addition, transfer the postmaster pid to the Patroni process via `multiprocessing.Pipe` instead of stdin/stdout pipes.

Closes https://github.com/zalando/patroni/issues/992
2019-03-12 10:40:37 +01:00
Alexander Kukushkin
4a4258fc3f Mock external resources (#995)
Unit tests should not accidentally hit a running Postgres, DCS or the filesystem unless we explicitly want them to.
2019-03-12 10:39:42 +01:00
Alexander Kukushkin
c64d51f79c Better support for static etcd cluster (#986)
If `etcd.use_proxies` is set to true, Patroni will stick to the list of hosts specified in `etcd.hosts` and avoid doing topology discovery. This mode might be useful when you know that you connect to the etcd cluster via a set of proxies or when the etcd cluster has a static topology.
2019-03-07 11:36:36 +01:00
Alexander Kukushkin
e670122c80 Use create_replica_methods from standby_cluster for replica bootstrap (#981)
A standby cluster may be configured to be created from, and to replay WAL files from, a different source than the one used when it is not running in standby mode. This is necessary to avoid writing WAL files and backups into the old place after promotion.

The easiest way to achieve such behavior is passing RemoteMember object to `Postgresql.clone` method instead of the usual Member object.
2019-02-21 11:37:50 +01:00
Alexander Kukushkin
0c516de147 Create headless service associated with $SCOPE-config endpoint (#958)
If there is no service defined, k8s assumes that the endpoint is orphaned and removes it.
Patroni tries to create the service only if `use_endpoints` is enabled, in the following cases:
1. Upon start
2. When it tries to (re-)create the config endpoint

If for some reason creation of the service has failed, Patroni will retry it on every cycle of HA loop. Usually it fails due to lack of permissions and if you don't want to give such permissions to the service account used by Patroni, you can create the service explicitly in the deployment manifest.
2019-02-15 13:35:04 +01:00
Alexander Kukushkin
739329b590 Make it possible to automatically reinit the former master (#948)
If pg_rewind is disabled or can't be used, the former master could fail to start as a new replica due to diverged timelines. In this case, the only way to fix it is to wipe the data directory and reinitialize.

So far Patroni was able to remove the data directory only after a failed attempt to run pg_rewind. This commit fixes it.
If `postgresql.remove_data_directory_on_diverged_timelines` is set, Patroni will wipe the data directory and reinitialize the former master automatically.

Fixes: https://github.com/zalando/patroni/issues/941
2019-01-30 12:37:21 +01:00
Alexander Kukushkin
2c128520cf Python34 compatibility (#933)
and some other minor fixes.

Closes https://github.com/zalando/patroni/issues/932
2019-01-16 14:40:05 +01:00
Alexander Kukushkin
381a5b80d2 Release 1.5.4 (#931)
* Bump version
* Update release notes
* Make it possible to configure registration of Service in Consul via env variables
2019-01-15 12:14:19 +01:00
Alexander Kukushkin
71dae6a905 Optionally consider node not healthy if it is not on the latest timeline (#892)
The latest timeline is calculated from the `/history` key in DCS. In case there is no such key or it contains some garbage we consider the node healthy.
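
A rough sketch of such a check. It assumes that the `/history` value is a JSON list whose entries start with the timeline number that ended, so that the current timeline is the last recorded one plus one; that layout and both function names are assumptions for illustration.

```python
import json


def latest_timeline(history_value):
    """Return the newest timeline from a raw /history value, or None if unknown."""
    try:
        lines = json.loads(history_value)
        # Assumption: each entry starts with the timeline that ended, so the
        # current timeline is the last recorded one plus one.
        return int(lines[-1][0]) + 1
    except (TypeError, ValueError, IndexError, KeyError):
        return None


def is_on_latest_timeline(history_value, node_timeline):
    latest = latest_timeline(history_value)
    # Missing or garbled history: treat the node as healthy.
    return latest is None or node_timeline >= latest
```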
Closes https://github.com/zalando/patroni/issues/890
2019-01-15 11:16:30 +01:00
Alexander Kukushkin
e080ded44b Make logging configurable via YAML file (#927)
It allows changing logging settings at runtime by updating the config and doing a reload or sending `SIGHUP` to the Patroni process.
Important! Environment configuration names related to logging were renamed and the documentation was updated accordingly. For compatibility reasons Patroni still accepts `PATRONI_LOGLEVEL` and `PATRONI_FORMAT`, but some other variables related to logging, which were introduced only recently (between releases), will stop working. I think it is ok, since we didn't release the new version yet and therefore it is very unlikely that somebody is using them except the authors of the corresponding PRs.

Example of log section in the config file:
```yaml
log:
  dir: /where/to/write/patroni/logs  # if not specified, write logs to stderr
  file_size: 50000000  # 50MB
  file_num: 10  # keep history of 10 files
  dateformat: '%Y-%m-%d %H:%M:%S'
  loggers:  # increase log verbosity for etcd.client and urllib3
    etcd.client: DEBUG
    urllib3: DEBUG
```
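
For readers unfamiliar with rotating log files, the settings above map naturally onto the standard library's `RotatingFileHandler`. The sketch below shows one possible mapping (`file_size` to `maxBytes`, `file_num` to `backupCount`); the function name, the log file name, the message format and the fallback values are assumptions, not Patroni's actual code.

```python
import logging
import os
from logging.handlers import RotatingFileHandler


def apply_log_config(log_config, logger=None):
    """Configure `logger` (root by default) from a dict shaped like the `log` section."""
    if logger is None:
        logger = logging.getLogger()
    formatter = logging.Formatter(
        "%(asctime)s %(levelname)s: %(message)s",
        datefmt=log_config.get("dateformat"),
    )
    if log_config.get("dir"):
        handler = RotatingFileHandler(
            os.path.join(log_config["dir"], "patroni.log"),  # file name is an assumption
            maxBytes=int(log_config.get("file_size", 25000000)),  # placeholder default
            backupCount=int(log_config.get("file_num", 4)),  # placeholder default
        )
    else:
        handler = logging.StreamHandler()  # defaults to stderr
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    # Per-logger overrides, e.g. {"urllib3": "DEBUG"}.
    for name, level in (log_config.get("loggers") or {}).items():
        logging.getLogger(name).setLevel(level)
    return handler
```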
2019-01-15 08:42:13 +01:00
Alexander Kukushkin
994863c18d Refactor wait_for_user_backends_to_close method (#917)
1. Log only debug level messages on any kind of error
2. Update regexp for matching postgres aux processes to make it compatible with postgres 11

Fixes https://github.com/zalando/patroni/issues/914
2019-01-14 14:55:45 +01:00
Dmitry Dolgov
11f7ceb521 Do not check types of standby_cluster configuration (#924)
Simply allow valid keys
2019-01-14 14:16:15 +01:00
Alexander Kukushkin
f1d7ccf36e Make sure we refresh session at least once per HA loop (#880)
Fixes https://github.com/zalando/patroni/issues/879
2018-12-03 16:35:14 +01:00
Alexander Kukushkin
9bf074acfb Compatibility with python3 (#883)
Change of `loop_wait` was causing Patroni to disconnect from zookeeper and never reconnect back. The error was happening only with python3 due to a difference in implementation of `select.select` function.
2018-11-30 11:40:34 +01:00
Alexander Kukushkin
fb01aaebc5 Compatibility with kazoo-2.6.0 (#872)
kazoo 2.6.0 was released recently and changes how the `create_connection` method is called. Before, two positional arguments were passed; in the new version all arguments are passed by name.
2018-11-19 14:26:20 +01:00
Alexander Kukushkin
0f666e69f3 Prefix system tables, views and functions with pg_catalog (#845)
and implement missing unit tests
2018-11-01 16:17:40 +01:00
Alexander Kukushkin
2efd97baab Permanent replication slots (#819)
Permanent replication slots are preserved on failover/switchover; that is, Patroni on the new primary will create the configured replication slots right after promotion.

Slots can be configured with the help of `patronictl edit-config`.
The initial configuration can also be done in `bootstrap.dcs`:

```yaml
slots:
  permanent_physical_1:
    type: physical
  permanent_logical_1:
    type: logical
    database: foo
    plugin: pgoutput
```

It is the responsibility of the operator to make sure that there are no clashes in names between replication slots automatically created by Patroni for members and permanent replication slots.
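
As an illustration of what creating such slots after promotion involves, the sketch below turns the `slots` mapping into calls to PostgreSQL's `pg_create_physical_replication_slot` and `pg_create_logical_replication_slot`, and rejects names that clash with member slots. The function name and the return shape are hypothetical; it is not the code from this commit.

```python
def slot_statements(slots, member_slot_names=()):
    """Build (database, sql, params) tuples that create the configured slots.

    `database` is None for physical slots; a logical slot has to be created
    from a connection to the database named in its configuration.
    """
    statements = []
    for name, spec in sorted(slots.items()):
        if name in member_slot_names:
            raise ValueError("slot %r clashes with a member slot" % name)
        if spec.get("type") == "physical":
            sql = "SELECT pg_catalog.pg_create_physical_replication_slot(%s)"
            statements.append((None, sql, (name,)))
        elif spec.get("type") == "logical":
            sql = "SELECT pg_catalog.pg_create_logical_replication_slot(%s, %s)"
            statements.append((spec["database"], sql, (name, spec["plugin"])))
    return statements
```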

Closes https://github.com/zalando/patroni/issues/656
2018-10-31 11:37:42 +01:00
Alexander Kukushkin
f70edefc65 A few bugfixes in the "standby cluster" workflow (#823)
* Always run `pg_rewind` against the remote master
* Always use the remote master as the source when "recovering" stopped standby leader
* Use remote master as the source when "recovering" the node in the unhealthy cluster
* Use the local dbname as the fallback when doing `pg_rewind` from the remote master
* `no_replication_slot` is the allowed key in the `RemoteMember` object
* Make it possible to "bootstrap" the new `standby_cluster` with existing (and valid) data directory. There is one prerequisite though, there should be no `patroni.dynamic.json` file in it!
2018-10-09 13:30:48 +02:00
Alexander Kukushkin
76d1b4cfd8 Minor fixes (#808)
* Use `shutil.move` instead of `os.replace`, which is available only from 3.3
* Introduce standby-leader health-check and consul service
* Improve unit tests, some lines were not covered
* rename `assertEquals` -> `assertEqual`, due to deprecation warning
2018-09-19 16:32:33 +02:00
Pavel Kirillov
2e9cb412e4 Register service in consul (#802)
Register service 'scope_name' with tag 'master' or 'replica'

Example with scope 'pgsql-pgpi':
```
[root@pgpi1 ~]# host -t SRV pgsql-pgpi.service.consul. 127.0.0.1
Using domain server:
Name: 127.0.0.1
Address: 127.0.0.1#53
Aliases:

pgsql-pgpi.service.consul has SRV record 1 1 5432 pgpi1.node.dc.consul.
pgsql-pgpi.service.consul has SRV record 1 1 5432 pgpi2.node.dc.consul.
[root@pgpi1 ~]# host -t SRV master.pgsql-pgpi.service.consul. 127.0.0.1
Using domain server:
Name: 127.0.0.1
Address: 127.0.0.1#53
Aliases:

master.pgsql-pgpi.service.consul has SRV record 1 1 5432 pgpi2.node.dc.consul.
[root@pgpi1 ~]# host -t SRV replica.pgsql-pgpi.service.consul. 127.0.0.1
Using domain server:
Name: 127.0.0.1
Address: 127.0.0.1#53
Aliases:

replica.pgsql-pgpi.service.consul has SRV record 1 1 5432 pgpi1.node.dc.consul.
```

Fixes: https://github.com/zalando/patroni/issues/771
2018-09-07 15:17:56 +02:00
Dmitry Dolgov
dd7c3c349f [WIP] Standby cluster implementation (#679)
Implementation of the "standby cluster" described in #657. A standby cluster consists of a "standby leader", which replicates from a "remote master" (which is not a part of the current patroni cluster and can be anywhere), and cascade replicas, which replicate from the corresponding standby leader. The "standby leader" behaves pretty much like a regular leader, which means that it holds a leader lock in DCS; if it disappears, there will be an election of a new "standby leader".
One can define such a cluster using the "standby_cluster" section in the patroni config file. This section provides parameters for the standby cluster that will be applied only once during bootstrap and can be changed only through DCS.
2018-09-07 10:10:56 +02:00
Alexander Kukushkin
4ca8a6e506 Make retries of calls to DCS consistent across implementations (#805)
In addition, do a small refactoring of zookeeper and consul and try to improve the stability of the acceptance tests.
2018-09-06 08:37:26 +02:00
wilfriedroset
0136f252ab Add patronictl -k/--insecure flag and support for restapi cert (#790)
Fixes https://github.com/zalando/patroni/issues/785
2018-08-29 16:08:13 +02:00
Alexander Kukushkin
90cf930036 Refactor REST API health-checks (#779)
Make it more readable and easy to understand.
Mostly it is needed to implement https://github.com/zalando/patroni/issues/772
2018-08-29 11:35:22 +02:00
Alexander Kukushkin
87e9aab04c Improve tests (#778)
* Implement missing unit-tests
* Add acceptance tests for ISSUE #776
* Update list of classifiers, keywords and authors
2018-08-29 11:29:37 +02:00
Alexander Kukushkin
0c1ae6fbeb Respond 200 to master health-check only if update_lock was successful (#713)
If Patroni gets partitioned it starts receiving stale information from DCS.
We can't use this information to determine that we have the leader key.
Instead, we will record in Ha object the actual state of acquire/update lock and report as a leader only if it was successful.

P.S. despite responding with 200 on `GET /master` postgres was still running read-only.
2018-08-03 17:00:01 +02:00
Alexander Kukushkin
8a3b78ca7b Rest api thread can raise an exception during shutdown (#711)
catch it and report
2018-06-14 13:17:50 +02:00
Dmitry Dolgov
f0d23b0b14 Merge pull request #706 from zalando/feature/rename-create-replica-method
Rename create_replica_method to create_replica_methods
2018-06-12 14:16:54 +02:00
Alexander Kukushkin
aadd39b0a4 Do crash recovery only when we are sure that postgres was running as master (#707)
pg_controldata reports in this case:
* 'in production'
* 'shutting down'
* 'in crash recovery'
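
A sketch of how the three states listed above could be detected from `pg_controldata` output. It assumes the output is in English (for example produced with `LC_ALL=C`), since the tool's labels are localized; the function names are hypothetical.

```python
import re

# States listed in the commit message above.
MASTER_STATES = ("in production", "shutting down", "in crash recovery")


def cluster_state(controldata_output):
    """Extract the 'Database cluster state' value from pg_controldata output."""
    match = re.search(
        r"^Database cluster state:[ \t]*(.+?)[ \t]*$",
        controldata_output,
        re.MULTILINE,
    )
    return match.group(1) if match else None


def was_running_as_master(controldata_output):
    return cluster_state(controldata_output) in MASTER_STATES
```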
2018-06-12 14:09:09 +02:00
Henning Jacobs
2537147810 #694 handle configuration error (#695)
It is possible to change a lot of parameters at runtime (including `restapi.listen`) by updating the Patroni config file and sending SIGHUP to the Patroni process.

If something was misconfigured it was throwing a weird exception and breaking `restapi` thread.

This PR improves friendliness of error message and avoids breaking of `restapi`.
2018-06-12 14:08:38 +02:00
Alexander Kukushkin
e939304001 Take and apply some parameters from controldata when starting as replica (#703)
* Take and apply some parameters from controldata when starting as replica

https://www.postgresql.org/docs/10/static/hot-standby.html#HOT-STANDBY-ADMIN
There is a set of parameters whose values on the replica must not be smaller than on the primary; otherwise the replica will refuse to start:
* max_connections
* max_prepared_transactions
* max_locks_per_transaction
* max_worker_processes

It might happen that the values of these parameters in the global configuration are not set high enough, which makes it impossible to start a replica without human intervention. Usually it happens when we bootstrap a new cluster from a basebackup.

As a solution to this problem, we take the values of the above parameters from the pg_controldata output and, if the values in the global configuration are not high enough, apply the values taken from pg_controldata and set the `pending_restart` flag.
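
The comparison described above amounts to taking, for each of the four parameters, the larger of the configured value and the value recorded in the control file. A minimal sketch, assuming both inputs are already parsed into dictionaries keyed by the PostgreSQL parameter name (the function name and input shapes are hypothetical):

```python
REPLICA_MIN_PARAMS = (
    "max_connections",
    "max_prepared_transactions",
    "max_locks_per_transaction",
    "max_worker_processes",
)


def effective_parameters(configured, from_controldata):
    """Return (parameters to start with, pending_restart flag)."""
    effective = dict(configured)
    pending_restart = False
    for name in REPLICA_MIN_PARAMS:
        required = from_controldata.get(name)
        if required is None:
            continue
        # A setting missing from the configuration is treated as too low.
        if int(effective.get(name, 0)) < int(required):
            effective[name] = int(required)
            pending_restart = True
    return effective, pending_restart
```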
2018-06-12 14:04:32 +02:00
Alexander Kukushkin
e405e4e03c Workaround to sporadic unit-test failures (#696)
Fixes https://github.com/zalando/patroni/issues/691
2018-06-12 14:00:10 +02:00
erthalion
d037aa8afd Rename create_replica_method to create_replica_methods
To make it clear that it's actually an array
2018-06-12 11:33:13 +02:00
Alexander Kukushkin
856552bd61 Sync replication slots and verify sysid after coming out of pause (#678)
Fixes https://github.com/zalando/patroni/issues/568
and https://github.com/zalando/patroni/issues/674
2018-05-18 12:18:49 +02:00
Oleksii Kliukin
4ce539ba1b Allow options to the basebackup built-in method. (#604)
Options should be specified in the basebackup section, which is optional.
2018-05-18 12:18:35 +02:00
Oleksii Kliukin
1043376e6b Do not exit when encountering invalid system ID. (#669)
Do not exit when the cluster system ID is empty or doesn't pass the validation check. In that case, the cluster most likely needs a reinit; mention it in the result message.
Avoid terminating Patroni, as otherwise reinit cannot happen.
2018-05-18 11:48:15 +02:00
Alexander Kukushkin
ed479fe585 Don't demote master if failed to update leader key in pause (#668)
Fixes https://github.com/zalando/patroni/issues/659
2018-05-18 11:19:56 +02:00
Alexander Kukushkin
5ce18a8045 Improve protection of DCS being accidentally wiped (#680)
We already have a lot of logic in place to prevent failover in such case and restore all keys, but an accidental removal of `/config` key was effectively switching off pause mode for 1 cycle of HA loop.
2018-05-18 11:18:58 +02:00
Alexander Kukushkin
5296336f4a BUGFIX: postmaster start can fail if pid from postmaster.pid is alive (#681)
Upon start postmaster process performs various safety checks if there is a postmaster.pid file in the data directory. Although Patroni already detected that the running process corresponding to the postmaster.pid is not a postmaster, the new postmaster might fail to start, because it thinks that postmaster.pid is already locked.
Important!!! Unlink of postmaster.pid isn't an option in this case, because it has a lot of nasty race conditions.
Luckily there is a workaround to this problem, we can pass the pid from postmaster.pid in the `PG_GRANDPARENT_PID` environment variable and postmaster will ignore it.

You are more likely to hit this problem if you run Patroni and postgres in a Docker container.
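
A sketch of the workaround: read the pid from the first line of `postmaster.pid` and pass it to the new postmaster through `PG_GRANDPARENT_PID`. It assumes the caller has already verified that the process with that pid is not a postmaster; the function name is hypothetical.

```python
import os


def postmaster_env(data_dir, base_env=None):
    """Return an environment for starting postgres, based on `base_env`."""
    env = dict(os.environ if base_env is None else base_env)
    try:
        with open(os.path.join(data_dir, "postmaster.pid")) as pid_file:
            # The first line of postmaster.pid holds the pid.
            stale_pid = int(pid_file.readline().strip())
    except (OSError, ValueError):
        return env  # no usable pid file, nothing to pass
    # Assumes the caller already checked that this pid is not a postmaster.
    env["PG_GRANDPARENT_PID"] = str(stale_pid)
    return env
```

The returned mapping could then be passed as the `env` argument of `subprocess.Popen` when starting postgres.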
2018-05-18 11:18:27 +02:00
Alexander Kukushkin
84f29caf92 Fix race condition in poll_failover_result (#658)
It did not directly affect either failover or switchover, but in some rare cases it was reporting success too early, when the former leader released the lock: `Failed over to "None" instead of "desired-node"`

In addition to that this commit improves logs and status messages by differentiating between failover and switchover.
2018-04-16 17:45:05 +02:00
Alexander Kukushkin
d78790b194 Abort start if attaching to running postgres and cluster not initialized (#661)
Patroni can attach itself to an already running PostgreSQL instance. If that is the first instance "seen" in the given cluster, Patroni for that instance will create the initialize key, grab the leader key and, if the instance is running as a replica, promote it.

Because of this behavior, when a cluster with a master and one or more replicas gets Patroni for each node, it is imperative to start running Patroni on the master node before getting to the replicas.

This commit changes such weird behavior and will abort Patroni start if there is no initialize key in DCS and postgres is running as a replica.

Closes https://github.com/zalando/patroni/issues/655
2018-04-16 17:32:26 +02:00
Alexander Kukushkin
3afd26101b Single user mode was waiting for user input and never finished (#634)
Regression was introduced in https://github.com/zalando/patroni/pull/576
2018-03-02 22:22:43 +01:00
Alexander Kukushkin
c04e7a1798 Write bootstrap.pg_hba into a pg_hba.conf after custom bootstrap (#632)
Fixes https://github.com/zalando/patroni/issues/631
2018-02-26 18:48:56 +01:00