and run raft behave tests with encryption enabled.
The new `pysyncobj` release allowed us to get rid of a lot of hacks that accessed private properties and methods of the parent class, and to reduce the size of `raft.py`.
Close https://github.com/zalando/patroni/issues/1746
This commit makes it possible to configure the maximum lag (`maximum_lag_on_syncnode`) after which Patroni will "demote" the node from synchronous and replace it with another node.
The previous implementation always tried to stick to the same synchronous nodes (even if they were not the optimal ones).
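A sketch of how this could be configured, assuming `maximum_lag_on_syncnode` lives next to the other `maximum_lag_*` settings in the dynamic (DCS) configuration, e.g. via `patronictl edit-config` (the lag value is illustrative):
```yaml
synchronous_mode: true
maximum_lag_on_syncnode: 1048576   # bytes; illustrative value
```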
1. The `ttl` was incorrectly returned 1000 times higher than it should be.
2. The `watch()` method must return `True` if the parent method returned `True`. Not doing so resulted in an incorrect calculation of the sleep time.
3. Move the mock of the Exhibitor API to `features/environment.py`; it simplifies testing with behave.
When running on K8s, Patroni communicates with the API via the `kubernetes` service, whose address is exposed via the
`KUBERNETES_SERVICE_HOST` environment variable. Like any other service, the `kubernetes` service is handled by `kube-proxy`, which, depending on its configuration, relies either on a userspace program or on `iptables` for traffic routing.
During a K8s upgrade, when master nodes are replaced, it is possible that `kube-proxy` doesn't update the service configuration in time, and as a result Patroni fails to update the leader lock and demotes Postgres.
In order to improve the user experience and get more control over the problem, we make it possible to bypass the `kubernetes` service and connect directly to the API nodes.
The strategy is very simple:
1. Resolve the list of API node IPs from the `kubernetes` endpoint on every iteration of the HA loop.
2. Stick to one of these IPs for API requests.
3. Switch to a different IP if the one currently in use is no longer in the list.
4. If a request fails, switch to another IP and retry.
Such a strategy is already used for Etcd and has proven to work quite well.
In order to enable the feature, you need to set `kubernetes.bypass_api_service` to `true` in the Patroni configuration file or set the `PATRONI_KUBERNETES_BYPASS_API_SERVICE` environment variable.
If for some reason `GET /default/endpoints/kubernetes` isn't allowed, Patroni will disable the feature.
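For example, the relevant part of the configuration file would look like this:
```yaml
kubernetes:
  bypass_api_service: true
```
or, equivalently, export `PATRONI_KUBERNETES_BYPASS_API_SERVICE=true` in the environment.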
The only python-etcd3 client that works directly via gRPC still supports only a single endpoint, which is not very nice for high availability.
Since Patroni is already using a heavily hacked version of python-etcd with smart retries and auto-discovery out of the box, I decided to enhance the existing code with limited support for the v3 protocol via the gRPC-gateway.
Unfortunately, watches via the gRPC-gateway require us to open and keep a second connection to etcd.
Known limitations:
* The minimal supported version is 3.0.4. On earlier versions transactions don't work due to bugs in grpc-gateway, and without transactions we can't do atomic operations, i.e. leader locks.
* Watches work only starting from 3.1.0
* Authentication works only starting from 3.3.0
* gRPC-gateway does not support authentication using the TLS Common Name. This is because gRPC-proxy terminates TLS from its client, so all clients share the proxy's cert: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/authentication.md#using-tls-common-name
* A new node can join the cluster dynamically and become a part of the consensus.
* It is also possible to join only the Patroni cluster (without adding the node to the Raft); just comment out or remove `raft.self_addr` for that.
* When a node joins the cluster, it uses the values from `raft.partner_addrs` only for initial discovery.
* It is possible to run Patroni and Postgres on two nodes plus one node with `patroni_raft_controller` (without Patroni and Postgres). In such a setup one can temporarily lose one node without affecting the primary.
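A rough configuration sketch for one node (the addresses and the data directory path are illustrative):
```yaml
raft:
  data_dir: /var/lib/patroni/raft   # illustrative path
  self_addr: 10.0.0.1:2222          # remove to join only the Patroni cluster
  partner_addrs:
    - 10.0.0.2:2222
    - 10.0.0.3:2222
```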
The official Python Kubernetes client contains a lot of auto-generated code and is therefore very heavy, but we need only a small fraction of it.
A naive implementation that covers all the API methods we use takes about 250 LoC, and about half of it is responsible for handling configuration files.
Disadvantage: if somebody was using `patronictl` outside of the pod (on their own machine), it might not work anymore (depending on the environment).
* Implement proper tests for `multiprocessing.set_start_method()`
* Exclude some watchdog code from coverage (it is used only for behave tests)
* Properly use `os.path.join` for Windows compatibility
* Import DCS modules in `features/environment.py` on demand; this allows running behave tests against the chosen DCS without installing all dependencies.
* Remove some unused behave code
* Fix some minor issues in the dcs.kubernetes module
First of all, this patch changes the behavior of the `on_start`/`on_restart` callbacks: they will be called only when Postgres is started or restarted without a role change. If the member is promoted or demoted, only the `on_role_change` callback will be executed.
Before that, `on_role_change` was never called for the standby leader; only `on_start`/`on_restart` were, and with a wrong role argument.
In addition to that, the REST API will return the `standby_leader` role for the leader of a standby cluster.
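For reference, a sketch of how the affected callbacks are wired up in the configuration (the script path is hypothetical; Patroni passes the action, the role, and the cluster name to the script):
```yaml
postgresql:
  callbacks:
    on_start: /usr/local/bin/patroni_callback.sh
    on_restart: /usr/local/bin/patroni_callback.sh
    on_role_change: /usr/local/bin/patroni_callback.sh
```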
Closes https://github.com/zalando/patroni/issues/988
Permanent replication slots are preserved on failover/switchover, that is, Patroni on the new primary will create the configured replication slots right after promotion.
Slots can be configured with the help of `patronictl edit-config`.
The initial configuration can also be done in `bootstrap.dcs`:
```yaml
slots:
  permanent_physical_1:
    type: physical
  permanent_logical_1:
    type: logical
    database: foo
    plugin: pgoutput
```
It is the responsibility of the operator to make sure that there are no name clashes between the replication slots automatically created by Patroni for members and the permanent replication slots.
Closes https://github.com/zalando/patroni/issues/656
* Take and apply some parameters from controldata when starting as replica
https://www.postgresql.org/docs/10/static/hot-standby.html#HOT-STANDBY-ADMIN
There is a set of parameters whose values on the replica must not be smaller than on the primary, otherwise the replica will refuse to start:
* max_connections
* max_prepared_transactions
* max_locks_per_transaction
* max_worker_processes
It might happen that the values of these parameters in the global configuration are not set high enough, which makes it impossible to start a replica without human intervention. Usually this happens when we bootstrap a new cluster from a basebackup.
As a solution to this problem, we take the values of the above parameters from the `pg_controldata` output and, if the values in the global configuration are not high enough, apply the values taken from `pg_controldata` and set the `pending_restart` flag.
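For illustration, these are the settings in question as they would appear in the global configuration (the values are placeholders); if `pg_controldata` reports higher values, those are applied instead and the node is marked with `pending_restart`:
```yaml
postgresql:
  parameters:
    max_connections: 100
    max_prepared_transactions: 0
    max_locks_per_transaction: 64
    max_worker_processes: 8
```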
* Use ConfigMaps or Endpoints for leader election and to keep the cluster state
* Label pods with a postgres role
* Change the behavior of `pip install`. From now on it will not install all dependencies; you have to explicitly specify the DCS you want to use Patroni with: `pip install patroni[etcd,zookeeper,kubernetes]`
* Only activate watchdog while master and not paused
We don't really need the protections while we are not the master. This way
we only need to tickle the watchdog when we are updating the leader key or
while a demotion is happening.
As implemented, we might fail to notice that the watchdog should be shut
down if someone demotes Postgres and removes the leader key behind Patroni's back.
There are probably other similar cases. Basically, if the administrator
is being actively stupid, they might get unexpected restarts. That seems
fine.
* Add configuration change support. Change MODE_REQUIRED to disable leader eligibility instead of closing Patroni.
The watchdog timeout is changed during the next keepalive when the `ttl` changes. The watchdog driver and the requirement can also be switched online.
When the watchdog mode is `required` and the watchdog setup does not work, the effect is similar to `nofailover`. Add `watchdog_failed` to the status API to signify this; it is `True` only when the watchdog does not work **and** it is required.
* Reset implementation when config changed while active.
* Add watchdog safety margin configuration
Defaults to 5 seconds. Basically this is the maximum amount of time
that can pass between the calls to `dcs.update_leader()` and
`watchdog.keepalive()`, which are called right after each other. Should
be safe for pretty much any sane scenario and allows the default
settings to not trigger watchdog when DCS is not responding.
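A sketch of the resulting watchdog section in the configuration file:
```yaml
watchdog:
  mode: required        # off, automatic, or required
  device: /dev/watchdog
  safety_margin: 5      # max seconds between dcs.update_leader() and watchdog.keepalive()
```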
* Cancel bootstrap if watchdog activation fails
The system would have demoted itself anyway on the next HA loop. Doing it
in bootstrap gives at least some other node a chance to try bootstrapping
in the hope that it is configured correctly.
If all nodes are unable to activate they will continue to try until the
disk is filled with moved datadirs. Perhaps not ideal behavior, but as
the situation is unlikely to resolve itself without administrator
intervention it doesn't seem too bad.
The task of restoring a cluster from a backup or cloning an existing cluster into a new one has been floating around for some time. It was kind of possible to achieve with a lot of manual actions, but it was very error prone. So I came up with the idea of making the way we bootstrap a new cluster configurable.
In short - we want to run a custom script instead of running initdb.
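A rough sketch of what such a configuration could look like (the method name and the script path are hypothetical):
```yaml
bootstrap:
  method: clone_from_backup
  clone_from_backup:
    command: /usr/local/bin/restore_from_backup.sh   # hypothetical restore script
```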
For backward compatibility, this feature (connecting to Postgres via unix sockets) is not enabled by default. To enable it you have to set `postgresql.use_unix_socket: true`.
If the feature is enabled and `unix_socket_directories` is defined and non-empty, Patroni will use the first suitable value from it to connect to the local Postgres cluster.
If `unix_socket_directories` is not defined, Patroni will assume that the default value should be used, will not pass `host` in the command-line arguments, and will omit it from the connection URL.
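A sketch of the relevant configuration (the socket directory is illustrative and may be left out to rely on the compiled-in default):
```yaml
postgresql:
  use_unix_socket: true
  parameters:
    unix_socket_directories: /var/run/postgresql   # illustrative
```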
Solves: https://github.com/zalando/patroni/issues/61
In addition to the above, this commit fixes a couple of bugs:
* Manual failover with pg_rewind in the pause state was broken.
* psycopg2 (or libpq, I am not really sure which exactly) doesn't mark the cursor's connection as closed when we use a unix socket and an `OperationalError` occurs. We will close such connections on our own.
Previously we were running pg_rewind only in a limited number of cases:
* when we knew Postgres was a master (no recovery.conf in the data dir)
* when we were doing a manual switchover to a specific node (no guarantee that this node is the most up-to-date)
* when a given node has the nofailover tag (it could be ahead of the new master)
This approach was kind of working in most cases, but sometimes we
were executing pg_rewind when it was not necessary, and in some other
cases we were not executing it although it was needed.
The main idea of this PR is to first figure out whether we really need
to run pg_rewind, by analyzing the timeline id, LSN, and history file on the master
and the replica, and to run it only if it's needed.
Originally, Exhibitor was supported in the ZooKeeper class, and the
configuration for Exhibitor was also taken from the `zookeeper` section in
the yaml config file. In fact, Exhibitor just extends ZooKeeper; this is now
reflected in the code, and Exhibitor got its own section in
the config.yaml file. It will make it easier to configure Exhibitor
hosts and port via environment variables when PR#211 is merged.
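With that change, the new section might look roughly like this (the hostnames are illustrative):
```yaml
exhibitor:
  hosts:
    - exhibitor-1.example.com
    - exhibitor-2.example.com
  port: 8181   # Exhibitor REST API port
```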