The ZooKeeper implementation heavily relies on a cached version of the cluster view in order to minimize the number of requests. Having stale member information is fine for the Patroni workflow, because it basically relies only on member names and tags.
The `GET /cluster` endpoint is a different case. Since it is exposed to the outside, it might be used for monitoring purposes, and therefore we should show up-to-date member information.
We don't need to rewind when:
1. the replayed location of the former replica is not ahead of the switchpoint
2. the end of the checkpoint record of the former primary is the same as the switchpoint
In order to get the end of the checkpoint record we use `pg_waldump` and parse its output, roughly as sketched below.
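For illustration, one way to derive the end of the checkpoint record from `pg_waldump` output is to read two records starting at the checkpoint LSN and take the LSN of the following record. This is a simplified sketch, not Patroni's exact parsing code; the function name and error handling are ours:

```python
import re
import subprocess

def checkpoint_end_lsn(pg_waldump, wal_dir, timeline, checkpoint_lsn):
    # Read two records starting at the checkpoint LSN; the LSN of the second
    # record marks where the checkpoint record ends.
    proc = subprocess.run([pg_waldump, '-p', wal_dir, '-t', str(timeline),
                           '-s', checkpoint_lsn, '-n', '2'],
                          capture_output=True, text=True)
    lsns = re.findall(r'\blsn: ([0-9A-Fa-f]+/[0-9A-Fa-f]+)', proc.stdout)
    # lsns[0] is the checkpoint record itself, lsns[1] is the record right after it
    return lsns[1] if len(lsns) > 1 else None
```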
Close https://github.com/zalando/patroni/issues/1493
The standby cluster doesn't know about leader elections in the main cluster, and therefore the usual mechanisms for detecting divergence don't work. For example, it could happen that the standby cluster is ahead of the new primary of the main cluster and must be rewound.
There is a way to know that a new timeline has been created: check for the presence of a new history file in pg_wal. If the new file is there, we will start the usual procedure of making sure that we can continue streaming, or we will run pg_rewind.
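A minimal sketch of the idea (the helper name is hypothetical and Patroni's actual check is more thorough):

```python
import os

def promotion_happened(pg_wal_dir, replica_timeline):
    # A history file for a higher timeline appears in pg_wal once the main
    # cluster has elected (promoted) a new primary.
    history_file = os.path.join(pg_wal_dir, '%08X.history' % (replica_timeline + 1))
    return os.path.exists(history_file)
```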
`touch_member()` could be called from the `finally` block of `_run_cycle()`. If it raised an exception, the whole Patroni process crashed.
In order to avoid future crashes we wrap `_run_cycle()` into a try..except block and ask the user to report a BUG.
Close https://github.com/zalando/patroni/issues/1529
When deciding whether a running replica is able to stream from the new primary or must be rewound, we should use the replayed location, therefore we extract the received and replayed locations independently.
Reuse the part of the query that extracts the timeline and locations in the REST API.
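For reference, on PostgreSQL 10+ the two locations can be extracted independently with something like the following (on older versions the functions are `pg_last_xlog_receive_location()`/`pg_last_xlog_replay_location()`; Patroni's actual query is more elaborate and also extracts the timeline):

```python
RECEIVE_REPLAY_SQL = ("SELECT pg_catalog.pg_last_wal_receive_lsn(),"
                      " pg_catalog.pg_last_wal_replay_lsn()")

def received_and_replayed(cursor):
    # Returns the received and replayed locations as two independent values.
    cursor.execute(RECEIVE_REPLAY_SQL)
    received_lsn, replayed_lsn = cursor.fetchone()
    return received_lsn, replayed_lsn
```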
So far Patroni was parsing `recovery.conf` or querying `pg_settings` in order to get the current values of recovery parameters. On PostgreSQL earlier than 12 it could easily happen that the value of `primary_conninfo` in `recovery.conf` has nothing to do with reality. Luckily for us, on PostgreSQL 9.6+ there is a `pg_stat_wal_receiver` view, which contains the current values of `primary_conninfo` and `primary_slot_name`. The password field is masked, but this is fine, because authentication happens only when the connection is opened. We compare all other parameters as usual.
Another advantage of `pg_stat_wal_receiver` is that it contains the current timeline, therefore on 9.6+ we don't need to use the replication connection trick if the walreceiver process is alive.
If there is no walreceiver process or it is not streaming, we stick to the old methods.
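A sketch of the idea on 9.6+ (the helper is hypothetical; the real code compares more parameters):

```python
def walreceiver_info(cursor):
    # Returns (timeline, slot_name, conninfo) when the walreceiver is streaming,
    # otherwise None, so the caller can fall back to the old methods.
    cursor.execute('SELECT status, received_tli, slot_name, conninfo'
                   ' FROM pg_catalog.pg_stat_wal_receiver')
    row = cursor.fetchone()
    if row and row[0] == 'streaming':
        return row[1], row[2], row[3]
```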
When Patroni is trying to figure out whether pg_rewind is necessary, it could write the content of the history file from the primary into the log. The history file grows with every failover/switchover and eventually starts taking up too many lines in the log, most of which are not very useful.
Instead of showing the raw data, we will show only 3 lines before the current replica timeline and 2 lines after.
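Roughly like the following illustration (not the exact implementation; history file lines have the form `timeline<TAB>switchpoint<TAB>reason`):

```python
def trim_history(history_lines, replica_timeline):
    # Keep 3 lines before the line with the current replica timeline,
    # that line itself, and 2 lines after it.
    for i, line in enumerate(history_lines):
        if int(line.split('\t')[0]) >= replica_timeline:
            return history_lines[max(0, i - 3):i + 3]
    return history_lines[-3:]  # replica timeline not found: show the tail
```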
Replicas are waiting for the checkpoint indication via the member key of the leader in DCS. The key is normally updated only once per HA loop.
Without waking the main thread up, replicas would have to wait up to `loop_wait` seconds longer than necessary.
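The mechanism boils down to something like this (all names here are only illustrative, not Patroni's actual classes):

```python
import threading

wakeup = threading.Event()

def run_cycle():
    pass  # placeholder for one iteration of the HA loop

def on_leader_key_change(*args):
    # Called from the DCS watcher as soon as the leader member key changes.
    wakeup.set()

def ha_loop(loop_wait):
    while True:
        run_cycle()
        wakeup.wait(timeout=loop_wait)  # returns early when the watcher fired
        wakeup.clear()
```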
In dynamic environments it is common that etcd nodes change their IP addresses during a rolling upgrade. If the etcd node to which Patroni is currently connected is upgraded last, it could happen that the cached topology no longer contains any live node, and therefore the request can't be retried and fails completely, usually resulting in demotion of the primary.
In order to partially overcome the problem, Patroni is already doing a periodic (every 5 minutes) rediscovery of the etcd cluster topology, but in case of very fast node rotation there was still a possibility of hitting the issue.
This PR is an attempt to address the problem. If the list of nodes is exhausted, Patroni will try to repeat the initial discovery via an external mechanism, like resolving A or SRV DNS records, and if the new list is different from the original one, Patroni will use it as the new etcd cluster topology.
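A simplified sketch of the fallback, assuming plain A-record discovery (the helper name is made up and SRV handling is omitted):

```python
import socket

def rediscover_etcd_hosts(configured_hosts):
    # Re-resolve the originally configured host names to fresh IP addresses
    # once the cached machine list is exhausted.
    hosts = set()
    for name, port in configured_hosts:
        for info in socket.getaddrinfo(name, port, 0, socket.SOCK_STREAM):
            hosts.add('{0}:{1}'.format(info[4][0], info[4][1]))
    return sorted(hosts)
```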
In order to deal with TCP issues, connect_timeout is set to max(read_timeout/2, 1). This makes the list of members exhaust faster, but leaves time to perform topology rediscovery and another attempt.
The third issue addressed by this PR: it could happen that the DNS names of etcd nodes didn't change, but their IP addresses are new, therefore we clean up the internal DNS cache when doing topology rediscovery.
Besides that, this commit makes the `_machines_cache` property pretty much static: it will be updated only when the topology has changed, which helps to avoid concurrency issues.
It is safe to call pg_rewind on a replica only when pg_control on the primary contains information about the latest timeline. Postgres is usually doing an immediate checkpoint right after promote, and in most cases it works just fine. Unfortunately, we regularly receive complaints that it takes too long (minutes) until the checkpoint is done and replicas can't perform the rewind. At the same time, doing the checkpoint manually helped immediately. So Patroni starts doing the same: when the promotion has happened and postgres is no longer running in recovery, we explicitly issue a checkpoint.
We are intentionally not using the AsyncExecutor here, because we want the HA loop to continue its normal flow.
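In essence, the step looks like this (an illustration using a psycopg2-style connection, not the actual code):

```python
def checkpoint_after_promote(connection):
    with connection.cursor() as cursor:
        cursor.execute('SELECT pg_catalog.pg_is_in_recovery()')
        if not cursor.fetchone()[0]:   # promotion finished, we are a primary now
            cursor.execute('CHECKPOINT')
```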
## Feature: Postgres stop timeout
A switchover/failover operation hangs on the signal_stop (or checkpoint) call when the postmaster doesn't respond or hangs for some reason (issue described in [1371](https://github.com/zalando/patroni/issues/1371)). This leads to service loss for an extended period of time, until the hung postmaster starts responding or is killed by some other actor.
### master_stop_timeout
The number of seconds Patroni is allowed to wait when stopping Postgres; effective only when synchronous_mode is enabled. When set to a value > 0 and synchronous_mode is enabled, Patroni sends SIGKILL to the postmaster if the stop operation runs for longer than master_stop_timeout. Set the value according to your durability/availability tradeoff. If the parameter is not set or is set to a value <= 0, master_stop_timeout does not apply.
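A rough sketch of the semantics (not the actual implementation; `postmaster` stands for a hypothetical wrapper around the postmaster process, and the synchronous_mode check is omitted):

```python
import signal
import time

def stop_postgres(postmaster, master_stop_timeout):
    postmaster.send_signal(signal.SIGINT)  # request a fast shutdown
    deadline = time.time() + master_stop_timeout
    while postmaster.is_running():
        if master_stop_timeout > 0 and time.time() > deadline:
            postmaster.send_signal(signal.SIGKILL)  # escalate after the timeout
            break
        time.sleep(1)
```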
```bash
$ python3 patronictl.py -c postgresql0.yml list
Error: Provided config file postgresql0.yml not existing or no read rights. Check the -c/--config-file parameter
```
It is a common issue that the primary recycles WALs when one of the replicas is down for a long time. So far there were only two solutions for this problem, and neither of them is perfect:
1. Increase `wal_keep_segments`, but it is hard to guess the good value.
2. Use continuous archiving and PITR, but it is not always possible.
This PR introduces a way to solve the problem for static clusters with a fixed number of nodes and names that never change. You just need to list the names of all nodes in `slots`, so the primary will not remove the slot when the node is down (not registered in DCS).
Of course, the primary will not create a permanent slot matching its own name.
Usage example: let's assume you have a cluster with nodes named *abc1*, *abc2*, and *abc3*.
You have to run `patronictl edit-config` and put the following snippet into the configuration:
```yaml
slots:
  abc1:
    type: physical
  abc2:
    type: physical
  abc3:
    type: physical
```
If the node *abc2* is the primary, it will always create slots for *abc1* and *abc3* even if they are not running, but will not create slot *abc2*.
Other nodes will behave the same.
Close #280
During shutdown, Patroni tries to update its status in the DCS.
If the DCS is inaccessible, an exception might be raised. The lack of exception handling prevented the logger thread from stopping.
Fixes https://github.com/zalando/patroni/issues/1344
Upon the start of Patroni and Postgres, make sure that unix_socket_directories and stats_temp_directory exist, or try to create them. Patroni will exit if it fails to create them.
Close https://github.com/zalando/patroni/issues/863
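Conceptually the check looks like this (simplified; the real code derives the directory list from the configuration):

```python
import os
import sys

def ensure_directories(directories):
    for d in directories:
        try:
            if not os.path.isdir(d):
                os.makedirs(d)
        except OSError as e:
            sys.exit('Could not create directory {0}: {1}'.format(d, e))
```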
That required a refactoring of the `Config` and `Patroni` classes. Now one has to explicitly create an instance of `Config` before creating `Patroni`.
The `Config` object can optionally call the validate function.
We will try to import only the DCS module for which there is a configuration section.
I.e., if there is only a zookeeper section in the config, Patroni will try to import only `patroni.dcs.zookeeper` and skip `etcd`, `consul`, and `kubernetes` (a sketch of the idea follows the list below).
This approach has two benefits:
1. When the dependencies are not installed, Patroni was showing INFO messages like `Failed to import smth`, which look scary.
2. It reduces memory usage, because sometimes dependencies are heavy.
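A sketch of the module selection (the mapping and function are illustrative; Patroni's real code discovers the modules dynamically):

```python
import importlib

DCS_MODULES = {'etcd': 'patroni.dcs.etcd',
               'zookeeper': 'patroni.dcs.zookeeper',
               'consul': 'patroni.dcs.consul',
               'kubernetes': 'patroni.dcs.kubernetes'}

def load_dcs_module(config):
    # Import only the implementation whose section is present in the config.
    for section, module_name in DCS_MODULES.items():
        if section in config:
            return importlib.import_module(module_name)
    raise RuntimeError('no supported DCS section found in the configuration')
```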
* Implement proper tests for `multiprocessing.set_start_method()`
* Exclude some watchdog code from coverage (it is used only for behave tests)
* Properly use os.path.join for Windows compatibility
* Import DCS modules in `features/environment.py` on demand. It allows running behave tests against the chosen DCS without installing all dependencies.
* Remove some unused behave code
* Fix some minor issues in the dcs.kubernetes module
There is an opinion that LIST requests with a labelSelector to the K8s API are expensive, and Patroni was doing two such requests per HA loop (LIST pods and LIST endpoints/configmaps).
To efficiently detect object changes we will switch to the LIST+WATCH approach.
The initial LIST request populates the ObjectCache and events from the WATCH request update it.
In addition to that, the ObjectCache will be updated after performing the UPDATE operations on the K8s objects. To avoid race conditions, all operations on ObjectCache are performed after comparing the resource_version of the old and the new objects and rejected if the new resource_version value is smaller than the old one.
The disadvantage of such an approach is that it will require keeping three connections to the K8s API from each Patroni Pod (previously it was two).
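To illustrate the pattern, here is a minimal LIST+WATCH loop written against the official `kubernetes` Python client (Patroni ships its own lightweight client, so this is only an illustration; the namespace and label selector are examples):

```python
from kubernetes import client, config, watch

config.load_incluster_config()
v1 = client.CoreV1Api()
selector = 'application=patroni'  # example label selector

# The initial LIST populates the cache...
pods = v1.list_namespaced_pod('default', label_selector=selector)
cache = {p.metadata.name: p for p in pods.items}

# ...and WATCH events keep it up to date.
for event in watch.Watch().stream(v1.list_namespaced_pod, 'default',
                                  label_selector=selector,
                                  resource_version=pods.metadata.resource_version):
    obj = event['object']
    old = cache.get(obj.metadata.name)
    # Reject stale updates by comparing resource_version, as described above.
    if old is None or int(old.metadata.resource_version) <= int(obj.metadata.resource_version):
        if event['type'] == 'DELETED':
            cache.pop(obj.metadata.name, None)
        else:
            cache[obj.metadata.name] = obj
```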
Yesterday I deployed this feature branch on our biggest K8s cluster, with ~300 Patroni pods.
The CPU utilization on K8s master nodes immediately dropped from ~20% to ~10% (a factor of two), and the incoming traffic on master nodes dropped ~7-8 times!
Last but not least, we see more or less the same impact on the etcd cluster behind the K8s master nodes: CPU utilization dropped by nearly half and outgoing traffic ~7-8 times.