If an Etcd node is partitioned from the rest of the cluster, it is still
possible to read from it (although it may return stale information),
but it is not possible to write to it.
Previously Patroni was trying to fetch a fresh cluster view from DCS in
order to figure out whether it was still the leader, and the partitioned
Etcd node kept returning stale information in which the node still owned
the leader key, but with a negative TTL.
This weird bug clearly shows how dangerous premature optimization is.
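The stale read can at least be recognised, because the TTL it reports has already expired. A minimal illustration of that check (the dict layout and helper name are hypothetical, not the Patroni internals):

```python
def effective_leader(leader_key):
    """Return the leader name only if the lease is still alive.

    `leader_key` is assumed to be a dict built from the etcd response,
    e.g. {'value': 'node1', 'ttl': -23}.  A non-positive TTL means the
    lease already expired on the majority side and the read is stale.
    """
    if not leader_key or leader_key.get('ttl', 0) <= 0:
        return None
    return leader_key['value']
```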
We make a number of attempts when trying to initialize a replica using
different methods. Any of these attempts may create files in the data
directory, which causes the subsequent attempts to fail.
In addition to that, improve logging when creating a replica.
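A minimal sketch of the cleanup step this implies (hypothetical helper, not the exact Patroni code): wipe whatever a failed attempt left behind before trying the next replica creation method.

```python
import os
import shutil

def clean_data_directory(data_dir):
    """Remove everything a failed replica-creation attempt may have left."""
    for name in os.listdir(data_dir):
        path = os.path.join(data_dir, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.unlink(path)
```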
* reap children before and after running HA loop
When Patroni is running in a Docker container with pid=1, it is also
responsible for reaping all dead processes. We can't call os.waitpid
immediately after receiving SIGCHLD because it breaks the subprocess
module: it simply stops receiving the exit codes of the processes it
executes, because they have already been reaped. That's why we only
register the fact that SIGCHLD was received and reap children after the
HA loop has been executed.
If the postmaster died for some reason, Patroni was able to detect this
fact only on the next iteration of the HA loop, because the zombie
process was still there and it was still possible to send signal 0 to it.
To avoid this situation we should also reap all dead processes before
executing the HA loop.
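A self-contained sketch of the approach described above (the names are illustrative): the signal handler only records that SIGCHLD arrived, and zombies are collected with a non-blocking waitpid right before and right after the HA loop body.

```python
import os
import signal

_sigchld_received = False

def sigchld_handler(signo, frame):
    # Only record the fact; do not call os.waitpid here, so the subprocess
    # module can still collect the exit codes of its own children.
    global _sigchld_received
    _sigchld_received = True

def reap_children():
    # Called immediately before and after the HA loop body.
    global _sigchld_received
    if not _sigchld_received:
        return
    _sigchld_received = False
    try:
        while os.waitpid(-1, os.WNOHANG)[0] != 0:
            pass  # keep collecting until no more zombies are left
    except OSError:
        pass  # no child processes

signal.signal(signal.SIGCHLD, sigchld_handler)
```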
* Don't rely on _cursor_holder when closing connection
It can happen that the connection has been opened but the cursor has not...
* Don't "retry" when fetching current xlog location and it fails
On every iteration of the HA loop we update the member key in DCS, and
among other data the value contains the current xlog location.
If postgres has died for some reason it is not possible to fetch the
xlog position, and we just waste retry_timeout/2 = 5 seconds there.
If this information is missing from DCS for the duration of one HA loop,
nothing should break. Patroni does not rely on this information anyway:
when it is doing a manual or automatic failover it always communicates
with the other nodes directly to get the freshest information.
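A hedged sketch of the resulting behaviour (the query uses the pre-10 xlog naming, error handling is simplified): fetch the position once, and if postgres cannot answer, just leave the field out of the member key for this cycle.

```python
def current_xlog_location(cursor):
    """Fetch the xlog position without retrying; return None on failure."""
    try:
        cursor.execute("SELECT pg_current_xlog_location()")
        return cursor.fetchone()[0]
    except Exception:
        return None  # the member key simply omits the position this cycle
```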
* Don't try to update leader optime when postgres is not 100% healthy
The `update_lock` method not only updates the leader lock but also
writes the most recent xlog position into the optime/leader key. If we
know that postgres may not be 100% healthy, because it is in the process
of a restart or recovery, we should not try to fetch the current xlog
position and update 'optime/leader'. Previously we were using the
`AsyncExecutor.busy` property to avoid such an action, but I think
we should be more explicit and do the update only if we know that
postgres is 100% healthy.
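A sketch of the resulting logic, assuming hypothetical `is_healthy()`, `last_operation()` and DCS helper names rather than the real Patroni API: the optime/leader key is written only when postgres is known to be fully up.

```python
def update_lock(dcs, state_handler):
    """Renew the leader lock; publish optime only for a healthy postgres."""
    ret = dcs.update_leader()
    if ret and state_handler.is_healthy():  # skip during restart/recovery
        dcs.write_leader_optime(state_handler.last_operation())
    return ret
```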
* Avoid retries when syncing replication slots.
Do not retry the postgres queries that fetch, create and drop slots at the end of
the HA cycle. The complete run_cycle routine executes while holding the
async_executor lock. This lock is also used when scheduling operations like
reinit or restart from different threads. It looks like CPython's threading lock
has fairness issues when multiple threads try to acquire the same lock and one
of them executes long-running actions while holding it: the others have little
chance of acquiring the lock in order. To get around this issue, the long action
(i.e. retrying the query) is removed.
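A small standalone demonstration of that fairness problem (plain `threading.Lock`, nothing Patroni-specific): a thread that re-acquires the lock in a tight loop while doing long work under it can keep another waiter out for many iterations, because releasing a lock does not hand it over to the waiting thread.

```python
import threading
import time

lock = threading.Lock()

def long_holder():
    # Simulates run_cycle doing long (retried) work while holding the lock.
    for _ in range(20):
        with lock:
            time.sleep(0.5)

def impatient():
    # Simulates a reinit/restart request coming from another thread.
    start = time.time()
    with lock:
        print('acquired after %.1fs' % (time.time() - start))

threading.Thread(target=long_holder).start()
time.sleep(0.1)
impatient()  # may wait for several 0.5s slices, not just one
```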
Investigation by Ants Aasma and Alexander Kukushkin.
This error is sent by etcd when Patroni is doing a "watch" on the leader key,
which is never updated after creation, while the etcd cluster receives a lot of
updates, which clears the history of events.
Instead of watching on modifiedIndex + 1 we will watch on X-Etcd-Index,
which is probably still available...
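A sketch of the idea using the python-etcd client (the wrapper itself is hypothetical, and the exact index offset handling is simplified): watch from the cluster-wide index reported in the X-Etcd-Index header rather than from the key's own modifiedIndex.

```python
import etcd

def watch_leader(client, key, last_result):
    # last_result.etcd_index is populated from the X-Etcd-Index response
    # header; it tracks the cluster-wide index, so watching from it avoids
    # the "event index cleared" error that modifiedIndex + 1 can trigger
    # once the key's last change falls out of etcd's bounded event history.
    return client.watch(key, index=last_result.etcd_index)
```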
Instead of emptying the stale failover key as a master and bailing
out, continue with the healthiest-node evaluation. This should make
the actual master acquire the leader key faster. Emit a warning
message as well and add unit tests.
PostgreSQL replication slot names only allow characters from [a-z0-9_].
Invalid characters cause replication slot creation and standby startup to fail.
This change substitutes the invalid characters with underscores or unicode
code points. In case multiple member names map to identical replication slot
names, the master log will contain a corresponding error message.
Motivated by wanting to use hostnames as member names. Hostnames often
contain periods and dashes.
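A sketch of the substitution described above (the exact replacement rules in Patroni may differ slightly): lowercase the member name and map every character outside [a-z0-9_] either to an underscore (for periods and dashes) or to its unicode code point.

```python
import re

def slot_name_from_member_name(member_name):
    """Translate a member name into a valid replication slot name."""
    def replace_char(match):
        c = match.group(0)
        return '_' if c in '-.' else 'u{:04x}'.format(ord(c))
    return re.sub('[^a-z0-9_]', replace_char, member_name.lower())[:63]
```

For example, a member named `db-node1.example.com` would map to the slot `db_node1_example_com`.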
Any node of the cluster will maintain its member key as long as Patroni is
running there.
The master node will also maintain the leader key as long as postgres is
running as a master. If postgres is not running or it is running
'in_recovery', Patroni will release the leader lock.
Bootstrap of a new cluster will work (it is possible to specify
paused: true in the `bootstrap.dcs`). Replicas will also be able to join
the cluster if the leader lock exists.
If postgres is not running on the node, Patroni will not try to bring it
up. Pause mode also disables reinitialize and all kinds of scheduled
actions, i.e. scheduled restart and scheduled failover.
In case DCS stops being reachable, Patroni will not "demote" the master
if automatic failover was disabled.
Patroni will not stop postgres on exit.
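A rough sketch of the paused branch of the HA loop implied by these rules (all names are illustrative, not the real Patroni internals):

```python
def run_paused_cycle(ha):
    ha.touch_member()                # the member key is always maintained
    if ha.has_lock():
        if ha.state_handler.is_leader():
            return ha.update_lock()  # keep the leader key while postgres is a master
        ha.release_leader_key()      # postgres is gone or running in recovery
    # no automatic start of postgres, no reinitialize,
    # and no scheduled restarts/failovers while paused
    return 'PAUSE: no action'
```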
When Patroni was "joining" already running postgres it was not calling
callbacks, what in some cases causing issues (callback could be used to
change routing/load-balancer or assign/remove floating (service) ip.
In addition to that we should `start` postgres instead of `restart`-ing
it when doing recovery, because in this case 'on_start' callback should
be called, instead of 'on_restart'
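An illustrative fragment of the intended behaviour (the method names here are hypothetical, not the Patroni API): fire the callback when attaching to an already running postgres, and prefer `start` over `restart` during recovery so that 'on_start' is the callback that runs.

```python
def attach_or_recover(postgresql):
    if postgresql.is_running():
        # joining an already running postgres: callbacks must still fire so
        # routing/load-balancer state and floating IPs get updated
        postgresql.call_callback('on_start')
    else:
        # recovery path: start (not restart), so 'on_start' is invoked
        postgresql.start()
```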
In the original code we were parsing and deparsing url-style connection
strings back and forth. That was not really resource hungry but rather
annoying. Also it was not really obvious how to switch all local
connections to unix sockets (preferably).
This commit isolates the different use-cases of working with connection
strings and minimizes the amount of code parsing and deparsing them. It also
introduces one new helper method in the `Member` object - `conn_kwargs`.
This method can accept as a parameter a dict object with credentials
(username and password). As a result it returns a dict object which can
be used by `psycopg2.connect` or for building connection urls for
pg_rewind, pg_basebackup or some other replica creation methods.
Params for the local connection are built in the `_local_connect_kwargs`
method and can easily be changed to a unix socket later.
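A minimal self-contained sketch of the idea behind `conn_kwargs` (not the exact Patroni implementation): turn a member's url-style connection string into keyword arguments for `psycopg2.connect`, merging in the supplied credentials.

```python
from urllib.parse import urlparse

def conn_kwargs(conn_url, auth=None):
    r = urlparse(conn_url)
    ret = {'host': r.hostname, 'port': r.port or 5432, 'database': r.path[1:]}
    if auth:
        if auth.get('username'):
            ret['user'] = auth['username']
        if auth.get('password'):
            ret['password'] = auth['password']
    return ret

# e.g. psycopg2.connect(**conn_kwargs('postgres://10.0.0.1:5432/postgres',
#                                     {'username': 'replicator', 'password': 'secret'}))
```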
There is no point in trying to update the topology until the original
request has been performed. Also, for us it is more important to execute
the original request than to keep the topology of the etcd cluster in sync.
In addition to that, implement the same retry-timeout logic in the
`machines` property that is already used in the `api_execute` method.
* Make different kazoo timeouts dependent on loop_wait
ping timeout ~ 1/2 * loop_wait
connect_timeout ~ 1/2 * loop_wait
Originally these values were calculated from the negotiated session timeout
and didn't work very well, because it took significant time (up to the
session timeout) to figure out that the connection was dead and to
reconnect, leaving us no time to retry.
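The resulting relationship, written out as a trivial sketch (the exact ratios in the code may differ):

```python
def kazoo_timeouts(loop_wait):
    # both timeouts track the HA loop interval instead of the negotiated
    # session timeout, so a dead connection is noticed within one cycle
    ping_timeout = loop_wait / 2.0
    connect_timeout = loop_wait / 2.0
    return ping_timeout, connect_timeout
```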
* Address the code review
- We don't want to export the RestApi object, since it initializes the
socket and listens on it.
- Change get_dcs, so that the explicit scope passed to it will take
priority over the one in the configuration file.
In particular, replace the fixed dates for future actions
in the unit tests with dates that depend on the current date,
avoiding the "timebomb" effect.