Commit Graph

5 Commits

Author SHA1 Message Date
Ants Aasma
32b0768631 Fix watchdog on Python 3 (#531)
A misunderstanding of the ioctl() call interface. If mutable=False then fcntl.ioctl() actually returns the arg buffer back.
This accidentally worked on Python2 because int and str comparison did not return an error.
Error reporting is actually done by raising IOError on Python2 and OSError on Python3.

* Properly handle errors in set_timeout(), have them result in only a warning if watchdog support is not required.

* Improve watchdog device driver name display on Python3

* Eliminate race condition in watchdog feature tests.
  The pinged/closed states were not getting reset properly if the checks ran too quickly.
  Add explicit reset points in feature test so the check is unambiguous.
2017-09-29 10:27:10 +02:00
Ants Aasma
70d718a058 Simplify watchdog code (#452)
* Only activate watchdog while master and not paused

We don't really need the protections while we are not master. This way
we only need to tickle the watchdog when we are updating leader key or
while demotion is happening.

As implemented we might fail to notice to shut down the watchdog if
someone demotes postgres and removes leader key behind Patroni's back.
There are probably other similar cases. Basically if the administrator
if being actively stupid they might get unexpected restarts. That seems
fine.

* Add configuration change support. Change MODE_REQUIRED to disable leader eligibility instead of closing Patroni.

Changes watchdog timeout during the next keepalive when ttl is changed. Watchdog driver and requirement can also be switched online.

When watchdog mode is `required` and watchdog setup does not work then the effect is similar to nofailover. Add watchdog_failed to status API to signify this. This is True only when watchdog does not work **AND** it is required.

* Reset implementation when config changed while active.

* Add watchdog safety margin configuration

Defaults to 5 seconds. Basically this is the maximum amount of time
that can pass between the calls to odcs.update_leader()` and
`watchdog.keepalive()`, which are called right after each other. Should
be safe for pretty much any sane scenario and allows the default
settings to not trigger watchdog when DCS is not responding.

* Cancel bootstrap if watchdog activation fails

The system would have demoted itself anyway the next HA loop. Doing it
in bootstrap gives at least some other node chance to try bootstrapping
in the hope that it is configured correctly.

If all nodes are unable to activate they will continue to try until the
disk is filled with moved datadirs. Perhaps not ideal behavior, but as
the situation is unlikely to resolve itself without administrator
intervention it doesn't seem too bad.
2017-07-27 12:16:11 +02:00
Alexander Kukushkin
d5b3d94377 Custom bootstrap (#454)
Task of restoring a cluster from backup or cloning existing cluster into a new one was floating around for some time. It was kind of possible to achieve it by doing a lot of manual actions and very error prone. So I come up with the idea of making the way how we bootstrap a new cluster configurable.

In short - we want to run a custom script instead of running initdb.
2017-07-18 15:12:58 +02:00
Alexander Kukushkin
acc6d7c2c2 Watchdog unit-tests, bugfixes and questions (#449)
Implement missing unit-tests for and drop unused code
2017-07-11 10:00:30 +02:00
Ants Aasma
a70b46ef13 Add watchdog support on Linux (#343)
Ensures that system gets rebooted before TTL runs out.

Initial version. Open questions:

    Do we want to disable watchdog while we are not master?
2017-06-01 16:53:46 +02:00