- added pyrightconfig.json with typeCheckingMode=strict
- added type hints to all files except api.py
- added type stubs for dns, etcd, consul, kazoo, pysyncobj and other modules
- added type stubs for psycopg2 and urllib3 with some little fixes
- fixes most of the issues reported by pyright
- remaining issues will be addressed later, along with enabling CI linting task
Previously it was called via `pg_ctl`, what required a special quoting of parameters passed to `initdb`.
Co-authored-by: Israel <israel.barth@enterprisedb.com>
* pg_rewind error messages contain '/' as directory separator
* fix Raft unit tests on win
* fix validator unit tests on win
* fix keepalive unit tests on win
* make standby cluster behave tests less shaky
Upon the start of Patroni and Postgres make sure that unix_socket_directories and stats_temp_directory exist or try to create them. Patroni will exit if failed to create them.
Close https://github.com/zalando/patroni/issues/863
* Use `shutil.move` instead of `os.replace`, which is available only from 3.3
* Introduce standby-leader health-check and consul service
* Improve unit tests, some lines were not covered
* rename `assertEquals` -> `assertEqual`, due to deprecation warning
Previously pg_ctl waited for a timeout and then happily trodded on considering PostgreSQL to be running. This caused PostgreSQL to show up in listings as running when it was actually not and caused a race condition that resulted in either a failover or a crash recovery or a crash recovery interrupted by failover and a missed rewind.
This change adds a master_start_timeout parameter and introduces a new state for the main run_cycle loop: starting. When master_start_timeout is zero we will fail over as soon as there is a failover candidate. Otherwise PostgreSQL will be started, but once master_start_timeout expires we will stop and release leader lock if failover is possible. Once failover succeeds or fails (no leader and no one to take the role) we continue with normal processing. While we are waiting for the master timeout we handle manual failover requests.
* Introduce timeout parameter to restart.
When restart timeout is set master becomes eligible for failover after that timeout expires regardless of master_start_time. Immediate restart calls will wait for this timeout to pass, even when node is a standby.
If Patroni was started in a docker with pid=1 it will execute itself
with the same arguments. The original process will take care about init
process duties, i.e. handle sigchld and reap dead orphan processes.
Also it will forward SIGINT, SIGHUP, SIGTERM and some other signals to
the real Patroni process.
It was causing patroni failing to stop after receiving SIGTERM.
Acceptance tests was killing it with SIGKILL which was causing further tests fail because postgres was still running:
2016-06-16 14:36:24,444 INFO: no action. i am the leader with the lock
2016-06-16 14:36:25,448 INFO: Lock owner: postgres0; I am postgres0
2016-06-16 14:36:25,452 ERROR: Failed to update /service/batman/optime/leader
Traceback (most recent call last):
File "/home/akukushkin/git/patroni/patroni/dcs/zookeeper.py", line 208, in write_leader_optime
self._client.retry(self._client.set, path, last_operation)
File "/home/akukushkin/git/patroni/py2/local/lib/python2.7/site-packages/kazoo/client.py", line 273, in _retry
return self._retry.copy()(*args, **kwargs)
File "/home/akukushkin/git/patroni/py2/local/lib/python2.7/site-packages/kazoo/retry.py", line 123, in __call__
return func(*args, **kwargs)
File "/home/akukushkin/git/patroni/py2/local/lib/python2.7/site-packages/kazoo/client.py", line 1219, in set
return self.set_async(path, value, version).get()
File "/home/akukushkin/git/patroni/py2/local/lib/python2.7/site-packages/kazoo/handlers/utils.py", line 74, in get
self._condition.wait(timeout)
File "/usr/lib/python2.7/threading.py", line 340, in wait
waiter.acquire()
File "/home/akukushkin/git/patroni/patroni/utils.py", line 219, in sigterm_handler
sys.exit()
SystemExit
2016-06-16 14:36:25,453 INFO: no action. i am the leader with the lock
2016-06-16 14:36:26,443 INFO: Lock owner: postgres0; I am postgres0
2016-06-16 14:36:26,444 INFO: no action. i am the leader with the lock
query method in an api.py also needs retry in some cases (for example
when we are running is_healthiest_node check).
In all cases we should retry only when connection is closed or broken.
BUT, the connection status must be checked via cursor.connection (old
implementation was using general connection object for that). For
multi-threaded applications this is not appropriate, because some other
thread might restore connection.
In addition to that I've changed most of the unit tests to use `Mock` and
`patch` where it is possible.
In case if it was interrupted by SIGCHLD but scheduled awake time is not
reached it will continue sleeping. For all other signals behaviuor is
not changed.