When switching certificates there is a race condition with concurrent API requests. If a request is active during the replacement, the replacement errors out with a "port in use" error and Patroni gets stuck in a state without an active API server.
The fix is to call `server_close()` after `shutdown()`, which waits for already running requests to complete before returning.
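A minimal sketch of that sequence, assuming a ThreadingMixIn-based HTTP server (the surrounding function and names are illustrative, not Patroni's actual code):

```python
# Hedged sketch: restart the API server when certificates change.
def restart_api_server(old_server, start_new_server):
    old_server.shutdown()       # stop the serve_forever() loop
    old_server.server_close()   # wait for in-flight requests, release the port
    return start_new_server()   # now safe to bind the same address:port again
```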
Close #2184
This field notes the last time (as a Unix timestamp) a cluster member successfully communicated with the DCS. It is useful to identify and/or analyze network partitions.
Also, expose dcs_last_seen in the MemberStatus class and its from_api_response() method.
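For illustration only (hypothetical address, error handling omitted), the field appears in the member status returned by the REST API and holds a Unix timestamp:

```python
# Illustrative only: compute how long ago this member last talked to the DCS.
import time
import requests

status = requests.get('http://127.0.0.1:8008/patroni').json()
seconds_since_dcs_contact = time.time() - status['dcs_last_seen']
```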
If configured, only IPs that match the rules will be allowed to call unsafe endpoints.
In addition to that, it is possible to automatically add the IPs of cluster members to the list.
If neither of the above is configured, the old behavior is retained.
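A rough sketch of the intended check (the helper below is illustrative, not Patroni's actual code):

```python
import ipaddress

def unsafe_endpoint_allowed(client_ip, allowlist_rules, member_ips=()):
    # allowlist_rules: configured IPs/networks; member_ips: optionally the
    # addresses of cluster members, automatically added to the list
    networks = [ipaddress.ip_network(rule, strict=False)
                for rule in list(allowlist_rules) + list(member_ips)]
    if not networks:
        return True  # nothing configured: keep the old behavior
    return any(ipaddress.ip_address(client_ip) in net for net in networks)
```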
Partially address https://github.com/zalando/patroni/issues/1734
Promoting the standby cluster requires updating load-balancer health checks, which is inconvenient and easy to forget.
To solve this, we change the behavior of the `/leader` health-check endpoint: it will return 200 regardless of whether PostgreSQL is running as the primary or as the standby_leader.
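A simplified sketch of the new decision (not the actual handler code):

```python
def leader_check_status(has_leader_lock):
    # Both a primary and a standby_leader that own the leader lock return 200,
    # so load-balancer health checks keep working after a standby cluster is
    # promoted; a failure status is returned otherwise.
    return 200 if has_leader_lock else 503
```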
Effectively, this PR consists of a few changes:
1. The easy part:
If permanent logical slots are defined in the global configuration, Patroni on the primary will not only create them but also periodically update DCS with the current values of `confirmed_flush_lsn` for all these slots.
In order to reduce the number of interactions with DCS, the new `/status` key was introduced. It contains a JSON object with `optime` and `slots` keys. For backward compatibility the `/optime/leader` key will still be updated if there are members running an old Patroni version in the cluster.
2. The tricky part:
On replicas that are eligible for failover, Patroni creates the logical replication slot by copying the slot file from the primary and restarting the replica. To copy the slot file, Patroni opens a connection to the primary with `rewind` or `superuser` credentials and calls the `pg_read_binary_file()` function.
When the logical slot already exists on the replica, Patroni periodically calls the `pg_replication_slot_advance()` function, which moves the slot forward (see the sketch after this list).
3. Additional requirements:
To ensure that the primary doesn't clean up catalog tuples that are still required for logical decoding, Patroni enables `hot_standby_feedback` on replicas with logical slots and on cascading replicas that are used for streaming by replicas with logical slots.
4. When logical slots are copied from the primary to the replica, there is a time window during which it might not be safe to use them after promotion. Right now there is no protection against promoting such a replica, but Patroni will show a warning with the names of the slots that might not be safe to use.
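A hedged sketch of the idea behind item 2 (not Patroni's actual code): take the `confirmed_flush_lsn` values the primary publishes in the `slots` section of the `/status` key and move the replica's slots forward with `pg_replication_slot_advance()`. The structure of `slots_from_status` and the integer-to-LSN conversion are assumptions made for illustration.

```python
import psycopg2


def format_lsn(lsn_int):
    # absolute WAL position in bytes -> textual X/X LSN form
    return '{0:X}/{1:X}'.format(lsn_int >> 32, lsn_int & 0xFFFFFFFF)


def advance_logical_slots(conn, slots_from_status):
    # slots_from_status: e.g. {'cdc_slot': 83886432, ...}, taken from the
    # "slots" object of the /status key written by the primary
    with conn.cursor() as cur:
        for name, confirmed_flush_lsn in slots_from_status.items():
            cur.execute("SELECT pg_replication_slot_advance(%s, %s::pg_lsn)",
                        (name, format_lsn(confirmed_flush_lsn)))
    conn.commit()
```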
Compatibility.
The `pg_replication_slot_advance()` function is only available starting from PostgreSQL 11. For older Postgres versions Patroni will refuse to create the logical slot on the primary.
The old "permanent slots" feature, which creates logical slots right after promotion and before allowing connections, was removed.
Close: https://github.com/zalando/patroni/issues/1749
This patch fixes the error handling of cases where there are runtime errors in `socketserver`, for example when creating a new thread (to handle a request) fails.
`get_request` handles SSL connections by replacing the new client socket with a tuple `(server_socket, new_client_socket)`, in order to later deal with the handshake in `process_request_thread`.
During the processing of a request, the socketserver `BaseServer` calls `handle_request()`, which calls `_handle_request_noblock()`, which in turn calls the following functions (https://github.com/python/cpython/blob/3.8/Lib/socketserver.py#L303):
```
request, client_address = self.get_request()
if self.verify_request(request, client_address):
    try:
        self.process_request(request, client_address)
    except Exception:
        self.handle_error(request, client_address)
        self.shutdown_request(request)
```
- `get_request` is overridden in Patroni and returns `request` as a tuple for SSL connections
- `verify_request` defaults to `return True`; it would need adjusting if it were used, but is fine in this case
- `process_request` just calls `process_request_thread` (which is overridden in Patroni and handles tuple-style requests)
- `handle_error` is overridden in Patroni and handles tuple-style requests
- but `shutdown_request` is not overridden and thus lacks support for tuple-style requests
This patch adds support for tuple-style requests to the Patroni REST API.
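A hedged sketch of the fix (the class layout is simplified; the real server class in Patroni has much more to it):

```python
import socketserver
from http.server import HTTPServer


class RestApiServer(socketserver.ThreadingMixIn, HTTPServer):
    # get_request()/process_request_thread()/handle_error() already handle
    # the (server_socket, client_socket) tuple used for TLS requests

    def shutdown_request(self, request):
        if isinstance(request, tuple):
            request = request[1]   # unwrap the actual client socket
        super(RestApiServer, self).shutdown_request(request)
```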
* update release notes
* bump version
* change the default alignment in patronictl table output to `left`
* add missing tests
* add missing pieces to the documentation
The new parameter `synchronous_node_count` is used by Patroni to manage the number of synchronous standby databases. It is set to 1 by default and has no effect when `synchronous_mode` is set to off. When enabled, Patroni maintains the precise number of synchronous standby databases given by `synchronous_node_count` and adjusts the state in DCS and `synchronous_standby_names` as members join and leave.
This functionality can be further extended in the future to support priority-based (FIRST n) and quorum-based (ANY n) synchronous replication.
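For illustration only (the helper is hypothetical and the exact value Patroni generates may differ), the resulting setting follows PostgreSQL's `num_sync ( standby_name [, ...] )` syntax:

```python
def build_synchronous_standby_names(synchronous_node_count, sync_members):
    # e.g. synchronous_node_count=2, sync_members=['node2', 'node3'] ->
    # '2 ("node2","node3")'
    quoted = ','.join('"{0}"'.format(name) for name in sync_members)
    return '{0} ({1})'.format(synchronous_node_count, quoted)
```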
For a DCS other than `kubernetes` it was failing with an exception because `cluster.config` was `None`, but on Kubernetes it was happily creating the config annotation, which prevented writing the bootstrap configuration after the bootstrap finished.
- ``GET /replica?lag=<max-lag>``: replica check endpoint.
- ``GET /asynchronous?lag=<max-lag>`` or ``GET /async?lag=<max-lag>``: asynchronous standby check endpoint.
Checks replication latency and returns status code **200** only when the latency is below the specified value. For performance reasons, the leader_optime key from DCS is used as the leader WAL position when computing the latency on the replica. Please note that the value in leader_optime might be a couple of seconds old (depending on loop_wait).
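An illustrative health-check call (hypothetical address; the lag threshold is assumed here to be in bytes):

```python
import requests

# 200 when the replica's lag (relative to leader_optime from DCS) is below
# the threshold, a failure status code otherwise
response = requests.get('http://127.0.0.1:8008/replica?lag=1048576')
print(response.status_code)
```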
Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
They could be useful to eliminate "unhealthy" pods from the endpoint subsets' addresses when a K8s service with label selectors is used.
Real-life example: the node where the primary was running has failed and is being shut down, and Patroni can't update (remove) the role label.
Therefore on OpenShift the leader service will have two pods assigned, one of them being the failed primary.
With the readiness probe defined, the failed primary pod will be excluded from the list.
The `SSLSocket` performs the TLS handshake immediately on accept. Effectively this blocks the whole API thread if the client side never sends any data.
To solve the issue we defer the handshake until the thread serving the request has started.
The solution is a bit hacky, but thread-safe.
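A simplified sketch of the approach (class layout, certificate handling, and error handling are illustrative; this is not Patroni's exact code):

```python
import ssl
import socketserver
from http.server import HTTPServer


class DeferredHandshakeServer(socketserver.ThreadingMixIn, HTTPServer):

    def __init__(self, server_address, handler_class, certfile, keyfile):
        super().__init__(server_address, handler_class)
        self.context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        self.context.load_cert_chain(certfile, keyfile)

    def get_request(self):
        sock, addr = super().get_request()
        # do_handshake_on_connect=False: accept() no longer blocks on a
        # client that connects but never sends anything
        sock = self.context.wrap_socket(sock, server_side=True,
                                        do_handshake_on_connect=False)
        return sock, addr

    def process_request_thread(self, request, client_address):
        try:
            request.do_handshake()  # blocks only this worker thread
        except (ssl.SSLError, OSError):
            self.shutdown_request(request)
            return
        super().process_request_thread(request, client_address)
```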
Close https://github.com/zalando/patroni/issues/1545
The ZooKeeper implementation heavily relies on a cached version of the cluster view in order to minimize the number of requests. Having stale member information is fine for the Patroni workflow because it basically relies only on member names and tags.
`GET /cluster` is a different case: being exposed to the outside it might be used for monitoring purposes, and therefore we should show up-to-date member information.
When deciding whether a running replica is able to stream from the new primary or must be rewound, we should use the replayed location; therefore we now extract the received and replayed locations independently.
Reuse the part of the query that extracts the timeline and locations in the REST API.
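For illustration (not the exact query Patroni runs), fetching both locations independently on PostgreSQL 10+ looks like this:

```python
# the replayed location decides "stream vs. rewind"; the received location is
# still useful for reporting lag
query = ("SELECT pg_catalog.pg_last_wal_receive_lsn(), "
         "pg_catalog.pg_last_wal_replay_lsn()")
```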
The /history endpoint shows the content of the `history` key in DCS.
The /cluster endpoint shows all cluster members and some service info like pending and scheduled restarts or switchovers.
In addition to that, implement `patronictl history`.
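An illustrative call (hypothetical address; the field names in the comment are indicative only, not a guaranteed schema):

```python
import requests

cluster = requests.get('http://127.0.0.1:8008/cluster').json()
# e.g. {'members': [{'name': 'node1', 'role': 'leader', 'state': 'running', ...},
#                   {'name': 'node2', 'role': 'replica', 'lag': 0, ...}],
#       'scheduled_switchover': {...}}
```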
Close #586, close #675, close #1133
* make it possible to use client certificates with REST API
* define a separate PatroniRequest class which handles all communication
* refactor patronictl to use the new class
* make Ha use the new class instead of calling requests.get; the old call wasn't taking certificates and basic auth into account
Close #898
It is possible that some config files are not controlled by Patroni. When somebody does a reload via the REST API or by sending SIGHUP to the Patroni process, the usual expectation is that Postgres will also be reloaded, but that didn't happen when there were no changes in the postgresql section of the Patroni config.
For example, one might replace ssl_cert_file and ssl_key_file on the filesystem; starting from PostgreSQL 10 that requires only a reload, but Patroni wasn't doing it.
In addition to that, fix the handling of `wal_buffers`: its default value depends on `shared_buffers` and `wal_segment_size`, and therefore Patroni was exposing pending_restart even when the new value in the config was explicitly set to -1 (the default).
Close https://github.com/zalando/patroni/issues/1198
A few times we observed that the Patroni HA loop was blocked for a few minutes because it couldn't write logs to stderr. This is a very rare condition which we have so far hit only on K8s. This commit makes Patroni resilient to this kind of problem: all log messages are first written into an in-memory queue and later asynchronously flushed to stderr or to a file from a separate thread.
The maximum queue size is configurable and defaults to 1000. This should be enough to keep more than one hour of log messages with default settings when a Patroni cluster operates normally (without big issues).
If the queue reaches its maximum size, further messages are discarded until the queue size goes down again. The number of discarded messages is reported in the log later.
In addition to that, the number of not-yet-flushed and discarded messages (if there are any) is reported via the Patroni REST API as:
```json
{
  "logger_queue_size": X,
  "logger_records_lost": Y
}
```
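Patroni ships its own queue-based logger; for orientation only, the standard-library shape of the same idea looks roughly like this (not the actual implementation):

```python
import logging
import logging.handlers
import queue

log_queue = queue.Queue(maxsize=1000)            # bounded, like the default above
target = logging.StreamHandler()                 # the potentially blocking sink
logging.getLogger().addHandler(logging.handlers.QueueHandler(log_queue))

listener = logging.handlers.QueueListener(log_queue, target)
listener.start()                                 # flushes records from a separate thread
```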
* Convert postgresql.py into a package
* Factor out cancellable process into a separate class
* Factor out connection handler into a separate class
* Move postmaster into postgresql package
* Factor out pg_rewind into a separate class
* Factor out bootstrap into a separate class
* Factor out slots handler into a separate class
* Factor out postgresql config handler into a separate class
* Move callback_executor into postgresql package
This is just a careful refactoring, with no functional changes.
Using `ssl.wrap_socket` is deprecated and still allows soon-to-be-deprecated protocols like TLS 1.1.
Now `ssl.create_default_context()` is used to produce a secure `SSLContext` for wrapping the REST API server's socket.
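A minimal sketch of the replacement (the certificate paths are placeholders):

```python
import ssl

context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)  # server-side defaults
context.load_cert_chain(certfile='/path/to/server.crt', keyfile='/path/to/server.key')
# secure_sock = context.wrap_socket(plain_sock, server_side=True)
```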
The recently released psycopg2 was split into two different packages, psycopg2 and psycopg2-binary, which can be installed at the same time into the same place on the filesystem. In order to reduce the dependency-hell problem, we let the user choose how to install psycopg2. There are a few options available, and they are reflected in the documentation.
This PR also changes the following behavior:
* `pip install patroni` will fail if psycopg2 is not installed
* Patroni will check psycopg2 upon start and fail if it can't be found or is outdated.
Closes https://github.com/zalando/patroni/issues/1021
The latest timeline is calculated from the `/history` key in DCS. If there is no such key or it contains garbage, we consider the node healthy.
Closes https://github.com/zalando/patroni/issues/890
* Always run `pg_rewind` against the remote master
* Always use the remote master as the source when "recovering" stopped standby leader
* Use remote master as the source when "recovering" the node in the unhealthy cluster
* Use the local dbname as the fallback when doing `pg_rewind` from the remote master
* `no_replication_slot` is the allowed key in the `RemoteMember` object
* Make it possible to "bootstrap" the new `standby_cluster` with existing (and valid) data directory. There is one prerequisite though, there should be no `patroni.dynamic.json` file in it!
* Use `shutil.move` instead of `os.replace`, which is only available since Python 3.3
* Introduce standby-leader health-check and consul service
* Improve unit tests, some lines were not covered
* rename `assertEquals` -> `assertEqual`, due to deprecation warning
If Patroni gets partitioned, it starts receiving stale information from DCS.
We can't use this information to determine whether we hold the leader key.
Instead, we record in the Ha object whether the last attempt to acquire/update the leader lock actually succeeded, and report the node as the leader only in that case.
P.S. Despite responding with 200 on `GET /master`, Postgres was still running read-only.
It is possible to change a lot of parameters at runtime (including `restapi.listen`) by updating the Patroni config file and sending SIGHUP to the Patroni process.
If something was misconfigured, it was throwing a weird exception and breaking the `restapi` thread.
This PR makes the error message friendlier and avoids breaking the `restapi` thread.
This didn't directly affect either failover or switchover, but in some rare cases success was reported too early, when the former leader released the lock: `Failed over to "None" instead of "desired-node"`.
In addition to that, this commit improves logs and status messages by differentiating between failover and switchover.
It is very easy to get the current timeline on the master by executing:
```sql
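-- the first 8 hex characters of the WAL file name encode the timeline ID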
SELECT ('x' || SUBSTR(pg_walfile_name(pg_current_wal_lsn()), 1, 8))::bit(32)::int
```
Unfortunately, the same method doesn't work when Postgres is in recovery, therefore we use a replication connection for that on replicas. In order to avoid opening and closing a replication connection on every HA loop, we cache the result as long as its value matches the timeline of the master.
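A hedged illustration of reading the timeline over a replication connection (connection parameters are placeholders; this is not the exact code path in Patroni):

```python
import psycopg2
from psycopg2.extras import PhysicalReplicationConnection

conn = psycopg2.connect('host=127.0.0.1 port=5432 user=replicator dbname=postgres',
                        connection_factory=PhysicalReplicationConnection)
cur = conn.cursor()
cur.execute('IDENTIFY_SYSTEM')          # replication protocol command
systemid, timeline, xlogpos, dbname = cur.fetchone()
```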
Also, this PR introduces a new key in DCS: `/history`. It contains a JSON-serialized object with the timeline history in a format similar to the usual history files. The differences are:
* the second column is the absolute WAL position in bytes, instead of an LSN
* optionally there might be a fourth column: a timestamp (the mtime of the history file)
Make it possible to cancel a running task if you want to reinitialize a replica.
There are two possible ways to trigger it:
1. patronictl will ask whether you want to cancel the already running task if an attempt to trigger reinitialize has failed
2. by using the `--force` argument with `patronictl reinit`
In addition to that, implement additional checks around manual failover and recovery when synchronous_mode is enabled.
* Comparison must be case insensitive
* Do not send keepalives if watchdog is not active
* Avoid activating watchdog in a pause mode
* Set correct postgres state in pause mode
* Don't try to run queries from API if postgres is stopped