23 Commits

Author SHA1 Message Date
Alexander Kukushkin
ad3d953410 K8s: reset watchers if PATCH fails with 409 (#2283)
High CPU load on Etcd nodes and K8s API servers created a very strange situation. A few clusters were running without a leader, and the pod that was ahead of the others was failing to take the leader lock because updates were failing with HTTP response code `409` (`resource_version` mismatch).

Effectively that means that the TCP connections to K8s master nodes were alive (otherwise TCP keepalives would have resolved it), but no `UPDATE` events were arriving via these connections, resulting in a stale in-memory cache of the cluster.

The only good way to prevent this situation is to intercept 409 HTTP responses and terminate existing TCP connections used for watches.

Now a few words about the implementation. Unfortunately, watch threads spend most of their time waiting in the read() call and there is no good way to interrupt them. However, `socket.shutdown()` seems to do the job. We already used this trick in the Etcd3 implementation.
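
A minimal sketch of the trick, with illustrative function names rather than the actual Patroni internals: a watcher thread blocked in a read can be unblocked from another thread by shutting down the underlying socket.

```python
import socket


def force_reconnect(sock):
    """Called from the main thread after a 409 is intercepted."""
    try:
        # Shutting the socket down makes the blocking read in the watcher
        # thread return immediately, so the watcher can re-establish the
        # connection and re-LIST the objects instead of serving stale data.
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        pass  # the socket might already be closed


def watcher_loop(sock):
    while True:
        chunk = sock.recv(4096)  # watcher threads block here most of the time
        if not chunk:            # shutdown() or a remote close ends the loop
            break
        # ... feed chunk into the watch-event parser ...
```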

This approach helps to mitigate the issue of not having a leader, but replicas might still end up with stale cluster state cached and in the worst case will not stream from the leader. Non-streaming replicas are less dangerous; they can be covered by monitoring and partially mitigated by a correctly configured `archive_command` and `restore_command`.
2022-05-19 15:24:20 +02:00
Haitao Li
aa0cd48060 k8s: Support refreshing service account tokens (#2287)
Since Kubernetes v1.21, with the projected service account token feature, service account tokens expire after 1 hour. Kubernetes clients are expected to re-read the token file to refresh the token.

This patch re-reads the token file every minute for the in-cluster config.
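
A rough sketch of the approach, assuming the standard in-cluster token path (the class and constant names are illustrative):

```python
import time

TOKEN_PATH = '/var/run/secrets/kubernetes.io/serviceaccount/token'
REFRESH_INTERVAL = 60  # re-read the projected token roughly every minute


class TokenReader(object):

    def __init__(self):
        self._token = None
        self._last_read = 0

    @property
    def token(self):
        # The kubelet rotates the projected token well before it expires,
        # so re-reading the file once a minute keeps the cached value fresh.
        if time.time() - self._last_read > REFRESH_INTERVAL:
            with open(TOKEN_PATH) as f:
                self._token = f.read().strip()
            self._last_read = time.time()
        return self._token
```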

Fixes #2286

Signed-off-by: Haitao Li <hli@atlassian.com>
2022-05-05 17:35:06 +02:00
Alexander Kukushkin
c7173aadd7 Failover logical slots (#1820)
Effectively, this PR consists of a few changes:

1. The easy part:
  If permanent logical slots are defined in the global configuration, Patroni on the primary will not only create them, but will also periodically update the DCS with the current values of `confirmed_flush_lsn` for all these slots.
  In order to reduce the number of interactions with the DCS, a new `/status` key was introduced. It contains a JSON object with `optime` and `slots` keys. For backward compatibility, `/optime/leader` will still be updated if there are members running an old Patroni version in the cluster.

2. The tricky part:
  On replicas that are eligible for a failover, Patroni creates the logical replication slot by copying the slot file from the primary and restarting the replica. In order to copy the slot file, Patroni opens a connection to the primary with `rewind` or `superuser` credentials and calls the `pg_read_binary_file()` function.
  Once the logical slot exists on the replica, Patroni periodically calls the `pg_replication_slot_advance()` function to move the slot forward (see the sketch after this list).

3. Additional requirements:
  In order to ensure that the primary doesn't clean up tuples from pg_catalog that are required for logical decoding, Patroni enables `hot_standby_feedback` on replicas with logical slots and on cascading replicas that are used for streaming by replicas with logical slots.

4. When logical slots are copied from the primary to the replica there is a time window during which it might not be safe to use them after promotion. Right now there is no protection against promoting such a replica, but Patroni will show a warning with the names of the slots that might not be safe to use.
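
A rough sketch of the two operations from the tricky part above, expressed as plain SQL over a Python connection. Only `pg_read_binary_file()` and `pg_replication_slot_advance()` are the actual PostgreSQL functions involved; the helper names and cursor handling are illustrative.

```python
def copy_slot_file_from_primary(primary_cursor, slot_name):
    # Read the raw slot state file from the primary; this is why the
    # connection needs rewind or superuser credentials.
    primary_cursor.execute("SELECT pg_read_binary_file(%s)",
                           ('pg_replslot/' + slot_name + '/state',))
    # The returned bytes are written into pg_replslot/<slot_name>/state on
    # the replica before it is restarted.
    return primary_cursor.fetchone()[0]


def advance_slot_on_replica(replica_cursor, slot_name, confirmed_flush_lsn):
    # Once the slot exists on the replica, periodically move it forward to
    # the confirmed_flush_lsn published by the primary (PostgreSQL 11+).
    replica_cursor.execute("SELECT pg_replication_slot_advance(%s, %s)",
                           (slot_name, confirmed_flush_lsn))
```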

Compatibility.
The `pg_replication_slot_advance()` function is only available starting from PostgreSQL 11. For older Postgres versions Patroni will refuse to create the logical slot on the primary.

The old "permanent slots" feature, which creates logical slots right after promotion and before allowing connections, was removed.

Close: https://github.com/zalando/patroni/issues/1749
2021-03-25 16:18:23 +01:00
Alexander Kukushkin
9f252d246e Improve handling of concurrent update error (#1796)
The old strategy was to wait for 1 second and hope that we would get an update event from the WATCH connection.
Unfortunately, it didn't work well in practice. Instead, we now get the current value from the API by performing an explicit read request.
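
A condensed sketch of the new behavior; the exception class and the client method names are placeholders, only the 409 handling is the point.

```python
class K8sApiException(Exception):
    """Placeholder for the API client's exception type (carries .status)."""

    def __init__(self, status):
        super(K8sApiException, self).__init__(status)
        self.status = status


def update_leader(api, name, namespace, body):
    try:
        return api.patch_namespaced_endpoints(name, namespace, body)
    except K8sApiException as e:
        if e.status != 409:
            raise
    # Concurrent update: instead of sleeping and hoping for a WATCH event,
    # perform an explicit read to get the current resource_version and let
    # the next attempt work with up-to-date data.
    current = api.read_namespaced_endpoints(name, namespace)
    body['metadata']['resourceVersion'] = current['metadata']['resourceVersion']
    return api.patch_namespaced_endpoints(name, namespace, body)
```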

Close https://github.com/zalando/patroni/issues/1767
2021-02-11 15:55:05 +01:00
Alexander Kukushkin
23dcfaab49 Make it possible to bypass kubernetes service (#1614)
When running on K8s, Patroni communicates with the API via the `kubernetes` service, whose address is exposed via the
`KUBERNETES_SERVICE_HOST` environment variable. Like any other service, the `kubernetes` service is handled by `kube-proxy`, which, depending on its configuration, relies either on a userspace program or on `iptables` for traffic routing.

During a K8s upgrade, when master nodes are replaced, it is possible that `kube-proxy` doesn't update the service configuration in time; as a result, Patroni fails to update the leader lock and demotes Postgres.

In order to improve the user experience and get more control over the problem, we make it possible to bypass the `kubernetes` service and connect directly to the API nodes.
The strategy is very simple:
1. Resolve the list of API node IPs from the `kubernetes` endpoint on every iteration of the HA loop.
2. Stick to one of these IPs for API requests.
3. Switch to a different IP if the currently used IP is no longer in the list.
4. If a request fails, switch to another IP and retry.

Such a strategy is already used for Etcd and proven to work quite well.
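
A rough illustration of the strategy; the class is a sketch, not the actual implementation.

```python
import random


class ApiNodes(object):
    """Keeps a sticky API server IP, refreshed from the kubernetes endpoint."""

    def __init__(self):
        self._ips = []
        self._current = None

    def refresh(self, ips):
        # Called on every HA loop with the IPs resolved from the
        # default/endpoints/kubernetes object.
        self._ips = list(ips)
        if self._current not in self._ips:
            # The node we were talking to is gone -- pick another one.
            self._current = random.choice(self._ips) if self._ips else None

    def fail_over(self):
        # Called when a request to the current IP failed.
        candidates = [ip for ip in self._ips if ip != self._current]
        if candidates:
            self._current = random.choice(candidates)
        return self._current

    @property
    def current(self):
        return self._current
```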

In order to enable the feature, you need to set either `kubernetes.bypass_api_service` in the Patroni configuration file or the `PATRONI_KUBERNETES_BYPASS_API_SERVICE` environment variable to `true`.

If for some reason `GET /default/endpoints/kubernetes` isn't allowed Patroni will disable the feature.
2020-08-14 12:39:47 +02:00
Alexander Kukushkin
a68692a3e4 Get rid of kubernetes python module (#1586)
The official Python kubernetes client contains a lot of auto-generated code and is therefore very heavy, but we need only a small fraction of it.
A naive implementation that covers all the API methods we use takes about 250 LoC, and about half of it is responsible for handling configuration files.
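
To give an idea of how small the in-cluster part can be, here is a hedged sketch of a bare-bones request helper built on `urllib3`; the service account paths and environment variables follow the standard in-cluster conventions, everything else is illustrative.

```python
import json
import os

import urllib3

SA_DIR = '/var/run/secrets/kubernetes.io/serviceaccount'


def in_cluster_request(method, path, body=None):
    # Everything a minimal in-cluster call needs -- CA bundle, token and the
    # API server address -- comes from the mounted service account directory
    # and the standard environment variables.
    with open(os.path.join(SA_DIR, 'token')) as f:
        token = f.read().strip()
    http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED',
                               ca_certs=os.path.join(SA_DIR, 'ca.crt'))
    url = 'https://{0}:{1}{2}'.format(os.environ['KUBERNETES_SERVICE_HOST'],
                                      os.environ.get('KUBERNETES_SERVICE_PORT', '443'),
                                      path)
    response = http.request(method, url,
                            body=json.dumps(body) if body is not None else None,
                            headers={'Authorization': 'Bearer ' + token,
                                     'Content-Type': 'application/json'})
    return json.loads(response.data.decode('utf-8'))
```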

Disadvantage: if somebody was using `patronictl` outside of the pod (on their own machine), it might not work anymore (depending on the environment).
2020-07-17 08:31:58 +02:00
Alexander Kukushkin
e00acdf6df Fix possible race conditions in update_leader (#1596)
1. Between the get_cluster() and update_leader() calls the K8s leader object might be updated from outside, and therefore the resource version will not match (error code 409). Since we are watching for all changes, the ObjectCache will likely have the most up-to-date version, and we take advantage of that. There is still a chance of hitting a race condition, but it is smaller than before. Other DCS implementations are free of this issue: Etcd updates are based on value comparison, while Zookeeper and Consul rely on a session mechanism.
2. If the update still fails, recheck the resource version of the leader object, verify that the current node is still the leader there, and repeat the call.

P.S. The leader race still relies on the version of the leader object as it was during the get_cluster() call.

In addition to that, the handling of K8s API errors was fixed: we should retry on 500, not on 502.
Close https://github.com/zalando/patroni/issues/1589
2020-06-22 16:07:52 +02:00
Alexander Kukushkin
ee4bf79c11 Populate references and nodename in subsets addresses (#1591)
It makes the subsets look exactly as if they were populated by a service with a label selector, and it should help with https://github.com/zalando/postgres-operator/issues/340#issuecomment-587001109
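
For reference, an address populated this way would look roughly like the following (all values are made up):

```python
# Illustrative shape of a subset address after this change; the real values
# come from the pod that currently holds the role.
address = {
    'ip': '10.2.3.4',
    'nodeName': 'node-1',
    'targetRef': {
        'kind': 'Pod',
        'name': 'demo-cluster-0',
        'namespace': 'default',
        'resourceVersion': '123456',
        'uid': '7a3f5c3e-0000-0000-0000-000000000000',
    },
}
```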

Unit tests were refactored to minimize the number of mocks.
2020-06-16 12:56:20 +02:00
Alexander Kukushkin
7cf0b753ab Update optime/leader with checkpoint location after clean shut down (#1527)
Potentially this information could be used in order to make sure that there is no data loss on switchover.
2020-05-15 16:13:16 +02:00
Alexander Kukushkin
fe23d1f2d0 Release 1.6.5 (#1503)
* bump version
* update release notes
* implement missing unit-tests and format code.
2020-04-23 16:02:01 +02:00
Alexander Kukushkin
27cda08ece Improve unit-tests (#1479)
* tests were failing on windows and macos
* improve coverage
2020-04-09 10:34:35 +02:00
Alexander Kukushkin
ab38ab2e97 Apply 1 second backoff if LIST failed (#1424)
It is mostly necessary to avoid flooding the logs, but it also helps to prevent starvation of the main thread.
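
A minimal sketch of the backoff, with `list_objects` standing in for the actual LIST call:

```python
import logging
import time

logger = logging.getLogger(__name__)


def safe_list(list_objects):
    # If the LIST call fails, log the error once and sleep for a second, so
    # a misbehaving API neither floods the logs nor starves the main thread
    # with back-to-back retries.
    try:
        return list_objects()
    except Exception:
        logger.exception('LIST failed')
        time.sleep(1)
```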
2020-03-10 12:07:26 +01:00
Alexander Kukushkin
6aa3f809d4 Configure keepalive for connections to K8s API (#1366)
If we got nothing from the socket after TTL seconds, the connection should be considered dead.
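
A sketch of the socket options involved (the constants below are Linux-specific; the actual values are derived from the TTL):

```python
import socket


def configure_keepalive(sock, idle, interval, cnt):
    # Enable TCP keepalive so that a connection which stays silent for
    # roughly idle + interval * cnt seconds is detected as dead and closed.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, cnt)
```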
2020-01-27 09:25:08 +01:00
Alexander Kukushkin
183adb7848 Housekeeping (#1284)
* Implement proper tests for `multiprocessing.set_start_method()`
* Exclude some watchdog code from coverage (it is used only for behave tests)
* Properly use os.path.join for Windows compatibility
* Import DCS modules in `features/environment.py` on demand, which allows running behave tests against the chosen DCS without installing all dependencies
* Remove some unused behave code
* Fix some minor issues in the dcs.kubernetes module
2019-11-21 13:27:55 +01:00
Alexander Kukushkin
66d77697ae Use LIST + WATCH when working with K8s API (#1276)
There is an opinion that LIST requests with a labelSelector against the K8s API are expensive, and Patroni was doing two such requests per HA loop (LIST pods and LIST endpoints/configmaps).
To detect object changes efficiently, we switch to the LIST+WATCH approach.
The initial LIST request populates the ObjectCache, and events from the WATCH request keep it up to date.

In addition to that, the ObjectCache is updated after performing UPDATE operations on K8s objects. To avoid race conditions, all operations on the ObjectCache compare the resource_version of the old and the new objects and are rejected if the new resource_version value is smaller than the old one.
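
A simplified sketch of that update rule (the real ObjectCache also handles deletions and per-kind bookkeeping):

```python
import threading


class ObjectCache(object):

    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()

    def set(self, name, new_object):
        # Both WATCH events and our own successful UPDATEs go through this
        # method; an object is rejected if its resource_version is smaller
        # than the one already cached, so late or duplicate events are dropped.
        with self._lock:
            old_object = self._objects.get(name)
            if old_object and int(new_object['metadata']['resourceVersion']) \
                    < int(old_object['metadata']['resourceVersion']):
                return False
            self._objects[name] = new_object
            return True
```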

The disadvantage of such an approach is that it will require keeping three connections to the K8s API from each Patroni Pod (previously it was two).

Yesterday I deployed this feature branch on our biggest K8s cluster, with ~300 Patroni pods.
The CPU utilization on K8s master nodes immediately dropped from ~20% to ~10% (a twofold reduction), and the incoming traffic on master nodes dropped ~7-8 times!

Last but not least, we see more or less the same impact on the etcd cluster behind the K8s master nodes: CPU utilization dropped nearly twofold and outgoing traffic ~7-8 times.
2019-11-14 14:54:57 +01:00
Alexander Kukushkin
21ed8e2d09 A few small fixes (#1221)
* fix some warnings when running unit-tests
* allow python-kubernetes up to 10.0.1
* python-consul>=0.7.1 is required due to #802
2019-10-11 10:15:22 +02:00
Alexander Kukushkin
86ee22efab Switch to a streaming watcher (#1189)
Watch requests to the K8s API either stream the data or close the connection on timeout. In either case it requires a second open connection, but opening a new connection every 10 seconds is more expensive for both Patroni and the K8s API.

Switching to the streaming model also brings other benefits: we can watch not only the leader object but also the config, and wake up the Patroni main thread if the config has changed.
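
A rough sketch of consuming such a streaming watch with `urllib3` (the URL handling is illustrative; the K8s API streams one JSON event per line when `watch=true` is passed):

```python
import json

import urllib3


def watch_events(http, url, token, resource_version):
    # preload_content=False keeps the connection open so the response can be
    # consumed as a stream instead of being read in one go.
    response = http.request('GET',
                            url + '?watch=true&resourceVersion=' + resource_version,
                            headers={'Authorization': 'Bearer ' + token},
                            preload_content=False)
    buffer = b''
    for chunk in response.stream(4096):
        buffer += chunk
        while b'\n' in buffer:
            line, buffer = buffer.split(b'\n', 1)
            if line:
                # e.g. {'type': 'MODIFIED', 'object': {...}}
                yield json.loads(line)
```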
2019-10-07 15:16:35 +02:00
Alexander Kukushkin
680444ae13 Reduce lock time taken by dcs.get_cluster() (#989)
`dcs.cluster` and `dcs.get_cluster()` were using the same lock, and therefore when the get_cluster() call was slow due to a slow DCS, it also affected the `dcs.cluster` call, which in turn made health-check requests slow.
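
A sketch of the resulting pattern with illustrative names: the slow DCS round-trip runs outside the lock, which is only taken to swap or read the cached value.

```python
import threading


class SomeDCS(object):

    def __init__(self):
        self._cluster = None
        self._cluster_lock = threading.Lock()

    def _load_cluster(self):
        raise NotImplementedError  # talks to the DCS; potentially slow

    def get_cluster(self):
        # The slow round-trip happens without holding the lock; only the
        # assignment of the cached value is protected.
        cluster = self._load_cluster()
        with self._cluster_lock:
            self._cluster = cluster
        return cluster

    @property
    def cluster(self):
        # Health checks read the cached value and no longer wait for a slow
        # _load_cluster() call to finish.
        with self._cluster_lock:
            return self._cluster
```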
2019-03-12 22:37:11 +01:00
Alexander Kukushkin
0c516de147 Create headless service associated with $SCOPE-config endpoint (#958)
If there is no service defined, K8s assumes that the endpoint is orphaned and removes it.
Patroni tries to create the service only when `use_endpoints` is enabled, in the following cases:
1. Upon start
2. When it tries to (re-)create the config endpoint

If for some reason creation of the service fails, Patroni will retry it on every cycle of the HA loop. Usually it fails due to a lack of permissions; if you don't want to give such permissions to the service account used by Patroni, you can create the service explicitly in the deployment manifest.
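
The body of such a headless service is tiny. A hedged example of what creating it explicitly might look like, expressed as the manifest-equivalent Python dict (the scope name `demo` is made up):

```python
# Hypothetical body of the headless service that keeps the demo-config
# endpoint from being removed as orphaned ('demo' is a made-up scope name).
config_service = {
    'apiVersion': 'v1',
    'kind': 'Service',
    'metadata': {'name': 'demo-config'},
    'spec': {
        'clusterIP': 'None',  # headless: no virtual IP and no proxying
    },
}
```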
2019-02-15 13:35:04 +01:00
Alexander Kukushkin
2efd97baab Permanent replication slots (#819)
Permanent replication slots are preserved on failover/switchover; that is, Patroni on the new primary will create the configured replication slots right after promotion.

Slots can be configured with the help of `patronictl edit-config`.
The initial configuration can also be done in `bootstrap.dcs`:

```yaml
slots:
  permanent_physical_1:
    type: physical
  permanent_logical_1:
    type: logical
    database: foo
    plugin: pgoutput
```

It is the responsibility of the operator to make sure that there are no clashes in names between replication slots automatically created by Patroni for members and permanent replication slots.
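
A sketch of what slot creation after promotion boils down to, using the standard PostgreSQL functions (the helper and cursor handling are illustrative):

```python
def create_permanent_slots(cursor, slots):
    # 'slots' mirrors the structure from the configuration example above.
    for name, value in slots.items():
        if value.get('type') == 'logical':
            # Logical slots need a decoding plugin and must be created while
            # connected to value['database'].
            cursor.execute("SELECT pg_create_logical_replication_slot(%s, %s)",
                           (name, value['plugin']))
        else:
            cursor.execute("SELECT pg_create_physical_replication_slot(%s)",
                           (name,))
```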

Closes https://github.com/zalando/patroni/issues/656
2018-10-31 11:37:42 +01:00
Alexander Kukushkin
a0c8491abb Don't swallow silently all errors from k8s API (#611)
Output the exception trace to the logs when the HTTP status code is 403: something is wrong with permissions.

When the HTTP status code is 409, the error can be ignored, because the object was probably created or updated by another process.

For all other HTTP status codes it will also produce stack traces.
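
A condensed sketch of that policy; the exception attributes and logger are placeholders.

```python
import logging

logger = logging.getLogger(__name__)


def handle_api_error(error, action):
    # 'error' is assumed to carry the HTTP status of the failed K8s API call.
    if error.status == 403:
        logger.exception('Permission denied during %s', action)
    elif error.status == 409:
        # Conflict: the object was probably created or updated by another
        # process, so it is safe to ignore and reconcile on the next HA loop.
        logger.debug('Concurrent modification during %s: %r', action, error)
    else:
        logger.exception('Unexpected error during %s', action)
```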

I hope it will help to debug issues similar to the https://github.com/zalando/patroni/issues/606
2018-01-26 09:57:17 +01:00
Alexander Kukushkin
03c2a85d23 Expose current timeline in DCS and via API (#591)
It is very easy to get the current timeline on the master by executing
```sql
SELECT ('x' || SUBSTR(pg_walfile_name(pg_current_wal_lsn()), 1, 8))::bit(32)::int
```

Unfortunately, the same method doesn't work when Postgres is in recovery. Therefore we use a replication connection for that on the replicas. In order to avoid opening and closing a replication connection on every HA loop, we cache the result if its value matches the timeline of the master.

Also this PR introduces a new key in DCS: `/history`. It will contain a JSON-serialized object with the timeline history in a format similar to the usual history files. The differences are:
* The second column is the absolute WAL position in bytes, instead of an LSN
* Optionally there might be a fourth column: a timestamp (the mtime of the history file)
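
For illustration, a `/history` value with the optional fourth column might deserialize to something like this (the values are made up):

```python
# Made-up example of a deserialized /history value: timeline, absolute WAL
# position in bytes, reason, and the optional mtime of the history file.
history = [
    [1, 25623960, 'no recovery target specified', '2018-01-05T15:25:56+01:00'],
    [2, 51247920, 'no recovery target specified', '2018-01-07T09:12:34+01:00'],
]
```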
2018-01-05 15:25:56 +01:00
Alexander Kukushkin
4328c15010 Make Patroni Kubernetes native (#500)
* Use ConfigMaps or Endpoints for leader elections and to keep cluster state
* Label pods with a postgres role
* Change the behavior of pip install. From now on it will not install all dependencies; you have to explicitly specify the DCS you want to use Patroni with: `pip install patroni[etcd,zookeeper,kubernetes]`
2017-12-08 16:55:00 +01:00