Citus integration (#2504)

The Citus cluster (coordinator and workers) is stored in the DCS as a fleet of logically grouped Patroni clusters:
```
/service/batman/
/service/batman/0/
/service/batman/0/initialize
/service/batman/0/leader
/service/batman/0/members/
/service/batman/0/members/m1
/service/batman/0/members/m2
/service/batman/1/
/service/batman/1/initialize
/service/batman/1/leader
/service/batman/1/members/
/service/batman/1/members/m1
/service/batman/1/members/m2
...
```

Where 0 is the Citus group of the coordinator and 1, 2, etc. are worker groups.

Such a hierarchy allows reading the entire Citus cluster with a single call to the DCS (except for ZooKeeper).

The get_cluster() method reads the entire Citus cluster on the coordinator, because it needs to discover workers. On a worker cluster it reads only the subtree of its own group.

Besides that, we introduce a new method, get_citus_coordinator(). It is used only by worker clusters.

Since there are no hierarchical structures on K8s, the Citus group is included in the names of all objects that Patroni creates, e.g.
```
batman-0-leader  # the leader config map for the coordinator
batman-0-config  # the config map holding initialize, config, and history "keys"
...
batman-1-leader  # the leader config map for worker group 1
batman-1-config
...
```

Citus integration is enabled via patroni.yaml:
```yaml
citus:
  database: citus
  group: 0  # 0 is for coordinator, 1, 2, etc are for workers
```

If enabled, Patroni creates the database and the citus extension in it, and inserts into `pg_dist_authinfo` the information required for Citus nodes to communicate with each other, i.e. the superuser 'password', 'sslcert', and 'sslkey' if they are defined in the Patroni configuration file.
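
The effect is roughly equivalent to the following SQL (a sketch only: the database name comes from the `citus.database` setting, while the nodeid and the credential values shown here are placeholders, not the exact values Patroni uses):

```sql
CREATE DATABASE citus;
\c citus
CREATE EXTENSION citus;
-- Make superuser credentials available for node-to-node connections.
-- authinfo is a space-separated list of libpq key=value pairs.
INSERT INTO pg_dist_authinfo (nodeid, rolename, authinfo)
VALUES (0, 'postgres', 'password=secret sslcert=/path/to/server.crt sslkey=/path/to/server.key');
```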

When a new Citus coordinator/worker is bootstrapped, Patroni adds `synchronous_mode: on` to the `bootstrap.dcs` section.

Besides that, Patroni takes over management of some Postgres GUCs:
- `shared_preload_libraries` - Patroni ensures that "citus" is added in the first position
- `max_prepared_transactions` - if not set or set to 0, Patroni changes the value to `max_connections*2`
- `wal_level` - automatically set to `logical`. It is used by Citus to move/split shards. Under the hood Citus creates/removes replication slots, and Patroni automatically adds them to the `ignore_slots` configuration to avoid accidental removal.
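
These settings can be checked from psql on any Citus node; a small illustrative snippet (the output values are examples, not fixed):

```sql
SHOW shared_preload_libraries;   -- e.g. 'citus,pg_stat_statements', citus first
SHOW wal_level;                  -- logical
SHOW max_prepared_transactions;
-- the value Patroni derives when max_prepared_transactions is unset or 0
SELECT current_setting('max_connections')::int * 2 AS derived_max_prepared_transactions;
```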

The coordinator primary actively discovers worker primary nodes and registers/updates them in the `pg_dist_node` table using the citus_add_node() and citus_update_node() functions.
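
In SQL terms this registration looks roughly as follows (host names, ports, and the group number are illustrative):

```sql
-- register a newly discovered worker primary; the third argument selects the Citus group
SELECT citus_add_node('10.244.0.8', 5432, 1);

-- after a failover in group 1, repoint its entry at the new primary
SELECT citus_update_node(nodeid, '10.244.0.11', 5432)
  FROM pg_dist_node
 WHERE groupid = 1 AND noderole = 'primary';
```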

Patroni running on the coordinator provides a new REST API endpoint: `POST /citus`. It is used by workers to facilitate controlled switchovers and restarts of worker primaries.
When the worker primary needs to shut down Postgres because of a restart or switchover, it calls the `POST /citus` endpoint on the coordinator, and Patroni on the coordinator starts a transaction and calls `citus_update_node(nodeid, 'host-demoted', port)` in order to pause client connections that work with the given worker.
Once the new leader is elected or Postgres is started back up, another call to the `POST /citus` endpoint is made, which does another `citus_update_node()` call with the actual hostname and port and commits the transaction. After the transaction is committed, the coordinator re-establishes connections to the worker node and client connections are unblocked.
If clients don't run long transactions, the operation finishes without client-visible errors, only with a short latency spike.
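
Conceptually, the coordinator runs a sequence like the one below (the placeholder host name and the new primary's host/port are illustrative):

```sql
BEGIN;
-- first POST /citus call: point the entry at a non-resolvable host, so that new
-- connections to this worker's shards block instead of failing
SELECT citus_update_node(nodeid, 'citusdemo-1-0-demoted', 5432)
  FROM pg_dist_node
 WHERE groupid = 1 AND noderole = 'primary';
-- ... Postgres on the worker restarts, or a new leader is promoted ...
-- second POST /citus call: write the actual host and port of the (new) primary
SELECT citus_update_node(nodeid, '10.244.0.11', 5432)
  FROM pg_dist_node
 WHERE groupid = 1 AND noderole = 'primary';
COMMIT;  -- the coordinator re-establishes connections and clients unblock
```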

All operations on `pg_dist_node` are serialized by Patroni on the coordinator. This gives more control and makes it possible to ROLLBACK a transaction in progress if its lifetime exceeds a certain threshold while other worker nodes need to be updated.

Kubernetes deployment examples

Below you will find examples of Patroni deployments using kind.

Patroni on K8s

The Patroni cluster deployment with a StatefulSet consisting of three Pods.

Example session:

$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.25.3) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Thanks for using kind! 😊

$ docker build -t patroni .
Sending build context to Docker daemon  138.8kB
Step 1/9 : FROM postgres:15
...
Successfully built e9bfe69c5d2b
Successfully tagged patroni:latest

$ kind load docker-image patroni
Image: "" with ID "sha256:e9bfe69c5d2b319dec0cf564fb895484537664775e18f37f9b707914cc5537e6" not yet present on node "kind-control-plane", loading...

$ kubectl apply -f patroni_k8s.yaml
service/patronidemo-config created
statefulset.apps/patronidemo created
endpoints/patronidemo created
service/patronidemo created
service/patronidemo-repl created
secret/patronidemo created
serviceaccount/patronidemo created
role.rbac.authorization.k8s.io/patronidemo created
rolebinding.rbac.authorization.k8s.io/patronidemo created
clusterrole.rbac.authorization.k8s.io/patroni-k8s-ep-access created
clusterrolebinding.rbac.authorization.k8s.io/patroni-k8s-ep-access created

$ kubectl get pods -L role
NAME            READY   STATUS    RESTARTS   AGE   ROLE
patronidemo-0   1/1     Running   0          34s   master
patronidemo-1   1/1     Running   0          30s   replica
patronidemo-2   1/1     Running   0          26s   replica

$ kubectl exec -ti patronidemo-0 -- bash
postgres@patronidemo-0:~$ patronictl list
+ Cluster: patronidemo (7186662553319358497) ----+----+-----------+
| Member        | Host       | Role    | State   | TL | Lag in MB |
+---------------+------------+---------+---------+----+-----------+
| patronidemo-0 | 10.244.0.5 | Leader  | running |  1 |           |
| patronidemo-1 | 10.244.0.6 | Replica | running |  1 |         0 |
| patronidemo-2 | 10.244.0.7 | Replica | running |  1 |         0 |
+---------------+------------+---------+---------+----+-----------+

Citus on K8s

The Citus cluster with StatefulSets: one coordinator with three Pods and two workers with two Pods each.

Example session:

$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.25.3) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Thanks for using kind! 😊

demo@localhost:~/git/patroni/kubernetes$ docker build -f Dockerfile.citus -t patroni-citus-k8s .
Sending build context to Docker daemon  138.8kB
Step 1/11 : FROM postgres:15
...
Successfully built 8cd73e325028
Successfully tagged patroni-citus-k8s:latest

$ kind load docker-image patroni-citus-k8s
Image: "" with ID "sha256:8cd73e325028d7147672494965e53453f5540400928caac0305015eb2c7027c7" not yet present on node "kind-control-plane", loading...

$ kubectl apply -f citus_k8s.yaml
service/citusdemo-0-config created
service/citusdemo-1-config created
service/citusdemo-2-config created
statefulset.apps/citusdemo-0 created
statefulset.apps/citusdemo-1 created
statefulset.apps/citusdemo-2 created
endpoints/citusdemo-0 created
service/citusdemo-0 created
endpoints/citusdemo-1 created
service/citusdemo-1 created
endpoints/citusdemo-2 created
service/citusdemo-2 created
service/citusdemo-workers created
secret/citusdemo created
serviceaccount/citusdemo created
role.rbac.authorization.k8s.io/citusdemo created
rolebinding.rbac.authorization.k8s.io/citusdemo created
clusterrole.rbac.authorization.k8s.io/patroni-k8s-ep-access created
clusterrolebinding.rbac.authorization.k8s.io/patroni-k8s-ep-access created

$ kubectl get sts
NAME          READY   AGE
citusdemo-0   1/3     6s  # coordinator (group=0)
citusdemo-1   1/2     6s  # worker (group=1)
citusdemo-2   1/2     6s  # worker (group=2)

$ kubectl get pods -l cluster-name=citusdemo -L role
NAME            READY   STATUS    RESTARTS   AGE    ROLE
citusdemo-0-0   1/1     Running   0          105s   master
citusdemo-0-1   1/1     Running   0          101s   replica
citusdemo-0-2   1/1     Running   0          96s    replica
citusdemo-1-0   1/1     Running   0          105s   master
citusdemo-1-1   1/1     Running   0          101s   replica
citusdemo-2-0   1/1     Running   0          105s   master
citusdemo-2-1   1/1     Running   0          101s   replica

$ kubectl exec -ti citusdemo-0-0 -- bash
postgres@citusdemo-0-0:~$ patronictl list
+ Citus cluster: citusdemo -----------+--------------+---------+----+-----------+
| Group | Member        | Host        | Role         | State   | TL | Lag in MB |
+-------+---------------+-------------+--------------+---------+----+-----------+
|     0 | citusdemo-0-0 | 10.244.0.10 | Leader       | running |  1 |           |
|     0 | citusdemo-0-1 | 10.244.0.12 | Replica      | running |  1 |         0 |
|     0 | citusdemo-0-2 | 10.244.0.14 | Sync Standby | running |  1 |         0 |
|     1 | citusdemo-1-0 | 10.244.0.8  | Leader       | running |  1 |           |
|     1 | citusdemo-1-1 | 10.244.0.11 | Sync Standby | running |  1 |         0 |
|     2 | citusdemo-2-0 | 10.244.0.9  | Leader       | running |  1 |           |
|     2 | citusdemo-2-1 | 10.244.0.13 | Sync Standby | running |  1 |         0 |
+-------+---------------+-------------+--------------+---------+----+-----------+

postgres@citusdemo-0-0:~$ psql citus
psql (15.1 (Debian 15.1-1.pgdg110+1))
Type "help" for help.

citus=# table pg_dist_node;
 nodeid | groupid |  nodename   | nodeport | noderack | hasmetadata | isactive | noderole | nodecluster | metadatasynced | shouldhaveshards
--------+---------+-------------+----------+----------+-------------+----------+----------+-------------+----------------+------------------
      1 |       0 | 10.244.0.10 |     5432 | default  | t           | t        | primary  | default     | t              | f
      2 |       1 | 10.244.0.8  |     5432 | default  | t           | t        | primary  | default     | t              | t
      3 |       2 | 10.244.0.9  |     5432 | default  | t           | t        | primary  | default     | t              | t
(3 rows)
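
As a final check (not part of the session above, just an illustrative follow-up), a distributed table can be created on the coordinator to confirm that shards are placed on both worker groups:

```sql
CREATE TABLE events (id bigserial, payload jsonb);
SELECT create_distributed_table('events', 'id');
-- shards should be spread across the two worker primaries only
SELECT nodename, count(*) AS shards FROM citus_shards GROUP BY nodename;
```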