26 Commits

Author SHA1 Message Date
Alexander Kukushkin
b901e62ad0 Enhanced checks of replica logical slots safety (#2285)
The logical slot on a replica is safe to use when the physical replication
slot on the primary:
1. has a nonzero/non-null `catalog_xmin`
2. has a `catalog_xmin` that is not newer (greater) than the `catalog_xmin` of any slot on the standby
3. has a `catalog_xmin` that is known to have overtaken the `catalog_xmin` of the logical slots on the primary observed during `1`

If `1` doesn't hold, Patroni runs an additional check of whether `hot_standby_feedback` is actually in effect and shows a warning in case it is not.
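
A minimal sketch of checks `1` and `2`, assuming `psycopg2` and hypothetical connection strings; it only illustrates the queries involved, and the naive integer comparison ignores xid wraparound, which real code must handle:

```python
# Illustrative sketch (not Patroni's implementation): check that the physical
# slot on the primary has a catalog_xmin that is set and not newer than the
# catalog_xmin of any logical slot on the standby.
import psycopg2

def physical_slot_looks_safe(primary_dsn, standby_dsn, slot_name):
    with psycopg2.connect(primary_dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT catalog_xmin FROM pg_replication_slots"
                    " WHERE slot_name = %s", (slot_name,))
        row = cur.fetchone()
    xmin = row and row[0]
    if not xmin or int(xmin) == 0:  # check 1: nonzero/non-null catalog_xmin
        return False
    with psycopg2.connect(standby_dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT min(catalog_xmin::text::bigint)"
                    " FROM pg_replication_slots WHERE slot_type = 'logical'")
        standby_xmin = cur.fetchone()[0]
    # check 2: the primary value must not be newer (greater); a naive
    # comparison that does not account for transaction-id wraparound
    return standby_xmin is None or int(xmin) <= standby_xmin
```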
2022-05-10 12:24:47 +02:00
Alexander Kukushkin
5f6197aaad Don't copy logical slot if there is mismatch with the config (#2274)
A couple of times we have seen in the wild that the database for the permanent logical slots was changed in the Patroni config.

It resulted in the following situation.
On the primary:
1. The slot must be dropped before creating it in a different DB.
2. Patroni fails to drop it because the slot is in use.

On the replica:
1. Patroni notices that the slot exists in the wrong DB and successfully drops it.
2. Patroni copies the existing slot from the primary by its name, with a Postgres restart.

And the loop repeats while the "wrong" slot exists on the primary.

Basically, replicas are continuously restarting, which badly affects availability.

In order to solve the problem, we will perform additional checks while copying replication slot files from the primary and discard them if `slot_type`, `database`, or `plugin` don't match our expectations.
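
As an illustration of the check (with hypothetical dict shapes, not Patroni's actual data structures):

```python
# Illustrative sketch: a copied slot is installed only when slot_type,
# database and plugin all match what the configuration expects.
def slot_matches_config(expected, actual):
    """Both arguments: dicts with 'type', 'database' and 'plugin' keys."""
    return all(expected.get(key) == actual.get(key)
               for key in ('type', 'database', 'plugin'))

# The slot was created in "db1", but the config was changed to "db2":
# the copied slot file must be discarded instead of triggering a restart.
expected = {'type': 'logical', 'database': 'db2', 'plugin': 'pgoutput'}
actual = {'type': 'logical', 'database': 'db1', 'plugin': 'pgoutput'}
assert not slot_matches_config(expected, actual)
```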
2022-04-14 12:10:37 +02:00
Alexander Kukushkin
333d41d9f0 Release 2.1.3 (#2219)
* Implement missing unit-tests
* Bump version
* Update release notes
2022-02-18 14:16:15 +01:00
Alexander Kukushkin
fce889cd04 Compatibility with psycopg 3.0 (#2088)
`psycopg2` is preferred by default; `psycopg>=3.0` will be used only if `psycopg2` is not available or its version is too old.
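
A simplified sketch of the preference order (the real driver-selection logic also checks versions and adapts to the API differences):

```python
# Illustrative sketch: prefer psycopg2, fall back to psycopg>=3.0.
try:
    import psycopg2 as driver
except ImportError:
    import psycopg as driver  # psycopg 3.x

# Both drivers expose a connect() accepting a libpq connection string.
conn = driver.connect('dbname=postgres')
```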
2021-11-19 14:32:54 +01:00
Alexander Kukushkin
c7173aadd7 Failover logical slots (#1820)
Effectively, this PR consists of a few changes:

1. The easy part:
  If permanent logical slots are defined in the global configuration, Patroni on the primary will not only create them but also periodically update DCS with the current values of `confirmed_flush_lsn` for all these slots.
  In order to reduce the number of interactions with DCS, the new `/status` key was introduced. It contains a JSON object with `optime` and `slots` keys. For backward compatibility, `/optime/leader` will still be updated if there are members running an old Patroni version in the cluster.

2. The tricky part:
  On replicas that are eligible for a failover, Patroni creates the logical replication slot by copying the slot file from the primary and restarting the replica. In order to copy the slot file, Patroni opens a connection to the primary with `rewind` or `superuser` credentials and calls the `pg_read_binary_file()` function.
  When the logical slot already exists on the replica, Patroni periodically calls the `pg_replication_slot_advance()` function, which moves the slot forward (see the sketch after this list).

3. Additional requirements:
  In order to ensure that the primary doesn't clean up tuples from pg_catalog that are required for logical decoding, Patroni enables `hot_standby_feedback` on replicas with logical slots, and on cascading replicas if they are used for streaming by replicas with logical slots.

4. When logical slots are copied from the primary to the replica, there is a timeframe during which it might not be safe to use them after promotion. Right now there is no protection against promoting such a replica, but Patroni will show a warning with the names of the slots that might not be safe to use.
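
A sketch of the periodic advance from the tricky part above, assuming `psycopg2`, a hypothetical DSN, and a target LSN taken from the `/status` key in DCS:

```python
# Illustrative sketch: move an already existing logical slot on a replica
# forward to the confirmed_flush_lsn published by the primary in DCS.
import psycopg2

def advance_slot(replica_dsn, slot_name, confirmed_flush_lsn):
    conn = psycopg2.connect(replica_dsn)
    conn.autocommit = True
    with conn.cursor() as cur:
        # pg_replication_slot_advance() exists starting from PostgreSQL 11
        cur.execute('SELECT pg_replication_slot_advance(%s, %s)',
                    (slot_name, confirmed_flush_lsn))
    conn.close()

advance_slot('host=replica dbname=postgres', 'my_slot', '0/3000060')
```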

Compatibility.
The `pg_replication_slot_advance()` function is only available starting from PostgreSQL 11. For older Postgres versions Patroni will refuse to create the logical slot on the primary.

The old "permanent slots" feature, which creates logical slots right after promotion and before allowing connections, was removed.

Close: https://github.com/zalando/patroni/issues/1749
2021-03-25 16:18:23 +01:00
Mark Mercado
09f2f579d7 Quick attempt at Prometheus (#1848)
Close https://github.com/zalando/patroni/issues/318
2021-03-04 12:37:29 +01:00
Alexander Kukushkin
13e24d832d Advanced validation of PostgreSQL parameters (#1674)
So far Patroni was comparing the old value (from `pg_settings`) with the new value (from the Patroni configuration or from DCS) in order to figure out whether a reload or restart is required when a parameter has been changed. If the given parameter was missing from `pg_settings`, Patroni ignored it and didn't write it into `postgresql.conf`.

If Postgres was not running, no validation was performed, and parameters and values were written into the config as-is.

It is not a very common mistake, but people tend to mistype parameter names or values.
Also, it happens that some parameters are removed in specific Postgres versions and some new ones are added (e.g. `checkpoint_segments` was replaced with `min_wal_size` and `max_wal_size` in 9.5, and `wal_keep_segments` was replaced with `wal_keep_size` in 13).

Writing nonexistent parameters or invalid values into `postgresql.conf` makes Postgres unable to start.
This change doesn't solve the issue 100%, but it comes very close to that goal.
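
As a toy illustration of the approach, a validator with a tiny hand-written catalog (the version ranges below are only for illustration; Patroni's real catalog of parameters and validation rules is far more complete):

```python
# Illustrative sketch: accept a parameter only if it exists in the running
# PostgreSQL version, instead of blindly writing it into postgresql.conf.
CATALOG = {
    # name: (version it appeared in, version it was removed in), as integers
    'min_wal_size': (90500, None),
    'wal_keep_segments': (90000, 130000),
    'wal_keep_size': (130000, None),
}

def accept_parameter(name, server_version):
    if name not in CATALOG:
        return False  # unknown, possibly mistyped parameter
    introduced, removed = CATALOG[name]
    return introduced <= server_version and (removed is None or server_version < removed)

assert not accept_parameter('wal_keep_segments', 130000)  # removed in 13
assert accept_parameter('wal_keep_size', 130000)          # added in 13
assert not accept_parameter('wal_keep_sise', 130000)      # typo is rejected
```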
2020-09-01 16:26:57 +02:00
Alexander Kukushkin
3341c898ff Add Etcd v3 protocol support via api gRPC-gateway (#1162)
The only python-etcd3 client working directly via gRPC still supports just a single endpoint, which is not very nice for high availability.

Since Patroni is already using a heavily hacked version of python-etcd with smart retries and auto-discovery out of the box, I decided to enhance the existing code with limited support for the v3 protocol via the gRPC-gateway (see the sketch after the list of limitations).

Unfortunately, watches via the gRPC-gateway require us to open and keep a second connection to etcd.

Known limitations:
* The minimum supported version is 3.0.4. On earlier versions transactions don't work due to bugs in grpc-gateway, and without transactions we can't do atomic operations, i.e. leader locks.
* Watches work only starting from 3.1.0
* Authentication works only starting from 3.3.0
* gRPC-gateway does not support authentication using TLS Common Name. This is because gRPC-proxy terminates TLS from its client so all the clients share a cert of the proxy: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/authentication.md#using-tls-common-name
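
For illustration, here is a create-if-absent transaction (the building block of a leader lock) sent through the gRPC-gateway JSON API; the `/v3` URL prefix assumes etcd 3.4+ (older releases used `/v3alpha` or `/v3beta`), and keys/values must be base64-encoded:

```python
# Illustrative sketch: atomically create the leader key only if it doesn't
# exist yet, using the etcd gRPC-gateway JSON API instead of native gRPC.
import base64
import json
import urllib.request

def b64(text):
    return base64.b64encode(text.encode()).decode()

txn = {
    'compare': [{'key': b64('/service/batman/leader'),
                 'target': 'CREATE', 'create_revision': '0'}],
    'success': [{'request_put': {'key': b64('/service/batman/leader'),
                                 'value': b64('postgresql0')}}],
}
req = urllib.request.Request('http://127.0.0.1:2379/v3/kv/txn',
                             data=json.dumps(txn).encode(),
                             headers={'Content-Type': 'application/json'})
with urllib.request.urlopen(req) as resp:
    # "succeeded" is true when the compare passed and the key was created;
    # the gateway omits the field when it is false
    print(json.load(resp).get('succeeded', False))
```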
2020-07-31 14:33:40 +02:00
Alexander Kukushkin
a68692a3e4 Get rid of kubernetes python module (#1586)
The official python kubernetes client contains a lot of auto-generated code and is therefore very heavy, but we need only a small fraction of it.
The naive implementation that covers all the API methods we use takes about 250 LoC, and about half of it is responsible for handling configuration files.

Disadvantage: if somebody was using `patronictl` outside of the pod (on their own machine), it might not work anymore (depending on the environment).
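
To give a feeling of how small the naive approach can be, a hypothetical in-cluster call with nothing but `urllib3`, reading the standard in-pod service-account credentials:

```python
# Illustrative sketch: list pods in a namespace without any kubernetes module.
import json
import urllib3

TOKEN_PATH = '/var/run/secrets/kubernetes.io/serviceaccount/token'
CA_PATH = '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'

with open(TOKEN_PATH) as f:
    token = f.read().strip()

http = urllib3.PoolManager(ca_certs=CA_PATH, cert_reqs='CERT_REQUIRED')
resp = http.request(
    'GET', 'https://kubernetes.default.svc/api/v1/namespaces/default/pods',
    headers={'Authorization': 'Bearer ' + token})
for pod in json.loads(resp.data)['items']:
    print(pod['metadata']['name'])
```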
2020-07-17 08:31:58 +02:00
ksarabu1
8a62999eaa replica & async rest API health check enhancement (#1599)
- ``GET /replica?lag=<max-lag>``: replica check endpoint.
- ``GET /asynchronous?lag=<max-lag>`` or ``GET /async?lag=<max-lag>``: asynchronous standby check endpoint.

Checks replication latency and returns status code **200** only when the latency is below the specified value. For performance reasons, the leader_optime key from DCS is used as the leader WAL position when computing the latency on the replica. Please note that the value in leader_optime might be a couple of seconds old (based on loop_wait).
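
For example, a health check standing in for a load balancer probe might look like this (hypothetical address; the lag threshold here is given in bytes):

```python
# Illustrative sketch: route traffic to a replica only when its lag is
# below 1 MB; Patroni answers anything but 200 when the check fails.
import urllib.error
import urllib.request

def replica_is_healthy(base_url, max_lag=1048576):
    try:
        with urllib.request.urlopen('%s/replica?lag=%s' % (base_url, max_lag)) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False  # e.g. 503 when the lag is too high or the node is a leader

print(replica_is_healthy('http://127.0.0.1:8008'))
```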

Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
2020-07-15 10:36:48 +02:00
Alexander Kukushkin
98c2081c67 Detect a new timeline in the standby cluster (#1522)
The standby cluster doesn't know about leader elections in the main cluster, and therefore the usual mechanisms of detecting divergence don't work. For example, it could happen that the standby cluster is ahead of the new primary of the main cluster and must be rewound.
There is a way to know that a new timeline has been created: check for the presence of a new history file in pg_wal. If the new file is there, we start the usual procedures of making sure that we can continue streaming, or run pg_rewind.
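
A sketch of the detection, with a hypothetical path; timeline history files are named `XXXXXXXX.history`, where the first eight characters are the timeline in hex:

```python
# Illustrative sketch: report whether pg_wal contains a history file for a
# timeline newer than the one the replica is currently on.
import os
import re

def new_timeline_detected(pg_wal_dir, current_timeline):
    for name in os.listdir(pg_wal_dir):
        match = re.match(r'([0-9A-F]{8})\.history$', name)
        if match and int(match.group(1), 16) > current_timeline:
            return True
    return False

print(new_timeline_detected('/pgdata/pgroot/data/pg_wal', 3))
```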
2020-05-29 14:14:47 +02:00
Alexander Kukushkin
6a0d2924a0 Separate received and replayed location (#1514)
When deciding whether a running replica is able to stream from the new primary or must be rewound, we should use the replayed location; therefore we extract the received and replayed locations independently.

Reuse the part of the query that extracts the timeline and locations in the REST API.
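
The two locations come from separate functions, so the extraction might be sketched like this (function names assume PostgreSQL 10+; the DSN is hypothetical):

```python
# Illustrative sketch: fetch received and replayed locations independently;
# the replayed one drives the rewind-or-stream decision.
import psycopg2

conn = psycopg2.connect('host=replica dbname=postgres')
with conn.cursor() as cur:
    cur.execute('SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()')
    received, replayed = cur.fetchone()
print('received:', received, 'replayed:', replayed)
```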
2020-05-27 13:33:37 +02:00
Alexander Kukushkin
ad5c686c11 Take advantage of pg_stat_wal_receiver (#1513)
So far Patroni was parsing `recovery.conf` or querying `pg_settings` in order to get the current values of recovery parameters. On PostgreSQL earlier than 12 it could easily happen that the value of `primary_conninfo` in `recovery.conf` had nothing to do with reality. Luckily for us, on PostgreSQL 9.6+ there is the `pg_stat_wal_receiver` view, which contains the current values of `primary_conninfo` and `primary_slot_name`. The password field is masked, but this is fine, because authentication happens only when the connection is opened. We compare all other parameters as usual.

Another advantage of `pg_stat_wal_receiver` is that it contains the current timeline, so on 9.6+ we don't need to use the replication-connection trick when the walreceiver process is alive.

If there is no walreceiver process, or it is not streaming, we stick to the old methods.
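
A sketch of what can be read from the view (hypothetical DSN; the password inside `conninfo` comes back masked):

```python
# Illustrative sketch: read the effective recovery settings of a live
# walreceiver instead of trusting recovery.conf.
import psycopg2

conn = psycopg2.connect('host=replica dbname=postgres')
with conn.cursor() as cur:
    cur.execute('SELECT status, received_tli, slot_name, conninfo'
                ' FROM pg_stat_wal_receiver')
    row = cur.fetchone()
if row and row[0] == 'streaming':
    status, timeline, slot_name, conninfo = row
    print(timeline, slot_name, conninfo)
else:
    print('walreceiver is absent or not streaming: use the old methods')
```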
2020-05-15 18:04:24 +02:00
Alexander Kukushkin
30aa355eb5 Shorten and beautify history log output (#1526)
When Patroni is trying to figure out whether pg_rewind is necessary, it could write the content of the history file from the primary into the log. The history file grows with every failover/switchover and eventually starts taking up too many lines in the log, most of which are not very useful.
Instead of showing the raw data, we will show only 3 lines before the current replica timeline and 2 lines after.
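
A sketch of the windowing (the tuple layout of the parsed history lines is hypothetical):

```python
# Illustrative sketch: instead of logging the whole history file, keep only
# 3 lines before the current replica timeline and 2 lines after it.
def shorten_history(lines, current_timeline):
    # lines: parsed history entries as (timeline, switchpoint_lsn, reason)
    i = next((n for n, line in enumerate(lines)
              if line[0] >= current_timeline), len(lines))
    return lines[max(i - 3, 0):i + 3]
```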
2020-05-15 16:14:25 +02:00
Alexander Kukushkin
337f9efc9e Improve patronictl list output (#1486)
Instead of a redundant `Cluster` column, the cluster name is presented in the table header.

Depending on the output format, `Tags` are serialized differently:
* For the *pretty* format YAML is used, with every element on a new line
* For the *tsv* format YAML is also used, but with all elements on the same line (similar to JSON)
* For the *json* and *yaml* formats `Tags` are serialized into the appropriate format.

<details><summary>Examples of output in pretty formats:</summary>

```bash
$ patronictl list
+ Cluster: batman (6813309862653668387) +---------+----+-----------+---------------------+
|    Member   |      Host      |  Role  |  State  | TL | Lag in MB | Tags                |
+-------------+----------------+--------+---------+----+-----------+---------------------+
| postgresql0 | 127.0.0.1:5432 | Leader | running |  3 |           | clonefrom: true     |
|             |                |        |         |    |           | noloadbalance: true |
|             |                |        |         |    |           | nosync: true        |
+-------------+----------------+--------+---------+----+-----------+---------------------+
| postgresql1 | 127.0.0.1:5433 |        | running |  3 |       0.0 |                     |
+-------------+----------------+--------+---------+----+-----------+---------------------+

$ patronictl list badclustername
+ Cluster: badclustername (uninitialized) ------+
| Member | Host | Role | State | TL | Lag in MB |
+--------+------+------+-------+----+-----------+
+--------+------+------+-------+----+-----------+
```
</details>

<details><summary>Example in tsv format:</summary>

```bash
Cluster Member  Host    Role    State   TL      Lag in MB       Pending restart Tags
batman  postgresql0     127.0.0.1:5432  Leader  running 2
batman  postgresql1     127.0.0.1:5433          running 2       0               {clonefrom: true, nofailover: true, noloadbalance: true, replicatefrom: postgresql0}
batman  postgresql2     127.0.0.1:5434          running 2       0       *       {replicatefrom: postgres1}
```
</details>

In addition to that, the `patronictl list` command will stop showing keys with empty values in the `json` and `yaml` formats.
<details><summary>Examples:</summary>

```yaml
$ patronictl list -f yaml
- Cluster: batman
  Host: 127.0.0.1:5432
  Member: postgresql0
  Role: Leader
  State: running
  TL: 2
- Cluster: batman
  Host: 127.0.0.1:5433
  Lag in MB: 0
  Member: postgresql1
  State: running
  TL: 2
  Tags:
    clonefrom: true
    nofailover: true
    noloadbalance: true
    replicatefrom: postgresql0
- Cluster: batman
  Host: 127.0.0.1:5434
  Lag in MB: 0
  Member: postgresql2
  Pending restart: '*'
  State: running
  TL: 2
  Tags:
    replicatefrom: postgres1
```

```json
$ patronictl list -f json | jq .
[
  {
    "Cluster": "batman",
    "Member": "postgresql0",
    "Host": "127.0.0.1:5432",
    "Role": "Leader",
    "State": "running",
    "TL": 2
  },
  {
    "Cluster": "batman",
    "Member": "postgresql1",
    "Host": "127.0.0.1:5433",
    "State": "running",
    "TL": 2,
    "Lag in MB": 0,
    "Tags": {
      "nofailover": true,
      "noloadbalance": true,
      "replicatefrom": "postgresql0",
      "clonefrom": true
    }
  },
  {
    "Cluster": "batman",
    "Member": "postgresql2",
    "Host": "127.0.0.1:5434",
    "State": "running",
    "TL": 2,
    "Lag in MB": 0,
    "Pending restart": "*",
    "Tags": {
      "replicatefrom": "postgres1"
    }
  }
]
```
</details>
2020-04-15 12:19:18 +02:00
ksarabu1
e3335bea1a Master stop timeout (#1445)
## Feature: Postgres stop timeout

A switchover/failover operation hangs on the signal_stop (or checkpoint) call when the postmaster doesn't respond or hangs for some reason (issue described in [1371](https://github.com/zalando/patroni/issues/1371)). This leads to loss of service for an extended period of time, until the hung postmaster starts responding or is killed by some other actor.

### master_stop_timeout

The number of seconds Patroni is allowed to wait when stopping Postgres; effective only when synchronous_mode is enabled. When set to a value > 0 and synchronous_mode is enabled, Patroni sends SIGKILL to the postmaster if the stop operation runs for longer than master_stop_timeout. Set the value according to your durability/availability trade-off. If the parameter is not set, or is set to a value <= 0, master_stop_timeout does not apply.
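
A sketch of the escalation, with hypothetical helpers (`stop_postgres`, `postmaster_pid`) standing in for the real stop machinery:

```python
# Illustrative sketch: let a graceful stop run for at most master_stop_timeout
# seconds, then SIGKILL the postmaster (only with synchronous_mode enabled).
import os
import signal
import threading

def stop_with_timeout(stop_postgres, postmaster_pid,
                      master_stop_timeout, synchronous_mode):
    timer = None
    if synchronous_mode and master_stop_timeout and master_stop_timeout > 0:
        timer = threading.Timer(master_stop_timeout,
                                os.kill, (postmaster_pid, signal.SIGKILL))
        timer.start()
    try:
        stop_postgres()  # blocks until the postmaster exits... or hangs
    finally:
        if timer:
            timer.cancel()
```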
2020-04-15 12:18:49 +02:00
Alexander Kukushkin
16d1ffdde7 Update timeline on standby cluster (#1332)
Fixes https://github.com/zalando/patroni/issues/1031
2019-12-20 12:56:00 +01:00
Igor Yanchenko
7ff27d9e10 Make sure unix_socket_directories and stats_temp_directory exist (#1293)
Upon the start of Patroni and Postgres, make sure that unix_socket_directories and stats_temp_directory exist, or try to create them. Patroni will exit if it fails to create them.
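
A minimal sketch of the behavior (the directory paths are hypothetical examples):

```python
# Illustrative sketch: create the configured directories, or give up early.
import os
import sys

for path in ('/var/run/postgresql', '/var/lib/postgresql/stats_temp'):
    try:
        os.makedirs(path, exist_ok=True)
    except OSError as exc:
        sys.exit('failed to create %s: %s' % (path, exc))
```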

Close https://github.com/zalando/patroni/issues/863
2019-12-11 12:26:17 +01:00
Alexander Kukushkin
7793887ea7 Fix tests on windows (#1303)
and disable junit, as it produces a deprecation warning
2019-11-27 14:57:33 +01:00
Alexander Kukushkin
90a4208390 Get rid of requests module (#1296)
It wasn't used for anything critical anyway, so it doesn't make a lot of sense to keep it as an explicit dependency.
2019-11-22 15:31:55 +01:00
Alexander Kukushkin
828585079f Improve workflow when PGDATA is not empty during bootstrap (#1217)
Recently it has happened twice that people tried to deploy a new cluster but the postgres data directory wasn't empty and also wasn't valid. In this case Patroni was still creating the initialize key in DCS and trying to start postgres up.
Now it will complain about a non-empty, invalid postgres data directory and exit.
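
A sketch of the distinction being made (the real validity check is more thorough than looking for `PG_VERSION`):

```python
# Illustrative sketch: classify the data directory before bootstrapping.
import os

def data_directory_state(pgdata):
    if not os.path.exists(pgdata) or not os.listdir(pgdata):
        return 'empty'    # safe to bootstrap with initdb
    if os.path.isfile(os.path.join(pgdata, 'PG_VERSION')):
        return 'valid'    # looks like an existing cluster, safe to start
    return 'invalid'      # non-empty but not a cluster: complain and exit
```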

Close https://github.com/zalando/patroni/issues/1216
2019-10-25 14:09:44 +02:00
Alexander Kukushkin
367d787ff9 Implement /history and /cluster endpoints (#1191)
The /history endpoint shows the content of the `history` key in DCS.
The /cluster endpoint shows all cluster members and some service info, like pending and scheduled restarts or switchovers.

In addition to that, implement `patronictl history`.
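
For instance, both endpoints can be queried straight from the REST API (hypothetical address):

```python
# Illustrative sketch: fetch the new endpoints and print the raw JSON.
import json
import urllib.request

for endpoint in ('history', 'cluster'):
    with urllib.request.urlopen('http://127.0.0.1:8008/' + endpoint) as resp:
        print(endpoint, json.load(resp))
```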

Close #586
Close #675
Close #1133
2019-10-22 17:19:02 +02:00
Alexander Kukushkin
3d29cb7e50 Perform pg_ctl reload regardless of config changes (#1204)
It is possible that some config files are not controlled by Patroni, and when somebody does a reload via the REST API or by sending SIGHUP to the Patroni process, the usual expectation is that postgres will also be reloaded. However, this didn't happen when there were no changes in the postgresql section of the Patroni config.

For example, one might replace ssl_cert_file and ssl_key_file on the filesystem; starting from PostgreSQL 10 this only requires a reload, but Patroni wasn't doing it.

In addition to that, fix the issue with handling of `wal_buffers`: the default value depends on `shared_buffers` and `wal_segment_size`, and therefore Patroni was exposing pending_restart even when the new value in the config was explicitly set to -1 (the default).
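
A sketch of the `wal_buffers` special case (a simplification; the real comparison logic covers all parameters generically):

```python
# Illustrative sketch: an explicit -1 means "use the computed default",
# so it must never be reported as a pending-restart change.
def wal_buffers_pending_restart(config_value, pg_settings_value):
    if str(config_value) == '-1':
        return False  # computed from shared_buffers and wal_segment_size
    return str(config_value) != str(pg_settings_value)
```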

Close https://github.com/zalando/patroni/issues/1198
2019-10-10 14:49:30 +02:00
Alexander Kukushkin
1572c02ced Use passfile in the primary_conninfo instead of password (#1194)
Fixed a few minor issues related to #1134 and #1122.
Close https://github.com/zalando/patroni/issues/1185
2019-10-09 18:04:14 +02:00
Alexander Kukushkin
0b1b1e3b54 Compatibility with postgresql 12 (#1068)
* use `SHOW primary_conninfo` instead of parsing config file on pg12
* strip out standby and recovery parameters from postgresql.auto.conf before starting the postgres 12

The Patroni config remains backward compatible.
Even though, for example, `restore_command` was converted to a GUC starting from PostgreSQL 12, in the Patroni configuration you can still keep it in the `postgresql.recovery_conf` section.
If you put it into `postgresql.parameters.restore_command`, that will also work, but it is important not to mix both ways:
```yaml
# is OK
postgresql:
  parameters:
    restore_command: my_restore_command
    archive_cleanup_command: my_archive_cleanup_command

# is OK
postgresql:
  recovery_conf:
    restore_command: my_restore_command
    archive_cleanup_command: my_archive_cleanup_command

# is NOT ok
postgresql:
  parameters:
    restore_command: my_restore_command
  recovery_conf:
    archive_cleanup_command: my_archive_cleanup_command
```
2019-08-02 16:00:55 +02:00
Alexander Kukushkin
a4bd6a9b4b Refactor postgresql class (#1060)
* Convert postgresql.py into a package
* Factor out cancellable process into a separate class
* Factor out connection handler into a separate class
* Move postmaster into postgresql package
* Factor out pg_rewind into a separate class
* Factor out bootstrap into a separate class
* Factor out slots handler into a separate class
* Factor out postgresql config handler into a separate class
* Move callback_executor into postgresql package

This is just careful refactoring, without functional changes.
2019-05-21 16:02:47 +02:00