1863 Commits

Alexander Kukushkin
311e5ccc5b Release 2.0.1 (#1722)
* Bump version
* Update release notes
v2.0.1
2020-10-01 15:18:53 +02:00
Kostiantyn Nemchenko
00cc62726d Add sslpassword connection parameter support (#1721)
This PR improves compatibility with PostgreSQL 13 by adding one more connection parameter, `sslpassword`.

Closes #1719
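
For illustration, a minimal sketch of where the new parameter could be used in a Patroni configuration (the placement under `authentication`, and all paths and values, are illustrative assumptions, not taken from the PR):

```
postgresql:
  authentication:
    replication:
      username: replicator
      sslcert: /etc/patroni/replication.crt   # hypothetical paths
      sslkey: /etc/patroni/replication.key    # client key protected by a passphrase
      sslpassword: my-key-passphrase          # passphrase for sslkey (libpq, PG 13+)
```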
2020-10-01 14:37:40 +02:00
Alexander Kukushkin
fa88d80c4f Apply master_start_timeout when executing crash recovery (#1720)
It is not very common, but the master Postgres might "crash" for various reasons, such as OOM or running out of disk space. Of course, there is a chance that the current node holds some unreplicated data, and therefore Patroni by default prefers to start Postgres on the leader node rather than failing over.

In order to be on the safe side, Patroni always starts Postgres in recovery, no matter whether the current node owns the leader lock or not. If Postgres wasn't shut down cleanly, starting in recovery might fail; therefore, in some cases, as a workaround, Patroni executes crash recovery by starting Postgres in single-user mode.

A few times we ended up in the following situation:
1. The master Postgres crashed because it ran out of disk space
2. Patroni started crash recovery in single-user mode
3. While doing crash recovery, Patroni kept updating the leader lock

This leaves Patroni stuck on step 3, and manual intervention is required to recover the cluster.

Patroni already has the `master_start_timeout` option, which controls how long Postgres is allowed to stay in the `starting` state; after that, Patroni might decide to release the leader lock if there are healthy replicas available that could take it over.

This PR makes the `master_start_timeout` option also apply to crash recovery.
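
A sketch of the relevant dynamic (DCS) configuration, with illustrative values:

```
# If the master stays in the `starting` state longer than
# master_start_timeout seconds -- now including single-user-mode crash
# recovery -- Patroni may release the leader lock and fail over.
master_start_timeout: 300   # seconds; illustrative value
loop_wait: 10
ttl: 30
```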
2020-09-30 08:04:27 +02:00
Alexander Kukushkin
2c5d62bf10 Workaround unittest bug and fix requirements (#1718)
* unittest bug: https://bugs.python.org/issue25532
* `urllib3[secure]` wrongly depends on `ipaddress` for python3, while in fact we don't need all the dependencies of the `secure` extra, only `ipaddress` for `kubernetes` on python2.7

Close https://github.com/zalando/patroni/issues/1717
Close https://github.com/zalando/patroni/issues/1709
2020-09-29 15:15:58 +02:00
sergey grinko
e2b15eacdf Update patroni.service (#1702)
If "WorkingDirectory" not set, defaults to the respective user's home directory if run as user.

Close #1688
2020-09-28 12:26:17 +02:00
Alexander Kukushkin
885d226dac Add support of raft bind_addr and password (#1713)
Close https://github.com/zalando/patroni/issues/1705
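
A configuration sketch, assuming the usual `raft` section layout (addresses and the password are made up):

```
raft:
  self_addr: 10.0.1.11:5010    # address advertised to other nodes
  bind_addr: 0.0.0.0:5010      # new: listen address may differ from self_addr
  partner_addrs:
    - 10.0.1.12:5010
    - 10.0.1.13:5010
  password: raft-secret        # new: encrypts raft traffic
```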
2020-09-28 11:05:07 +02:00
Alexander Kukushkin
fa6c396589 Fix bug in the get_guc_value() (#1712)
The `-D` parameter was forgotten.
2020-09-23 13:33:54 +02:00
Alexander Kukushkin
8a8409999d Change the behavior in pause (#1687)
1. Don't call bootstrap if PGDATA is missing/empty, because it might be intentional, and someone/something might be working on it.
2. Consider postgres running as a leader in pause unhealthy if the pg_control sysid doesn't match the /initialize key (an empty initialize key would allow the "race", and the leader would "restore" the initialize key).
3. Don't exit on sysid mismatch in pause; only log a warning.
4. Cover corner cases when Patroni is started in pause with an empty PGDATA that is later restored by somebody else.
5. An empty string is a valid `recovery_target` value.
2020-09-18 08:25:00 +02:00
Pavlo Golub
e27ff480d0 Allow custom pager support in patronictl edit-config (#1696)
Fixes #1695
2020-09-16 15:21:52 +02:00
Alexander Kukushkin
6706decc1c Fix hanging patronictl when RAFT is being used (#1697)
Close #1694
2020-09-16 14:03:40 +02:00
Alexander Kukushkin
83f9a031b8 Update issue templates (#1678)
We want to avoid ping-pong as much as possible.
2020-09-16 13:51:20 +02:00
Alexander Kukushkin
4dd902fbf1 Fix bug in kubernetes.update_leader (#1685)
An unhandled exception prevented demoting the primary.
In addition, wrap the update_leader call in the HA loop in a try..except block and implement a test case.

Fixes https://github.com/zalando/patroni/issues/1684
2020-09-11 10:19:03 +02:00
Alexander Kukushkin
0a1f389686 Release 2.0.0 (#1680)
* update release notes
* bump version
* change the default alignment in patronictl table output to `left`
* add missing tests
* add missing pieces to the documentation
v2.0.0
2020-09-02 15:35:04 +02:00
Floris van Nee
98f50423ca Add support for configuration directories (#1669) (#1671)
It is now also possible to point the configuration path to a directory instead of a file.
Patroni will find all YAML files in the directory and apply them in sorted order.

Close https://github.com/zalando/patroni/issues/1669
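
A sketch of such a directory, with hypothetical file names; the two files are shown as one multi-document block for brevity, and later files override earlier ones:

```
# /etc/patroni/conf.d/10-base.yml
scope: demo
restapi:
  listen: 0.0.0.0:8008
---
# /etc/patroni/conf.d/20-local.yml (applied after 10-base.yml)
postgresql:
  parameters:
    max_connections: 200
```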
2020-09-02 13:57:22 +02:00
Alexander Kukushkin
13e24d832d Advanced validation of PostgreSQL parameters (#1674)
So far, Patroni performed a comparison of the old value (in `pg_settings`) with the new value (from the Patroni configuration or from the DCS) in order to figure out whether a reload or a restart is required when a parameter has changed. If a given parameter was missing from `pg_settings`, Patroni ignored it and didn't write it into `postgresql.conf`.

In case Postgres is not running, no validation was performed, and parameters and values were written into the config as is.

It is not a very common mistake, but people tend to mistype parameter names or values.
Also, it happens that some parameters are removed in specific Postgres versions and some new ones are added (e.g. `checkpoint_segments` was replaced with `min_wal_size` and `max_wal_size` in 9.5, and `wal_keep_segments` was replaced with `wal_keep_size` in 13).

Writing nonexistent parameters or invalid values into `postgresql.conf` makes Postgres unstartable.
This change doesn't solve the issue 100%, but it gets very close to that goal.
2020-09-01 16:26:57 +02:00
Sergey Dudoladov
950eff27ad Optional fencing script (pre_promote) (#1099)
Call a fencing script after acquiring the leader lock. If the script doesn't finish successfully, don't promote, but remove the leader key instead.

Close https://github.com/zalando/patroni/issues/1567
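
A configuration sketch (the script path is hypothetical; the key placement follows the Patroni docs):

```
postgresql:
  # Called after acquiring the leader lock; a non-zero exit code makes
  # Patroni remove the leader key instead of promoting.
  pre_promote: /usr/local/bin/fence_old_primary.sh
```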
2020-09-01 07:50:39 +02:00
krishna
e87fc12aeb Bug fix for hba remove only if exists post bootstrap (#1670)
Patroni fails if pg_hba.conf happens to be located in a non-PGDATA directory after a custom bootstrap.

```
2020-08-27 23:25:06,966 INFO: establishing a new patroni connection to the postgres cluster
2020-08-27 23:25:06,997 INFO: running post_bootstrap
2020-08-27 23:25:07,009 ERROR: post_bootstrap
Traceback (most recent call last):
  File "../lib/python3.8/site-packages/patroni/postgresql/bootstrap.py", line 357, in post_bootstrap
    os.unlink(postgresql.config.pg_hba_conf)
FileNotFoundError: [Errno 2] No such file or directory: '../pgrept1/sysdata/pg_hba.conf'
2020-08-27 23:25:07,016 INFO: removing initialize key after failed attempt to bootstrap the cluster
```
2020-08-28 08:25:02 +02:00
Kostiantyn Nemchenko
918a57fe0c Add no_params option for custom bootstrap method (#1664)
Close #1475
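
A sketch of a custom bootstrap method using the new option (the method name and command are hypothetical):

```
bootstrap:
  method: restore_backup
  restore_backup:
    command: /usr/local/bin/restore_latest_backup.sh
    no_params: true   # new: don't append the --scope/--datadir options to the command
```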
2020-08-28 08:23:00 +02:00
Kostiantyn Nemchenko
48aa0ba61b Add SSL support for ZooKeeper (#1662)
Close #1658
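
A configuration sketch with made-up hosts and paths (key names follow the Patroni docs for the `zookeeper` section):

```
zookeeper:
  hosts:
    - zoo1:2281
    - zoo2:2281
  use_ssl: true
  cacert: /etc/patroni/zk-ca.pem
  cert: /etc/patroni/zk-client.pem
  key: /etc/patroni/zk-client.key
  verify: true   # verify the server certificate
```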
2020-08-28 08:22:15 +02:00
Yogesh Sharma
62463db5e2 Add support for user defined HTTP header to Patroni REST API response (#1645)
Close #1644
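
A sketch of how such headers might be configured (header names and values are illustrative):

```
restapi:
  listen: 0.0.0.0:8008
  http_extra_headers:
    X-Frame-Options: SAMEORIGIN
    X-Content-Type-Options: nosniff
  https_extra_headers:   # applied on top of http_extra_headers when SSL is enabled
    Strict-Transport-Security: max-age=31536000
```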
2020-08-26 17:37:02 +02:00
Victor Sudakov
35c3fd37a1 Security of Patroni (#1655)
Close #1636
2020-08-26 16:42:33 +02:00
Alexander Kukushkin
3e553df69d BUGFIX: pause on K8s (#1659)
On K8s the `Cluster.leader` is a valid object even if the cluster has no leader, because we need to know the `resourceVersion` for future CAS operations. Such a non-empty object broke the HA loop and made other nodes think that the leader is there.

The right way to identify a missing leader, which works reliably across all DCS implementations, is checking that the leader's name is empty.
2020-08-24 16:35:46 +02:00
Feike Steenbergen
e3bc546dd5 Move WAL and tablespaces after a failed init (#1631)
For init processes that use a symlinked WAL directory, or custom scripts that create new tablespaces, these directories should also be renamed after a failed init attempt. Currently the following errors occur when the first init attempt fails but a second one might succeed:

      fixing permissions on existing directory /var/lib/postgresql/data ... ok
      initdb: error: directory "/var/lib/postgresql/wal/pg_wal" exists but is not empty
      [...]
      File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1173, in post_bootstrap
        self.cancel_initialization()
      File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1168, in cancel_initialization
        raise PatroniException('Failed to bootstrap cluster')
      patroni.exceptions.PatroniException: 'Failed to bootstrap cluster'

The remove_data_directory function already does the same when removing the data directory; the same kind of thing should happen when moving a data directory.

To ensure the data directory can still be used, the symlinks will point to the renamed directories.
2020-08-17 16:12:33 +02:00
Alexander Kukushkin
7bf60b64b0 Compatibility with PostgreSQL 13 (#1654)
So far, Patroni enforced the same value of `wal_keep_segments` on all nodes in the cluster. If the parameter was missing from the global configuration, the default value `8` was used.
In pg13 beta3 `wal_keep_segments` was renamed to `wal_keep_size`, and that broke Patroni.

If `wal_keep_segments` happens to be present in the configuration for pg13, Patroni will recalculate the value to `wal_keep_size`, assuming that `wal_segment_size` is 16MB. Sure, it would be possible to get the real value of `wal_segment_size` from pg_control, but since we are dealing with a case of misconfiguration, it is not worth spending time on it.
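
For example, under the assumed 16MB segment size, the translation looks roughly like this (values are illustrative):

```
# Configured for pg12 and older:
postgresql:
  parameters:
    wal_keep_segments: 64
# On pg13 Patroni would write the recalculated equivalent instead:
#   wal_keep_size: 1024MB   # 64 segments * 16MB
```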
2020-08-17 10:45:02 +02:00
Alexander Kukushkin
23dcfaab49 Make it possible to bypass kubernetes service (#1614)
When running on K8s, Patroni communicates with the API via the `kubernetes` service, whose address is exposed via the `KUBERNETES_SERVICE_HOST` environment variable. Like any other service, the `kubernetes` service is handled by `kube-proxy`, which, depending on its configuration, relies either on a userspace program or on `iptables` for traffic routing.

During a K8s upgrade, when master nodes are replaced, it is possible that `kube-proxy` doesn't update the service configuration in time, and as a result Patroni fails to update the leader lock and demotes Postgres.

In order to improve the user experience and get more control over the problem, we make it possible to bypass the `kubernetes` service and connect directly to the API nodes.
The strategy is very simple:
1. Resolve the list of API node IPs from the kubernetes endpoint on every iteration of the HA loop.
2. Stick to one of these IPs for API requests.
3. Switch to a different IP if the one currently connected to is no longer in the list.
4. If a request fails, switch to another IP and retry.

Such a strategy is already used for Etcd and has proven to work quite well.

In order to enable the feature, you need to either set `kubernetes.bypass_api_service` to `true` in the Patroni configuration file or set the `PATRONI_KUBERNETES_BYPASS_API_SERVICE` environment variable.

If for some reason `GET /default/endpoints/kubernetes` isn't allowed, Patroni will disable the feature.
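
Enabling the feature in the configuration file is a one-liner:

```
kubernetes:
  bypass_api_service: true   # talk to the API server nodes directly, not via the service
```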
2020-08-14 12:39:47 +02:00
ksarabu1
1ab709c5f0 Multi Sync Standby Support (#1594)
The new parameter `synchronous_node_count` is used by Patroni to manage the number of synchronous standby databases. It is set to 1 by default. It has no effect when `synchronous_mode` is off. When enabled, Patroni maintains the precise number of synchronous standby databases based on the `synchronous_node_count` parameter and adjusts the state in the DCS and `synchronous_standby_names` as members join and leave.

This functionality can be further extended in the future to support priority-based (FIRST n) and quorum-based (ANY n) synchronous replication.
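
A sketch of the relevant dynamic (DCS) configuration:

```
synchronous_mode: true
synchronous_node_count: 2   # maintain exactly two synchronous standbys
```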
2020-08-14 11:51:07 +02:00
ksarabu1
fce1955218 Fix to skip physical replication slot creation for leader node with special chars (#1651)
Patroni appeared to be creating a dormant slot (when `slots` is defined) for the leader node when its name contains special characters such as '-' (e.g. "abc-us-1").
2020-08-13 16:08:14 +02:00
Alexander Kukushkin
a6faf9b2d9 Refactor docker-compose.yml for better compatibility with new version (#1641)
The newest versions of docker-compose want some values double-quoted in the env file, while old versions fail to process such files.
The solution is simple: move some of the parameters to `docker-compose.yml` and rely on anchors for inheritance.
Since the main idea behind env files was to keep "secret" information out of the main YAML, we also get rid of any non-secret stuff, mainly located in etcd.env.
2020-08-11 09:31:49 +02:00
Alexander Kukushkin
a9915fb3c9 Explicitly disallow patching non-existent config (#1639)
For DCS other than `kubernetes` it was failing with an exception due to `cluster.config` being `None`, but on Kubernetes it was happily creating the config annotation, which prevented writing the bootstrap configuration after the bootstrap finished.
2020-08-07 09:36:56 +02:00
Alexander Kukushkin
f1c6b0bebe Windows compatibility fixes (#1633)
* pg_rewind error messages contain '/' as directory separator
* fix Raft unit tests on win
* fix validator unit tests on win
* fix keepalive unit tests on win
* make the standby-cluster behave tests less flaky
2020-07-31 15:43:50 +02:00
Victor Sudakov
d4c6987f78 First variant of notes on PostgreSQL major upgrades. (#1634)
[skip ci]
2020-07-31 15:43:02 +02:00
Alexander Kukushkin
3341c898ff Add Etcd v3 protocol support via api gRPC-gateway (#1162)
The only python etcd v3 client working directly via gRPC still supports just a single endpoint, which is not very nice for high availability.

Since Patroni is already using a heavily hacked version of python-etcd with smart retries and auto-discovery out of the box, I decided to enhance the existing code with limited support of the v3 protocol via gRPC-gateway.

Unfortunately, watches via gRPC-gateway require us to open and keep a second connection to etcd.

Known limitations:
* The minimal supported version is 3.0.4. On earlier versions transactions don't work due to bugs in grpc-gateway, and without transactions we can't do atomic operations, i.e. leader locks.
* Watches work only starting from 3.1.0.
* Authentication works only starting from 3.3.0.
* gRPC-gateway does not support authentication using the TLS Common Name, because gRPC-proxy terminates TLS from its client, so all clients share the cert of the proxy: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/authentication.md#using-tls-common-name
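
A configuration sketch, assuming the v3 support is selected via an `etcd3` section whose keys mirror the existing `etcd` ones (hosts are made up):

```
etcd3:
  hosts:
    - 10.0.1.21:2379
    - 10.0.1.22:2379
    - 10.0.1.23:2379
```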
2020-07-31 14:33:40 +02:00
Alexander Kukushkin
b3c69b5fb2 Fix unit tests (#1630)
An exception raised from the `logging` module was breaking pytest.
2020-07-30 10:33:31 +02:00
Alexander Kukushkin
bfbc4860d5 PoC: Patroni on pure RAFT (#375)
* A new node can join the cluster dynamically and become a part of the consensus.
* It is also possible to join only the Patroni cluster (without adding the node to the raft); just comment out or remove `raft.self_addr` for that.
* When a node joins the cluster, it uses the values from `raft.partner_addrs` only for initial discovery.
* It is possible to run Patroni and Postgres on two nodes plus one node with `patroni_raft_controller` (without Patroni and Postgres). In such a setup one can temporarily lose one node without affecting the primary.
2020-07-29 15:34:44 +02:00
Robert Edström
c42d507b82 Add consul service_tags configuration field (#1625)
This is useful for dynamic service discovery, for example by load balancers.
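
A configuration sketch with made-up tag values:

```
consul:
  host: 127.0.0.1:8500
  register_service: true
  service_tags:        # static tags added to the registered service
    - primary-dc
    - pg-cluster-demo
```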
2020-07-28 12:07:24 +02:00
Alexander Kukushkin
59dae9b1bb A few little bug-fixes (#1623)
1. Put the `*` character into pgpass if the actual value is empty.
2. Re-raise fatal exceptions from the HA loop (we need to exit if, for example, the cluster initialization failed).

Close https://github.com/zalando/patroni/issues/1617
2020-07-28 08:37:04 +02:00
Victor Sudakov
20bc5ed684 Update README.rst (#1622)
[ci skip]
2020-07-28 08:36:02 +02:00
Robert Edström
5cc35ec855 Documented required Consul policy (#1626)
[ci skip]
Close #1615
2020-07-28 08:34:37 +02:00
Alexander Kukushkin
38f5f464cb Compatibility fixes (#1616)
* socket.TCP_KEEP* constants are not defined on Windows, macOS, and *BSD
* add the [secure] extra to urllib3 for better handling of IP/host checks
2020-07-21 08:13:44 +02:00
Alexander Kukushkin
a68692a3e4 Get rid of kubernetes python module (#1586)
The official python kubernetes client contains a lot of auto-generated code and is therefore very heavy, but we need only a small fraction of it.
The naive implementation, which covers all the API methods we use, takes about 250 LoC, and about half of it is responsible for handling configuration files.

Disadvantage: if somebody was using `patronictl` outside of the pod (on their own machine), it might not work anymore (depending on the environment).
2020-07-17 08:31:58 +02:00
Alexander Kukushkin
ba15806592 Bugfix: Patroni falsely reported postgres as running (#1612)
If Postgres is not running due to a crash, a panic, or because it was stopped by some external force, we need to change the internal state.
So far we always relied on the state being set in the `Postgresql.start()` method. Unfortunately, this method is not reachable in all cases. For example, `Postgresql.follow()` could raise an exception when calling `write_recovery_conf()` if there is not enough disk space to write the file.

In order to solve it, we explicitly set the state when we detect that Postgres is not alive: in `pause` we set it to `stopped`, otherwise to `crashed` if it was `running` or `starting` before.
2020-07-15 10:38:26 +02:00
ksarabu1
8a62999eaa replica & async rest API health check enhancement (#1599)
- ``GET /replica?lag=<max-lag>``: replica check endpoint.
- ``GET /asynchronous?lag=<max-lag>`` or ``GET /async?lag=<max-lag>``: asynchronous standby check endpoint.

Checks replication latency and returns status code **200** only when the latency is below the specified value. For performance reasons, the leader_optime key from the DCS is used as the leader WAL position when computing the latency on the replica. Please note that the value in leader_optime might be a couple of seconds old (depending on loop_wait).

Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
2020-07-15 10:36:48 +02:00
Alexander Kukushkin
04b9fb9dd4 Make sure cached last_leader_operation is up-to-date on replicas (#1600)
Patroni caches the cluster view in the DCS object, because not all operations require the most up-to-date values. The cached version is valid for TTL seconds. So far this has worked quite well; the only known problem was that `last_leader_operation` for some DCS implementations was not very up-to-date:

* Etcd: since the `/optime/leader` key is updated right after the `/leader` key, usually all replicas get the value from the previous HA loop; therefore the value is somewhere between `loop_wait` and `loop_wait*2` old. We improve it with an artificial 10ms sleep after receiving the watch notification from the `compareAndSwap` operation on the leader key. This usually gives the primary enough time to update `/optime/leader`. On average that makes the cached version `loop_wait/2` old.

* ZooKeeper: Patroni itself is not very interested in the most up-to-date values of the member and leader/optime ZNodes. In case of a leader race it just reads everything from ZooKeeper, but during normal operation it relies on the cache. In order to see the recent value, replicas set a watch on the `leader/optime` ZNode and re-read it after it has been updated by the primary. On average that makes the cached version `loop_wait/2` old.

* Kubernetes: last_leader_operation is stored in the same object as the leader key itself; therefore the update is atomic, and we always see the latest version. That makes the cached version `loop_wait/2` old on average.

* Consul: the HA loops on the primary and the replicas are not synchronized; therefore, at the moment we read the cluster state from the Consul KV, we see a last_leader_operation value that is between 0 and loop_wait seconds old. On average that makes the cached version `loop_wait` old. Unfortunately, we can't make it much better without performing periodic updates from Consul, which might have negative side effects.

Since `optime/leader` is updated at most once per HA loop cycle, the value stored in the DCS is usually `loop_wait/2` old on average. For the majority of DCS implementations we can promise that the cached version in Patroni will match the value in the DCS most of the time, so there is no need to make additional requests. The only exception is Consul, but we could probably just document it, so that anyone relying on the last_leader_operation value to check replication lag can adjust thresholds accordingly.

Will help to implement #1599
2020-07-15 10:31:32 +02:00
Alexander Kukushkin
db8c634db3 Create readiness and liveness endpoints (#1590)
They can be useful to eliminate "unhealthy" pods from subset addresses when a K8s service with label selectors is used.
Real-life example: the node where the primary was running has failed and is being shut down, and Patroni can't update (remove) the role label.
Therefore on OpenShift the leader service will have two pods assigned, one of them being the failed primary.
With a readiness probe defined, the failed primary pod will be excluded from the list.
2020-07-10 14:08:39 +02:00
Alexander Kukushkin
7a13579973 Refactor tcp_keepalive code (#1578)
* Move it into a separate function
* set keepalive on the REST API socket

The function will be also used in #1162
2020-07-08 14:04:59 +02:00
Alexander Kukushkin
8eb01c77b6 Don't fire on_reload when promoting to standby_leader on 13+ (#1552)
PostgreSQL 13 finally introduced the possibility to change `primary_conninfo` without a restart; a reload is enough. But in case the role changes from `replica` to `standby_leader`, we want to call only the `on_role_change` callback and skip `on_reload`, because they would duplicate each other.
2020-06-29 14:49:25 +02:00
Alexander Kukushkin
cbff544b9c Implement patronictl flush switchover (#1554)
It includes implementing the `DELETE /switchover` REST API endpoint.

Close https://github.com/zalando/patroni/issues/1376
2020-06-25 16:27:57 +02:00
Alexander Kukushkin
7f343c2c57 Try to fetch missing WAL if pg_rewind complains about it (#1561)
It can happen that the WAL segment required by `pg_rewind` doesn't exist in `pg_wal` anymore, and therefore `pg_rewind` can't find the checkpoint location before the diverging point.
Starting with PostgreSQL 13, `pg_rewind` can use the `restore_command` to fetch missing WALs, but we can do better than that.
On older PostgreSQL versions, Patroni will parse the stdout and stderr of the failed rewind attempt, try to fetch the missing WAL by calling the `restore_command`, and repeat the attempt.
2020-06-25 16:24:21 +02:00
Alexander Kukushkin
e00acdf6df Fix possible race conditions in update_leader (#1596)
1. Between the get_cluster() and update_leader() calls, the K8s leader object might be updated from outside, and then the resource version will not match (error code 409). Since we are watching for all changes, the ObjectCache will likely have the most up-to-date version, and we take advantage of that. There is still a chance to hit a race condition, but it is smaller than before. Actually, other DCS implementations are free of this issue: Etcd updates are based on value comparison, while ZooKeeper and Consul rely on the session mechanism.
2. If the update still fails, recheck the resource version of the leader object and that the current node is still the leader there, and repeat the call.

P.S. The leader race still relies on the version of the leader object as it was during the get_cluster() call.

In addition to that, fixed the handling of K8s API errors: we should retry on 500, not on 502.
Close https://github.com/zalando/patroni/issues/1589
2020-06-22 16:07:52 +02:00
Alexander Kukushkin
ee4bf79c11 Populate references and nodename in subsets addresses (#1591)
It makes the subsets look exactly as if they were populated by a service with a label selector, and it will help with https://github.com/zalando/postgres-operator/issues/340#issuecomment-587001109

Unit tests are refactored to minimize the amount of mocks.
2020-06-16 12:56:20 +02:00