2020 Commits

Author SHA1 Message Date
Alexander Kukushkin
a5ff38a034 Improve behave tests (#1313)
Hopefully, make them less flaky
2019-12-02 10:33:44 +01:00
Alexander Kukushkin
a3be2958a7 Tidy up setup.py (#1308)
1. Stop using `setuptools.command.test`, it is being deprecated
2. Remove junit integration
2019-11-27 15:56:50 +01:00
Alexander Kukushkin
85341ff78b Use passfile in primary_conninfo only on 10+ (#1301)
Somehow passfile happened to work on Ubuntu with older postgres versions, but that is not always the case on other distros.
2019-11-27 14:58:23 +01:00
Alexander Kukushkin
7793887ea7 Fix tests on windows (#1303)
and disable junit, it produces a deprecation warning
2019-11-27 14:57:33 +01:00
Alexander Kukushkin
525a26fab5 Solve the problem of cyclic imports (#1306)
Move `PATRONI_ENV_PREFIX` into the `patroni/__init__.py`
2019-11-26 17:03:34 +01:00
Alexander Kukushkin
90a4208390 Get rid of requests module (#1296)
It wasn't used for anything critical anyway, so it doesn't make a lot of sense to keep it as an explicit dependency.
2019-11-22 15:31:55 +01:00
Alexander Kukushkin
f03b85f5b0 Fix calculation of wal_buffers (#1297)
Close https://github.com/zalando/patroni/issues/1288
2019-11-22 15:30:28 +01:00
Alexander Kukushkin
474ac3cc11 Move multiprocessing.set_start_method() back to main (#1295)
It is not possible to call it from a forked process
2019-11-21 17:28:14 +01:00
Alexander Kukushkin
412c720d3a Avoid importing all DCS modules (#1286)
We will try to import only the module which has a configuration section.
I.e. if there is only zookeeper section in the config, Patroni will try to import only `patroni.dcs.zookeeper` and skip `etcd`, `consul`, and `kubernetes`.
This approach has two benefits:
1. When some dependencies were not installed, Patroni logged INFO messages such as `Failed to import smth`, which looked scary.
2. It reduces memory usage, because sometimes dependencies are heavy.
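The selective import described above could be sketched roughly as follows. This is a minimal illustration, not Patroni's actual code: the helper names `configured_dcs_names` and `import_configured_dcs` and the `KNOWN_DCS` tuple are hypothetical, and the config is assumed to be a plain dict keyed by section name.

```python
import importlib

# Hypothetical list of DCS section names; the message above names these four.
KNOWN_DCS = ("etcd", "consul", "zookeeper", "kubernetes")


def configured_dcs_names(config, known=KNOWN_DCS):
    """Return the DCS names that have a section in the config, in a stable order."""
    return [name for name in known if name in config]


def import_configured_dcs(config, package="patroni.dcs"):
    """Import only the DCS modules whose section is present in the config."""
    modules = {}
    for name in configured_dcs_names(config):
        try:
            modules[name] = importlib.import_module(package + "." + name)
        except ImportError:
            # A configured backend with missing dependencies is skipped here;
            # backends without a config section are never attempted at all.
            continue
    return modules
```

With only a `zookeeper` section in the config, `configured_dcs_names` returns just `["zookeeper"]`, so the other backends and their dependencies are never loaded.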
2019-11-21 14:39:37 +01:00
Igor Yanchenko
cd96a10dd2 Provide an example of using multiple etcd endpoints in yaml files (#1289)
2019-11-21 13:29:05 +01:00
Alexander Kukushkin
183adb7848 Housekeeping (#1284)
* Implement proper tests for `multiprocessing.set_start_method()`
* Exclude some watchdog code from coverage (it is used only for behave tests)
* properly use os.path.join for windows compatibility
* import DCS modules in `features/environment.py` on demand. This allows running behave tests against a chosen DCS without installing all dependencies.
* remove some unused behave code
* fix some minor issues in the dcs.kubernetes module
2019-11-21 13:27:55 +01:00
Igor Yanchenko
8b26733f6a Strip extra spaces if etcd.hosts is written as a comma-separated string (#1290)
Patroni was failing to connect to the etcd server if the config looked like the following:
```yaml
etcd:
    hosts: host1:port1, host2:port2
```
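A minimal sketch of the kind of normalization this implies, assuming the hosts value may be either a list or a comma-separated string. The function name `parse_hosts` is hypothetical and is not taken from Patroni.

```python
def parse_hosts(hosts):
    """Accept a list of hosts or a comma-separated string and return clean entries."""
    if isinstance(hosts, str):
        hosts = hosts.split(",")
    # strip() removes the space left behind after each comma; empty items are dropped
    return [h.strip() for h in hosts if h.strip()]
```

For the config above, `parse_hosts("host1:port1, host2:port2")` yields two entries without the stray leading space on the second one.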
2019-11-21 10:50:14 +01:00
Alexander Kukushkin
35a2ccf8a8 A couple of small fixes in docs (#1285)
* fix formatting in release notes
* fix patronictl reinit command name
2019-11-21 10:39:28 +01:00
Alexander Kukushkin
2f9a48fae4 Release 1.6.1 (#1281)
* Bump version to 1.6.1
* Update release notes
v1.6.1
2019-11-15 12:48:00 +01:00
Maciej Kowalczyk
efcd05ace2 Use "spawn" multiprocessing start method (#1279)
workaround https://bugs.python.org/issue6721

Fixes #1278
2019-11-15 10:56:18 +01:00
Alexander Kukushkin
66d77697ae Use LIST + WATCH when working with K8s API (#1276)
There is an opinion that LIST requests with labelSelector to K8s API are expensive and Patroni was doing two such requests per HA loop (LIST pods and LIST endpoints/configmaps).
To efficiently detect object changes we will switch to the LIST+WATCH approach.
The initial LIST request populates the ObjectCache and events from the WATCH request update it.

In addition to that, the ObjectCache will be updated after performing the UPDATE operations on the K8s objects. To avoid race conditions, all operations on ObjectCache are performed after comparing the resource_version of the old and the new objects and rejected if the new resource_version value is smaller than the old one.

The disadvantage of such an approach is that it will require keeping three connections to the K8s API from each Patroni Pod (previously it was two).

Yesterday I deployed this feature branch on our biggest K8s cluster, with ~300 Patroni pods.
The CPU Utilization on K8s master nodes immediately dropped from ~20% to ~10% (a factor of two), and the incoming traffic on master nodes dropped ~7-8 times!

Last but not least, we see more or less the same impact on the etcd cluster behind the K8s master nodes: CPU utilization dropped by nearly half and outgoing traffic dropped ~7-8 times.
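The resource_version check on the ObjectCache could look roughly like the sketch below. It is an illustration under stated assumptions: objects are plain dicts with hypothetical `name` and `resource_version` keys, and versions are compared as integers, as the message implies. The class body is not Patroni's implementation.

```python
import threading


class ObjectCache:
    """Toy cache: filled by an initial LIST, then updated by WATCH events and own writes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._objects = {}  # name -> object dict

    def populate(self, objects):
        """Replace the cache content with the result of a LIST request."""
        with self._lock:
            self._objects = {obj["name"]: obj for obj in objects}

    def update(self, new):
        """Store `new` unless the cache already holds a newer version.

        Returns True if the object was stored, False if it was rejected.
        """
        with self._lock:
            old = self._objects.get(new["name"])
            if old is not None and int(new["resource_version"]) < int(old["resource_version"]):
                return False  # stale event or stale write result
            self._objects[new["name"]] = new
            return True

    def get(self, name):
        with self._lock:
            return self._objects.get(name)
```

Because both WATCH events and the results of UPDATE calls go through `update`, a late event carrying an older version cannot overwrite a newer object that is already cached.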
2019-11-14 14:54:57 +01:00
Alexander Kukushkin
c1adbafbc5 Improve documentation (#1244)
* document tags
* move dynamic configuration out of `bootstrap.dcs`
* document REST API endpoints
2019-11-13 16:10:28 +01:00
Alexander Kukushkin
252a1b78ed Make it possible to change use_slots online (#1261)
Previously it required restarting Patroni and removing slots manually
Fixes https://github.com/zalando/patroni/issues/1158
2019-11-11 16:18:53 +01:00
Alexander Kukushkin
5ea73d50ed Make it possible to apply some recovery params without restart (#1260)
Starting from PostgreSQL 12 the following recovery parameters could be changed without restart, but Patroni didn't yet support it:
* archive_cleanup_command
* promote_trigger_file
* recovery_end_command
* recovery_min_apply_delay

In future postgres releases this list will be extended and Patroni will support it automatically.
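As a rough sketch, deciding between a reload and a restart from the list above might look like this. The set contains only the four parameters named in the message; the function name `required_action` and the rule that any other changed parameter needs a restart are assumptions for illustration.

```python
# The four parameters named above as changeable without restart on PostgreSQL 12.
RELOADABLE_RECOVERY_PARAMS = frozenset({
    "archive_cleanup_command",
    "promote_trigger_file",
    "recovery_end_command",
    "recovery_min_apply_delay",
})


def required_action(old, new):
    """Return None, 'reload' or 'restart' for a change of recovery parameters."""
    changed = {key for key in set(old) | set(new) if old.get(key) != new.get(key)}
    if not changed:
        return None
    if changed <= RELOADABLE_RECOVERY_PARAMS:
        return "reload"
    # Assumption of this sketch: anything outside the list still needs a restart.
    return "restart"
```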
2019-11-11 16:18:23 +01:00
Alexander Kukushkin
09a7cf265d Fix 'start failed' issue (#1262)
The start of postgres happens in two stages:
1. First, Patroni waits for the postgres port to be open
2. After that, it waits for postgres to start accepting connections

There is a default timeout of 60 seconds for both stages (in total).

When the port isn't open, pg_isready exits with code=2.
If postgres is rejecting connections due to recovery, exit code=1.

In most cases postgres quickly opens the port and pg_isready starts returning 1, but in rare cases the whole timeout could be spent in stage `1.`
After that, the HA loop is still waiting for postgres to start, but executes only the check from stage `2.`. Since the pg_isready exit code is still 2, Patroni falsely assumed 'start failed' without taking into account that the postmaster process is up and running.
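The decision described above can be illustrated with a small sketch. The exit codes 0, 1 and 2 are the documented pg_isready statuses; the function name `classify_start` and the returned state strings are hypothetical, and the actual fix in Patroni may be structured differently.

```python
def classify_start(exit_code, postmaster_alive):
    """Map a pg_isready exit code to a coarse start state.

    exit_code: 0 = accepting connections, 1 = rejecting connections,
               2 = no response on the port.
    postmaster_alive: whether the postmaster process is known to be running.
    """
    if exit_code == 0:
        return "running"
    if exit_code == 1:
        return "starting"  # port is open, connections rejected (e.g. recovery)
    if exit_code == 2:
        # No response yet. If the postmaster process exists it is still
        # starting up, so this alone must not be reported as a failure.
        return "starting" if postmaster_alive else "start failed"
    return "unknown"
```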

Fixes https://github.com/zalando/patroni/issues/1160
2019-11-11 09:37:06 +01:00
Feike Steenbergen
d2d49907ad Correctly document PATRONI_KUBERNETES_PORTS (#1266)
The previous documentation was wrong and will throw the following error
when used:

        Exception when parsing list {[{"name": "postgresql", "port": 5432}]}

When removing the surrounding braces, the error goes away and the
endpoint is updated with the correct Port name.
2019-11-05 10:09:24 +01:00
Alexander Kukushkin
94b7ff656e Don't give up on retry too early (#1245)
Fixes https://github.com/zalando/patroni/issues/1195
2019-10-31 09:33:16 +01:00
Alexander Kukushkin
29ac77b6e7 Compare all recovery parameters (#1208)
Previously check_recovery_conf() function was only checking whether primary_conninfo has changed and never taking into account all other recovery parameters.

Fixes https://github.com/zalando/patroni/issues/1201
2019-10-30 12:30:09 +01:00
Alexander Kukushkin
9e87b00d36 Kill callback child processes when it is necessary (#1242)
Not doing so makes it hard to implement callbacks in bash and can eventually lead to a situation where two callbacks are running at the same time. If we fail to kill the child process, we still wait for it to finish.

The same problem could happen with custom bootstrap, therefore if we happen to kill the custom bootstrap process we also kill all child subprocesses.

Closes https://github.com/zalando/patroni/issues/1238
2019-10-29 12:44:18 +01:00
Alexander Kukushkin
3f711650a7 Fix compatibility with python 3.4&3.5 (#1248)
Close https://github.com/zalando/patroni/issues/1247
2019-10-25 15:01:49 +02:00
Alexander Kukushkin
2a9ef418d6 Return the real member name when picking the sync standby (#1253)
Previously we returned it in lower case, which prevented such a standby from being promoted due to the name comparison mismatch.
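A minimal sketch of the idea, assuming the candidate name arrives lowercased while cluster members keep their original spelling. The function name `pick_sync_standby` and its arguments are hypothetical.

```python
def pick_sync_standby(member_names, reported_name):
    """Return the member name as the cluster knows it, matching case-insensitively."""
    by_lower = {name.lower(): name for name in member_names}
    # Hand back the real spelling so that later exact comparisons succeed.
    return by_lower.get(reported_name.lower())
```

Returning the stored spelling rather than the lowercased one is what lets a later exact name comparison during promotion succeed.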

Fixes https://github.com/zalando/patroni/issues/1252
2019-10-25 14:53:05 +02:00
Alexander Kukushkin
6fe482a4c8 Avoid calling expensive os.listdir() (#1254)
When the system is under IO stress, `os.listdir()` could take a few seconds (or even minutes) to execute, which badly affects the HA loop of Patroni and could even cause the leader key to disappear from DCS due to the lack of updates.

There is a better and less expensive way to check that the PGDATA is not empty. Instead of doing the `os.listdir` we simply check the presence of the `global/pg_control` file in it.
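The cheaper check could be sketched like this. The function name `data_directory_has_pg_control` is hypothetical; only the `global/pg_control` path comes from the message above.

```python
import os


def data_directory_has_pg_control(data_dir):
    """Check for global/pg_control instead of listing the whole directory."""
    return os.path.isfile(os.path.join(data_dir, "global", "pg_control"))
```

A single lookup of one known path avoids enumerating every entry in PGDATA, which is the part that becomes slow under IO stress.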
2019-10-25 14:52:13 +02:00
Alexander Kukushkin
828585079f Improve workflow when PGDATA is not empty during bootstrap (#1217)
It has recently happened twice that people tried to deploy a new cluster while the postgres data directory was neither empty nor valid. In this case Patroni still created the initialize key in DCS and tried to start postgres.
Now it complains about the non-empty invalid postgres data directory and exits.

Close https://github.com/zalando/patroni/issues/1216
2019-10-25 14:09:44 +02:00
Alexander Kukushkin
0947ac1e43 Fix race condition in postmaster_start_time() (#1243)
When it is not executed from the main thread, we need to create a new cursor object.
2019-10-24 11:23:34 +02:00
Cody Coons
d770c910fd Remove only PATRONI_ prefixed environment variables (#1224)
This solves many problems with running different FDWs.
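A minimal sketch of such filtering, assuming the environment for a child process is built from the current one. The function name `postgres_environment` is hypothetical; the `PATRONI_` prefix is taken from the commit title.

```python
import os


def postgres_environment(environ=None, prefix="PATRONI_"):
    """Copy the environment, dropping only variables that start with the prefix."""
    source = os.environ if environ is None else environ
    return {key: value for key, value in source.items() if not key.startswith(prefix)}
```

Everything else, for example variables that an FDW's client library relies on, is passed through unchanged.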
2019-10-24 08:39:21 +02:00
cobolbaby
732d33812f Add net-tools and iputils-ping to the docker image (#1230)
they might be useful.
2019-10-24 08:36:50 +02:00
Alexander Kukushkin
78a3848e73 Retry on raft internal error (#1241)
Fixes https://github.com/zalando/patroni/issues/1237
2019-10-22 17:20:06 +02:00
Alexander Kukushkin
367d787ff9 Implement /history and /cluster endpoints (#1191)
The /history endpoint shows the content of the `history` key in DCS
The /cluster endpoint shows all cluster members and some service info like pending and scheduled restarts or switchovers.

In addition to that, implement `patronictl history`
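Querying these endpoints could be sketched as below. The base URL is an assumption that has to match the REST API address of the node being queried, and the helper names `endpoint_url` and `fetch_json` are hypothetical. The sketch assumes the endpoints return JSON and makes no claim about the exact response fields.

```python
import json
from urllib.request import urlopen


def endpoint_url(base_url, endpoint):
    """Join a REST API base URL and an endpoint name without doubling slashes."""
    return base_url.rstrip("/") + "/" + endpoint.lstrip("/")


def fetch_json(base_url, endpoint, timeout=5):
    """GET the endpoint and decode the body as JSON (assumed response format)."""
    with urlopen(endpoint_url(base_url, endpoint), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_json("http://localhost:8008", "cluster")` would request `/cluster` from a node assumed to listen on that address.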

Close #586
Close #675
Close #1133
2019-10-22 17:19:02 +02:00
Alexander Kukushkin
f4623c4e8e Build recovery params in a separate method (#1219)
In addition to that, try to protect against the case when some recovery parameters are set in one of the included files by explicitly setting their values to an empty string on postgres 12.

Simplifies https://github.com/zalando/patroni/pull/1208
2019-10-11 20:18:06 +02:00
Alexander Kukushkin
863aed314b Fix race conditions in async actions (#1215)
Specifically, there was a chance that `patronictl reinit --force` could be overwritten by recover, ending up in a situation where Patroni was trying to start postgres while basebackup was still running.
2019-10-11 10:17:02 +02:00
Alexander Kukushkin
b666f5e4ed Refactor Patroni REST API communication (#1197)
* make it possible to use client certificates with REST API
* define a separate PatroniRequest class which handles all communication
* refactor patronictl to use the new class
* make Ha use the new class instead of calling requests.get. The old call did not take certificates and basic-auth into account

Close #898
2019-10-11 10:16:33 +02:00
Alexander Kukushkin
21ed8e2d09 A few small fixes (#1221)
* fix some warnings when running unit-tests
* allow python-kubernetes up to 10.0.1
* python-consul>=0.7.1 is required due to #802
2019-10-11 10:15:22 +02:00
Alexander Kukushkin
c95275665f Functions for better parsing of primary_conninfo and recovery.conf (#1218)
Needed to simplify https://github.com/zalando/patroni/pull/1208
2019-10-10 16:00:55 +02:00
Alexander Kukushkin
3d29cb7e50 Perform pg_ctl reload regardless of config changes (#1204)
Some config files might not be controlled by Patroni. When somebody triggers a reload via the REST API or by sending SIGHUP to the Patroni process, the usual expectation is that postgres will also be reloaded, but this did not happen when there were no changes in the postgresql section of the Patroni config.

For example one might replace ssl_cert_file and ssl_key_file on the filesystem and starting from PostgreSQL 10 it just requires a reload, but Patroni wasn't doing it.

In addition to that, fix the handling of `wal_buffers`. The default value depends on `shared_buffers` and `wal_segment_size`, and therefore Patroni was exposing pending_restart when the new value in the config was explicitly set to -1 (default).
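The dependency on `shared_buffers` and `wal_segment_size` can be illustrated with the rule from the PostgreSQL documentation: with -1, wal_buffers is 1/32 of shared_buffers, but not less than 64kB and not more than one WAL segment. The sketch below assumes the default 8kB block size and works in pages; the function name `default_wal_buffers` is hypothetical and this is not Patroni's code.

```python
BLOCK_SIZE = 8192  # assumption: default PostgreSQL block size


def default_wal_buffers(shared_buffers_pages,
                        wal_segment_size_bytes=16 * 1024 * 1024,
                        block_size=BLOCK_SIZE):
    """Effective wal_buffers in pages when the setting is -1."""
    pages = shared_buffers_pages // 32                # 1/32 of shared_buffers
    minimum = (64 * 1024) // block_size               # not less than 64kB
    maximum = wal_segment_size_bytes // block_size    # not more than one WAL segment
    return max(minimum, min(pages, maximum))
```

Comparing the configured -1 against the computed value rather than against the number postgres reports is one way to avoid a false pending_restart.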

Close https://github.com/zalando/patroni/issues/1198
2019-10-10 14:49:30 +02:00
Alexander Kukushkin
1572c02ced Use passfile in the primary_conninfo instead of password (#1194)
Fixed a few minor issues related to #1134 and #1122
Close https://github.com/zalando/patroni/issues/1185
2019-10-09 18:04:14 +02:00
Alexander Kukushkin
86ee22efab Switch to a streaming watcher (#1189)
Watch requests to the K8s API either stream the data or close the connection on timeout. In either case a second open connection is required, but opening a new connection every 10 seconds is more expensive for both Patroni and the K8s API.

Switching to the streaming model also brings other benefits: we can watch not only the leader object but also the config, and wake up the Patroni main thread if the config was changed.
2019-10-07 15:16:35 +02:00
Alexander Kukushkin
facee0186d Explicitly start logger Thread (#1186)
The PatroniLogger object is instantiated in the Patroni constructor, and down the road there might be a fatal error causing the Patroni process to exit, but the live thread prevents the normal shutdown.
In order to mitigate the issue without losing the ability to use the logging infrastructure, we will switch to QueueLogger only when the thread was explicitly started from the Patroni.run() method.

Continuation of https://github.com/zalando/patroni/pull/1178
2019-10-07 11:00:38 +02:00
Alexander Kukushkin
686b2c5432 Fix memory leak on python 3.7 (#1200)
Close https://github.com/zalando/patroni/issues/1167
2019-10-07 10:55:26 +02:00
wilfriedroset
ee678f61d7 Fix typos in documentation (#1202)
2019-10-07 10:34:43 +02:00
Jecho
a8c32a4032 Fix minor typo in documentation #1212
Close #1211
2019-10-07 10:14:15 +02:00
geokala
178e565fe4 Update cacert documentation for use with REST API (#1190)
Fixes #1188
2019-09-24 13:04:07 +02:00
Alexander Kukushkin
fa7eef3d7c Fix logger shutdown behavior (#1178)
Since it is based on a Thread with daemon set to True, the logger was very likely to shut down too early, which caused some lines not to appear at the destination.

Close https://github.com/zalando/patroni/issues/1173
2019-09-17 12:27:09 +02:00
Jonathan S. Katz
a88704e792 Allow for certificate-based authentication from Patroni PostgreSQL accounts (#1134)
The two principal features this introduces:

1. Allow the Patroni PostgreSQL management accounts (superuser, replication, rewind) to authenticate using certificate-based authentication
2. Allow the user to specify the `sslmode` they wish to connect as.

### References
- [PostgreSQL Certificate Based Authentication](https://www.postgresql.org/docs/current/auth-cert.html)
- [libpq connection parameters](https://www.postgresql.org/docs/current/libpq-connect.html) which are used by psycopg2
- [SSL Modes](https://www.postgresql.org/docs/current/libpq-ssl.html)
2019-09-17 12:14:49 +02:00
anikin-aa
3937a8d4fc Fix status code for GET /replica, when replica is starting (#1152)
Close #772, #1128
2019-08-26 11:18:13 +02:00
Soulou
53d32f1457 Allow lower values for postgresql configuration (#1148)
* Default values have not been changed
* These minimal values still work properly to boot a (small) cluster

Fixes #1142
2019-08-26 10:48:36 +02:00