735 Commits

Author SHA1 Message Date
Alexander Kukushkin
5ea73d50ed Make it possible to apply some recovery params without restart (#1260)
Starting from PostgreSQL 12 the following recovery parameters could be changed without restart, but Patroni didn't yet support it:
* archive_cleanup_command
* promote_trigger_file
* recovery_end_command
* recovery_min_apply_delay

In future postgres releases this list will be extended and Patroni will support it automatically.
2019-11-11 16:18:23 +01:00
Alexander Kukushkin
94b7ff656e Don't give up on retry too early (#1245)
Fixes https://github.com/zalando/patroni/issues/1195
2019-10-31 09:33:16 +01:00
Alexander Kukushkin
29ac77b6e7 Compare all recovery parameters (#1208)
Previously check_recovery_conf() function was only checking whether primary_conninfo has changed and never taking into account all other recovery parameters.

Fixes https://github.com/zalando/patroni/issues/1201
2019-10-30 12:30:09 +01:00
Alexander Kukushkin
9e87b00d36 Kill callback child processes when it is necessary (#1242)
Not doing so makes it hard to implement callbacks in bash and eventually can lead to the situation when two callbacks are running at the same time. In case if we failed to kill the child process we will still wait for it to finish.

The same problem could happen with custom bootstrap, therefore if we happen to kill the custom bootstrap process we also kill all child subprocesses.

Closes https://github.com/zalando/patroni/issues/1238
2019-10-29 12:44:18 +01:00
Alexander Kukushkin
3f711650a7 Fix compatibility with python 3.4&3.5 (#1248)
Close https://github.com/zalando/patroni/issues/1247
2019-10-25 15:01:49 +02:00
Alexander Kukushkin
828585079f Improve workflow when PGDATA is not empty during bootstrap (#1217)
Recently it has happened two times when people tried to deploy the new cluster but postgres data directory wasn't empty and also wasn't valid. In this case Patroni was still creating initialize key in DCS and trying to start the postgres up.
Now it will complain about non-empty invalid postgres data directory and exit.

Close https://github.com/zalando/patroni/issues/1216
2019-10-25 14:09:44 +02:00
Alexander Kukushkin
0947ac1e43 Fix race condition in postmaster_start_time() (#1243)
when it is executed not from the main thread we need to create a new cursor object.
2019-10-24 11:23:34 +02:00
Alexander Kukushkin
367d787ff9 Implement /history and /cluster endpoints (#1191)
The /history endpoint shows the content of the `history` key in DCS
The /cluster endpoint show all cluster members and some service info like pending and scheduled restarts or switchovers.

In addition to that implement `patronictl history`

Close #586
Close #675
Close #1133
2019-10-22 17:19:02 +02:00
Alexander Kukushkin
f4623c4e8e Build recovery params in a separate method (#1219)
In addition to that try to protect from the case when some recovery parameters are set in one of included files by explicitly setting their value to an empty string on postgres 12.

Simplifies https://github.com/zalando/patroni/pull/1208
2019-10-11 20:18:06 +02:00
Alexander Kukushkin
863aed314b Fix race conditions in async actions (#1215)
Specifically, there was a chance that `patronictl reinit --force` was overwritten by recover and we end up in a situation when Patroni was trying to start the postgres while basebackup still running.
2019-10-11 10:17:02 +02:00
Alexander Kukushkin
b666f5e4ed Refactor Patroni REST API communication (#1197)
* make it possible to use client certificates with REST API
* define a separate PatroniRequest class which handles all communication
* refactor patronictl to use the new class
* make Ha to use the new class instead of calling requests.get. The old call wasn't taking into account certificates and basic-auth

Close #898
2019-10-11 10:16:33 +02:00
Alexander Kukushkin
21ed8e2d09 A few small fixes (#1221)
* fix some warnings when running unit-tests
* allow python-kubernetes up to 10.0.1
* python-consul>=0.7.1 is required due to #802
2019-10-11 10:15:22 +02:00
Alexander Kukushkin
3d29cb7e50 Perform pg_ctl reload regardless of config changes (#1204)
It is possible that some config files are not controlled by Patroni and when somebody is doing reload via REST API or by sending SIGHUP to Patroni process the usual expectation is that postgres will also be reloaded, but it didn't happen when there were no changes in the postgresql section of Patroni config.

For example one might replace ssl_cert_file and ssl_key_file on the filesystem and starting from PostgreSQL 10 it just requires a reload, but Patroni wasn't doing it.

In addition to that fix the issue with handling of `wal_buffers`. The default value depends on `shared_buffers` and `wal_segment_size` and therefore Patroni was exposing pending_restart when the new value in the config was explicitly set to -1 (default).

Close https://github.com/zalando/patroni/issues/1198
2019-10-10 14:49:30 +02:00
Alexander Kukushkin
1572c02ced Use passfile in the primary_conninfo instead of password (#1194)
Fixed a few minor issues related to the #1134 and #1122
Close https://github.com/zalando/patroni/issues/1185
2019-10-09 18:04:14 +02:00
Alexander Kukushkin
86ee22efab Switch to a streaming watcher (#1189)
Watch requests to K8s API either streaming the data or close connection by timeout. In any case it requires a second connection open, but opening a new connection every 10 seconds is more expensive for both, Patroni and K8s API.

Switching to the streaming model also brings other benefits: we can watch not only on leader object, but also on config and wake up Patroni main thread if the config was changed.
2019-10-07 15:16:35 +02:00
Alexander Kukushkin
facee0186d Explicitly start logger Thread (#1186)
The PatroniLogger object is instantiated in the Patroni constructor and down the road there might be a fatal error causing Patroni process to exit, but live thread prevents the normal shutdown.
In order to mitigate the issue and don't loose ability to use the logging infrastructure we will switch to QueueLogger only when the thread was explicitly started from the Patroni.run() method.

Continuation of https://github.com/zalando/patroni/pull/1178
2019-10-07 11:00:38 +02:00
Alexander Kukushkin
fa7eef3d7c Fix logger shutdown behavior (#1178)
Since it is based on Thread with daemon set to True, the shutdown of logger was very likely to happen too early, what was causing some lines not to appear at the destination.

Close https://github.com/zalando/patroni/issues/1173
2019-09-17 12:27:09 +02:00
Alexander Kukushkin
0a1d9b0a25 Get rid from distutils module dependency (#1146)
We are using only one function from there, `find_executable()` and it is better to implement a similar function in Patroni rather than add `distutils` module into requirements.txt
2019-08-26 09:38:47 +02:00
Alexander Kukushkin
278bf9852b Release 1.6.0 (#1131)
* Implement missing tests and do a few minor fixes
* Bump version to 1.6.0
* Update release notes
2019-08-05 15:08:04 +02:00
Alexander Kukushkin
0b1b1e3b54 Compatibility with postgresql 12 (#1068)
* use `SHOW primary_conninfo` instead of parsing config file on pg12
* strip out standby and recovery parameters from postgresql.auto.conf before starting the postgres 12

Patroni config remains backward compatible.
Despite for example `restore_command` converted to a GUC starting from postgresql 12, in the Patroni configuration you can still keep it in the `postgresql.recovery_conf` section.
If you put it into `postgresql.parameters.restore_command`, that will also work, but it is important not to mix both ways:
```yaml
# is OK
postgresql:
  parameters:
    restore_command: my_restore_command
    archive_cleanup_command: my_archive_cleanup_command

# is OK
postgresql:
  recovery_conf:
    restore_command: my_restore_command
    archive_cleanup_command: my_archive_cleanup_command

# is NOT ok
postgresql:
  parameters:
    restore_command: my_restore_command
  recovery_conf:
    archive_cleanup_command: my_archive_cleanup_command
```
2019-08-02 16:00:55 +02:00
Alexander Kukushkin
4a24b79b73 IPv6 support (#1122)
fixes https://github.com/zalando/patroni/issues/1121
2019-08-02 11:34:29 +02:00
Rafia Sabih
5cc3afc037 Enhance dialogues for scheduled switchover and restart (#1119)
Enhance dialogue for switchover and restart

In case of schedule switchover or restart, mention time if any, when confirming.
2019-08-02 11:21:26 +02:00
Alexander Kukushkin
1a6db4f5af Reverse logic around checkpoint_after_promote (#1084)
It will be set to false in the JSON only until the checkpoint actually happened.

The next improvement of bba9066315
2019-06-17 10:42:31 +02:00
Alexander Kukushkin
fad6d26a3a Small refactoring of postgresql/bootstrap (#1086)
the main purpose of this PR is simplifying #1068

It is mostly necessary for future support of pg12, where there will be no recovery.conf anymore, but `keep_existing_recovery_conf` parameter still needs to be supported due to backward compatibility.
2019-06-17 10:41:13 +02:00
Alexander Kukushkin
37f03790cc Implement two-step logging (#1080)
A few times we observed that Patroni HA loop was blocked for a few minutes due to not being able to write logs to stderr. This is a very rare condition which we hit so far only on k8s. This commit makes Patroni resilient to such kind of problems. All log messages first are written into the in-memory queue and later they are asynchronously flushed into the stderr or file from a separate thread.

The maximum queue size is configurable and the default value is 1000. This should be enough to keep more than one hour of log messages with default settings and when Patroni cluster operates normally (without big issues).

In case if we hit the maximum size of the queue further logs will be discarded until the queue size will be reduced. The number of discarded messages will be reported into the log later.

In addition to that, the number of non-flushed and discarded messages (if there are any), will be reported via Patroni REST API as:
```json
"logger_queue_size": X,
"logger_records_lost": Y`
```
2019-06-13 14:18:49 +02:00
wilfriedroset
2384d9e735 Add API route /health (#1079)
close #119
2019-06-11 15:22:52 +02:00
Alexander Kukushkin
a4bd6a9b4b Refactor postgresql class (#1060)
* Convert postgresql.py into a package
* Factor out cancellable process into a separate class
* Factor out connection handler into a separate class
* Move postmaster into postgresql package
* Factor out pg_rewind into a separate class
* Factor out bootstrap into a separate class
* Factor out slots handler into a separate class
* Factor out postgresql config handler into a separate class
* Move callback_executor into postgresql package

This is just a careful refactoring, without code changes.
2019-05-21 16:02:47 +02:00
Alexander Kukushkin
e54dfa508d Consider sync node as a healthy even when the former leader is ahead (#1059)
Fixes https://github.com/zalando/patroni/issues/1054
2019-05-13 16:32:53 +02:00
Alexander Kukushkin
4b48653d09 More standby cluster bugfixes (#1053)
1. use the default port is 5432 when only standby_cluster.host is defined
2. check that standby_cluster replica can be bootstrapped without connection to the standby_cluster leader against `create_replica_methods` defined in the `standby_cluster` config instead of the `postgresql` section.
3. Don't fallback to the create_replica_methods defined in the `postgresql` section when bootstrapping a member of the standby cluster.
4. Make sure we specify the database when connecting to the leader.
2019-05-13 14:19:22 +02:00
Alexander Kukushkin
c7db134a20 Make pyscopg2 version parsing more robust (#1052)
non-released version can have unparsable suffixes, for example 2.8.3.dev0
2019-05-13 14:18:56 +02:00
Alexander Kukushkin
bba9066315 Make it possible to run pg_rewind without superuser on pg11+ (#1035)
* expose the current patroni version in DCS
* expose `checkpoint_after_promote` flag in DCS as an indicator that pg_rewind could be safely executed
* other nodes will wait until this flag is set instead of connecting as superuser and issuing the CHECKPOINT
* define `postgresql.authention.rewind` with credentials for pg_rewind in patroni configuration files.
* create user for pg_rewind if postgres is 11+
* grant execute on functions required for pg_rewind to rewind user
2019-05-02 14:07:26 +02:00
Alexander Kukushkin
f0b784fe7f Manage pg_ident.conf with Patroni (#1037)
This functionality works similarly to the `pg_hba`:
If the `postgresql.pg_ident` is defined in the config file or DCS, Patroni will write its value to pg_ident.conf, however, if `postgresql.parameters.ident_file` is defined, Patroni will assume that pg_ident is managed from outside and not update the file.
2019-04-23 16:16:53 +02:00
Julien Riou
663026c34c Use SSLContext to wrap REST API socket (#1039)
Using `ssl.wrap_socket` is deprecated and was still allowing soon-to-be-deprecated protocols like TLS 1.1.
Now using `SSLContext.create_default_context()` to produce a secure SSL context to wrap the REST API server's socket.
2019-04-23 11:23:22 +02:00
Alexander Kukushkin
51b085a76d Don't wait until the previous callback finish is kill failed (#1036)
Such wait was happening in the main thread and blocking HA loop.
After all the executor thread was doing absolutely the same.
2019-04-15 15:49:06 +02:00
Alexander Kukushkin
7c0c9599fc Remove psycopg2 from requirements (#1023)
Recently released psycopg2 split into two different packages, psycopg2, and psycopg2-binary which could be installed at the same time into the same place on the filesystem. In order to decrease dependency hell problem, we let a user choose how to install psycopg2. There are a few options available and it is reflected in the documentation.

This PR also changes the following behavior:
* `pip install patroni` will fail if psycopg2 is not installed
* Patroni will check psycopg2 upon start and fail if it can't be found or outdated.

Closes https://github.com/zalando/patroni/issues/1021
2019-04-15 14:30:16 +02:00
Pavlo Golub
b53a29c022 Fix unit-tests for Windows (#1014)
Closes #1013
2019-04-02 13:58:17 +02:00
Alexander Kukushkin
e38fe78b56 Fix callbacks behavior (mostly for standby cluster) (#998)
First of all, this patch changes the behavior of `on_start`/`on_restart` callbacks, they will be called only when postgres is started or restarted without role changes. In case if the member is promoted or demoted only the `on_role_change` callback will be executed. `on_role_change` was never called for standby leader, only `on_start`/`on_restart` and with a wrong role argument.
Before that `on_role_change` was never called for standby leader, only `on_start`/`on_restart` and with a wrong role argument.

In addition to that, the REST API will return standby_leader role for the leader of the standby cluster.

Closes https://github.com/zalando/patroni/issues/988
2019-03-29 10:28:07 +01:00
Alexander Kukushkin
680444ae13 Reduce lock time taken by dcs.get_cluster() (#989)
`dcs.cluster` and `dcs.get_cluster()` are using the same lock resource and therefore when get_cluster call is slow due to the slowness of DCS it was also affecting the `dcs.cluster` call, which in return was making health-check requests slow.
2019-03-12 22:37:11 +01:00
Alexander Kukushkin
92720882aa Reset is_leader flag for every removal of leader key (#990)
This is the next improvement of #777
2019-03-12 22:10:46 +01:00
Alexander Kukushkin
13c88e8b7a Replace self-execute with multiprocessing.Process (#994)
In addition to that transfer postmaster pid to Patroni process with the help of multiprocessing.Pipe instead of using stdin-stdout pipes.

Closes https://github.com/zalando/patroni/issues/992
2019-03-12 10:40:37 +01:00
Alexander Kukushkin
4a4258fc3f Mock external resources (#995)
unit tests should not accidentally hit running Postgres, DCS or filesystem unless we want it explicitly.
2019-03-12 10:39:42 +01:00
Alexander Kukushkin
c64d51f79c Better support for static etcd cluster (#986)
if the `etcd.use_proxies` is set to true, Patroni will stick to the list of hosts specified in the `etcd.hosts` and avoid doing topology discovery. Such mode might be useful when you know that you connect to the etcd cluster via the set of proxies or when th etcd cluster has static topology.
2019-03-07 11:36:36 +01:00
Alexander Kukushkin
e670122c80 Use create_replica_methods from standby_cluster for replica bootstrap (#981)
It might happen that the standby cluster is configured to be created and replay WAL files from the different source than when it is running not in standby mode.  This is necessary to avoid writing WAL files and backups into the old place after promotion.

The easiest way to achieve such behavior is passing RemoteMember object to `Postgresql.clone` method instead of the usual Member object.
2019-02-21 11:37:50 +01:00
Alexander Kukushkin
0c516de147 Create headless service associated with $SCOPE-config endpoint (#958)
if there is no service defined k8s assumes that endpoint is orphaned and removes it.
Patroni tries to create the service only in case if use_endpoints is enabled if the following cases:
1. Upon start
2. When it tries to (re-)create the config endpoint

If for some reason creation of the service has failed, Patroni will retry it on every cycle of HA loop. Usually it fails due to lack of permissions and if you don't want to give such permissions to the service account used by Patroni, you can create the service explicitly in the deployment manifest.
2019-02-15 13:35:04 +01:00
Alexander Kukushkin
739329b590 Make it possible to automatically reinit the former master (#948)
If the pg_rewind is disabled or can't be used, the former master could fail to start as a new replica due to diverged timelines. In this case, the only way to fix it is wiping the data directory and reinitializing.

So far Patroni was able to remove the data directory only after failed attempt to run pg_rewind. This commit fixes it.
If the `postgresql.remove_data_directory_on_diverged_timelines` is set, Patroni will wipe the data directory and reinitialize the former master automatically.

Fixes: https://github.com/zalando/patroni/issues/941
2019-01-30 12:37:21 +01:00
Alexander Kukushkin
2c128520cf Python34 compatibility (#933)
and some other minor fixes.

Closes https://github.com/zalando/patroni/issues/932
2019-01-16 14:40:05 +01:00
Alexander Kukushkin
381a5b80d2 Release 1.5.4 (#931)
* Bump version
* Update release notes
* Make it possible to configure registration of Service in Consul via env variables
2019-01-15 12:14:19 +01:00
Alexander Kukushkin
71dae6a905 Optionally consider node not healthy if it is not on the latest timeline (#892)
The latest timeline is calculated from the `/history` key in DCS. In case there is no such key or it contains some garbage we consider the node healthy.
Closes https://github.com/zalando/patroni/issues/890
2019-01-15 11:16:30 +01:00
Alexander Kukushkin
e080ded44b Make logging configurable via YAML file (#927)
It allows changing logging settings in runtime by updating config and doing reload or sending `SIGHUP` to the Patroni process.
Important! Environment configuration names related to logging were renamed and documentation accordingly updated. For compatibility reasons Patroni still accepts `PATRONI_LOGLEVEL` and `PATRONI_FORMAT`, but some other variables related to logging, which were introduced only
recently (between releases), will stop working. I think it is ok, since we didn't release the new version yet and therefore it is very unlikely that somebody is using them except authors of corresponding PRs.

Example of log section in the config file:
```yaml
log:
  dir: /where/to/write/patroni/logs  # if not specified, write logs to stderr
  file_size: 50000000  # 50MB
  file_num: 10  # keep history of 10 files
  dateformat: '%Y-%m-%d %H:%M:%S'
  loggers:  # increase log verbosity for etcd.client and urllib3
    etcd.client: DEBUG
    urllib3: DEBUG
```
2019-01-15 08:42:13 +01:00
Alexander Kukushkin
994863c18d Refactor wait_for_user_backends_to_close method (#917)
1. Log only debug level messages on any kind of error
2. Update regexp for matching postgres aux processes to make it compatible with postgres 11

Fixes https://github.com/zalando/patroni/issues/914
2019-01-14 14:55:45 +01:00