Add a field to the API to tell whether a master exists from Patroni's point of view.
This can be useful when you have an alert based on Auto Scaling Groups: the ASG decides to shut down the current master and spin up a new instance, but the shutdown of the current master gets stuck. In this situation the current master is no longer part of the ASG, yet Patroni and Postgres are still alive on the instance, which means no replica will be promoted yet. This leads to a false alert saying that your cluster doesn't have any master node.
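As a rough illustration, an alerting check could consume such a field like this. The field name `cluster_unlocked`, the endpoint and the port are assumptions for the sketch, not something this change guarantees:

```python
import requests

# Hedged sketch: ask Patroni whether any member currently holds the leader key.
# 'cluster_unlocked' is an assumed field name; db-node:8008 is an assumed address.
status = requests.get('http://db-node:8008/', timeout=2).json()
if status.get('cluster_unlocked'):
    print('ALERT: no member of the cluster holds the leader key')
else:
    print('Patroni still sees a master; the ASG-based alert would be a false positive')
```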
This adds INFO log messages that clearly state whether configuration values were seen as changed by Patroni after a SIGHUP/reload and warrant reloading (or whether nothing was changed and no reloading is necessary).
This ended up being a lot simpler than I had imagined once I found postgresql.py:reload_config().
I added a log line in config.py:reload_local_configuration(), since that function short-circuits the process early if the local config hasn't changed. The final determination of whether values have changed and need reloading happens in postgresql.py:reload_config().
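A rough sketch of the intent, not the actual Patroni code (names and structure are simplified for illustration):

```python
import logging

logger = logging.getLogger(__name__)

def reload_local_configuration(old_config, load_config_file):
    """Return the new local configuration if it changed, otherwise None (sketch only)."""
    new_config = load_config_file()
    if new_config == old_config:
        # Short-circuit: nothing changed locally, so nothing needs reloading.
        logger.info('No local configuration items changed, nothing to reload')
        return None
    logger.info('Local configuration changed, reloading')
    return new_config
```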
If Patroni gets partitioned, it starts receiving stale information from the DCS.
We can't use this information to determine that we still hold the leader key.
Instead, we record in the Ha object the actual result of acquiring/updating the leader lock and report ourselves as the leader only if that operation was successful.
P.S. Despite responding with 200 on `GET /master`, Postgres was still running read-only.
In really rare cases it was causing the following behavior:
```
2018-07-31 10:35:30,302 INFO: starting as a secondary
2018-07-31 10:35:30,309 INFO: Lock owner: postgresql0; I am postgresql1
2018-07-31 10:35:30,310 INFO: Demoting master during restarting after failure
2018-07-31 10:35:30,381 INFO: postmaster pid=17709
2018-07-31 10:35:30,386 INFO: lost leader lock during restarting after failure
2018-07-31 10:35:30,388 ERROR: Exception during CHECKPOINT
```
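A hedged sketch of the approach (names are simplified; only the idea of recording the acquire/update result in the Ha object comes from the change itself):

```python
class Ha:
    """Hedged sketch; the real class lives in ha.py and is far more involved."""

    def __init__(self, dcs):
        self.dcs = dcs
        self._is_leader = False  # outcome of our last write to the leader key

    def update_lock(self):
        # Record whether the write against the DCS actually succeeded.
        self._is_leader = self.dcs.update_leader()
        return self._is_leader

    def has_lock(self):
        # The cluster view read from a partitioned DCS may be stale;
        # claim leadership only if our own lock operation succeeded.
        return self._is_leader
```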
* `async` is a keyword in Python 3.7:
```
Setting up patroni (1.4.4-1) ...
  File "/usr/lib/python3/dist-packages/patroni/ha.py", line 610
    'offline': dict(stop='fast', checkpoint=False, release=False, offline=True, async=False),
                                                                                ^
SyntaxError: invalid syntax
```
Fix #750 by replacing the dict member "async" with "async_req".
* requirements.txt: update to a new kubernetes package version compatible with Python 3.7
'patronictl remove' deletes the cluster configuration (stored either in configmaps or endpoints) and cannot be run from the postgres pod without 'delete' on those objects being granted to the pod service account.
Add an EnvironmentFile directive to read in a configuration file with environment variables. The "-" prefix means the unit can still start if the file doesn't exist.
This allows users to keep sensitive information like the SUPERUSER/REPLICATION passwords in that environment file, separate from a YAML config file that might be deployed from source control.
Patroni relies on params to determine the timeout and the number of retries when executing API requests to Consul. Starting from v1.1.0, python-consul changed its internal API and started
using a `list` instead of a `dict` to pass query parameters. This change broke the "watch" functionality.
Fixes https://github.com/zalando/patroni/issues/742 and
https://github.com/zalando/patroni/issues/734
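A hypothetical compatibility helper to illustrate the breakage (the helper is not the actual patch, just a sketch of the dict-vs-list difference):

```python
def find_param(params, name):
    """Look up a query parameter whether python-consul passed a dict
    (< 1.1.0) or a list of (name, value) tuples (>= 1.1.0)."""
    if isinstance(params, dict):
        return params.get(name)
    for key, value in params or []:
        if key == name:
            return value
    return None

# For a blocking "watch" request python-consul >= 1.1.0 builds something like:
params = [('index', '1234'), ('wait', '30s')]
wait = find_param(params, 'wait')  # '30s' -> used to size the HTTP timeout and retries
```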
Currently the informational message logged is beyond confusing. This
improves the logging so there is some indication of what the message is
about and that it is somewhat normal. Changes by @ants
Fix the discrepancy in the values of max_wal_senders and max_replication_slots between the sample postgres.yml files and the hard-coded defaults in Patroni, bumping the former to 10.
Contributed by @dtseiler
It is possible to change many parameters at runtime (including `restapi.listen`) by updating the Patroni config file and sending SIGHUP to the Patroni process.
If something was misconfigured, a weird exception was thrown and the `restapi` thread broke.
This PR makes the error message friendlier and avoids breaking the `restapi` thread.
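A hedged sketch of the guard (the function and the `rebind` helper are purely illustrative, not the actual REST API server code):

```python
import logging

logger = logging.getLogger(__name__)

def apply_new_listen_address(server, new_listen):
    """Try to rebind the REST API to a new 'host:port'; keep the old socket on failure."""
    try:
        host, port = new_listen.rsplit(':', 1)
        server.rebind(host, int(port))  # hypothetical helper on the server object
    except Exception as exc:
        # Don't let a bad restapi.listen value kill the restapi thread; explain
        # what went wrong and keep serving on the previous address instead.
        logger.error('Failed to apply restapi.listen=%r: %r; keeping the old address',
                     new_listen, exc)
```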
* Take and apply some parameters from controldata when starting as a replica
https://www.postgresql.org/docs/10/static/hot-standby.html#HOT-STANDBY-ADMIN
There is a set of parameters whose values on the replica must not be smaller than on the primary, otherwise the replica will refuse to start:
* max_connections
* max_prepared_transactions
* max_locks_per_transaction
* max_worker_processes
It might happen that the values of these parameters in the global configuration are not set high enough, which makes it impossible to start a replica without human intervention. This usually happens when we bootstrap a new cluster from a basebackup.
As a solution, we take the values of the above parameters from the pg_controldata output and, if the values in the global configuration are not high enough, apply the values from pg_controldata and set the `pending_restart` flag.
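A minimal sketch of the idea, assuming the pg_controldata output has already been parsed into a dict (the field names below are the standard pg_controldata labels, but the helper itself is illustrative, not Patroni's actual code):

```python
# Map PostgreSQL GUC names to the corresponding pg_controldata output fields.
CONTROLDATA_FIELDS = {
    'max_connections': 'max_connections setting',
    'max_prepared_transactions': 'max_prepared_xacts setting',
    'max_locks_per_transaction': 'max_locks_per_xact setting',
    'max_worker_processes': 'max_worker_processes setting',
}

def adjust_replica_parameters(parameters, controldata):
    """Bump replica parameters to at least the primary's values; return whether a restart is pending."""
    pending_restart = False
    for guc, field in CONTROLDATA_FIELDS.items():
        required = int(controldata.get(field, 0))
        if required > int(parameters.get(guc, 0)):
            parameters[guc] = required  # value recorded by the primary wins
            pending_restart = True
    return pending_restart
```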
Do not exit when the cluster system ID is empty or doesn't pass the validation check. In that case the cluster most likely needs a reinit; mention that in the result message.
Avoid terminating Patroni, because otherwise the reinit cannot happen.
We already have a lot of logic in place to prevent failover in such a case and to restore all keys, but an accidental removal of the `/config` key was effectively switching off pause mode for one cycle of the HA loop.
Upon start, the postmaster process performs various safety checks if there is a postmaster.pid file in the data directory. Even though Patroni has already detected that the running process referenced by postmaster.pid is not a postmaster, the new postmaster might fail to start because it thinks that postmaster.pid is already locked.
Important!!! Unlinking postmaster.pid isn't an option in this case, because it comes with a lot of nasty race conditions.
Luckily there is a workaround: we can pass the pid from postmaster.pid in the `PG_GRANDPARENT_PID` environment variable and the postmaster will ignore it.
You are more likely to hit this problem if you run Patroni and Postgres in a Docker container.
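A hedged sketch of the workaround (the function is illustrative; only the PG_GRANDPARENT_PID trick comes from the change itself):

```python
import os

def env_for_new_postmaster(data_dir):
    """Build the environment for starting a new postmaster while a stale postmaster.pid exists."""
    env = os.environ.copy()
    try:
        with open(os.path.join(data_dir, 'postmaster.pid')) as f:
            stale_pid = f.readline().strip()
        # The postmaster treats the pid named here as part of its own ancestry
        # (normally pg_ctl sets this), so it won't refuse to start just because
        # that pid still exists and is named in the old lock file.
        env['PG_GRANDPARENT_PID'] = stale_pid
    except (IOError, OSError):
        pass  # no pid file or unreadable: nothing to work around
    return env
```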
On Kubernetes 1.10.0 I experienced an issue where calls to `patch_or_create` were failing when bootstrapping a cluster. The call was failing because `self._leader_observed_subsets` was `None` instead of `[]`.
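A hedged guess at the shape of the fix (the surrounding class and method are illustrative; only the attribute name comes from the description above):

```python
class Kubernetes:
    """Illustrative container for the attribute mentioned above."""

    def __init__(self):
        self._leader_observed_subsets = []

    def observe_leader_endpoint(self, endpoint):
        # endpoint.subsets can legitimately be None while bootstrapping;
        # patch_or_create must still be given a list.
        self._leader_observed_subsets = endpoint.subsets or []
```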
It didn't directly affect either failover or switchover, but in some rare cases success was reported too early, when the former leader released the lock: `Failed over to "None" instead of "desired-node"`.
In addition, this commit improves logs and status messages by differentiating between failover and switchover.
Patroni can attach itself to an already running PostgreSQL instance. If that is the first instance "seen" in the given cluster, Patroni for that instance will create the initialize key, grab the leader key and, if the instance is running as a replica, promote it.
Because of this behavior, when a cluster with a master and one or more replicas gets Patroni added to each node, it is imperative to start running Patroni on the master node before getting to the replicas.
This commit changes this weird behavior: Patroni will now abort on start if there is no initialize key in DCS and Postgres is running as a replica.
Closes https://github.com/zalando/patroni/issues/655
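A hedged sketch of the new check (names are illustrative, not the actual Patroni code):

```python
import logging
import sys

logger = logging.getLogger(__name__)

def ensure_sane_start(cluster, postgresql):
    """Abort if we would attach to a running replica of a cluster that was never initialized in DCS."""
    if cluster.initialize is None and postgresql.is_running() and not postgresql.is_leader():
        logger.error('Running as a replica, but there is no initialize key in DCS; '
                     'start Patroni on the master node first')
        sys.exit(1)
```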
Because they are indeed case-insensitive.
Most of the parameters have snake_case names, but there are three exceptions to this rule: DateStyle, IntervalStyle and TimeZone.
In fact, if you specify timezone = 'some/tzn' it still works, but Patroni wasn't able to find 'timezone' in pg_settings and was stripping this parameter out.
We will use a CaseInsensitiveDict to keep postgresql.parameters. This change affects only the "final" configuration. That means that if you put "duplicates" (work_mem vs WORK_MEM) into the Patroni YAML or into the cluster config, they are resolved only at the last stage; for example, you will still see both values if you use `patronictl edit-config`.
Fixes https://github.com/zalando/patroni/issues/649
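A small illustration of the lookup semantics using requests' CaseInsensitiveDict (whether Patroni reuses that class or ships its own equivalent is not specified here; the example only shows the behavior):

```python
from requests.structures import CaseInsensitiveDict

parameters = CaseInsensitiveDict({'TimeZone': 'Europe/Berlin', 'work_mem': '16MB'})

# Lookups ignore case, so a value spelled differently from pg_settings.name
# is no longer stripped out of the final configuration.
print(parameters['timezone'])   # Europe/Berlin
print(parameters['WORK_MEM'])   # 16MB
```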