2335 Commits

Author SHA1 Message Date
Alexander Kukushkin
65b43c39fa Release/v3.2.1 (#2968)
- bump version
- bump pyright
- update release notes
v3.2.1
2023-11-30 16:51:21 +01:00
Waynerv
aea9a2b0ca Cache postgres --describe-config output results (#2967)
We don't expect GUCs list to change for the same major version and don't expect major version to change while Patroni is running.
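
The caching idea can be sketched as follows. This is only an illustration: `describe_config()`, the cache dictionary, and the injectable `run` callable are hypothetical names, not Patroni's actual API. The output is stored per binary path and major version, so the subprocess is spawned only once.

```python
import subprocess
from typing import Callable, Dict, Tuple

# Cache keyed by (binary path, major version); hypothetical, not Patroni's real structure.
_describe_config_cache: Dict[Tuple[str, int], str] = {}


def _run_describe_config(postgres_bin: str) -> str:
    """Run `postgres --describe-config` and return its raw output."""
    return subprocess.check_output([postgres_bin, '--describe-config'], text=True)


def describe_config(postgres_bin: str, major_version: int,
                    run: Callable[[str], str] = _run_describe_config) -> str:
    """Return the cached output, invoking the binary only on the first call."""
    key = (postgres_bin, major_version)
    if key not in _describe_config_cache:
        _describe_config_cache[key] = run(postgres_bin)
    return _describe_config_cache[key]
```

Passing `run` explicitly makes the helper testable without a PostgreSQL installation.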
2023-11-30 12:07:06 +01:00
Sophia Ruan
71ccd41915 Fix the issue that REST API returns unknown after postgres restart (#2956)
Close #2955
2023-11-30 10:16:51 +01:00
Alexander Kukushkin
49e4a6ed7d Fix Citus transaction rollback condition check (#2964)
It seems that sometimes we get an exact match, which makes the behave tests fail.
2023-11-30 09:02:50 +01:00
Alexander Kukushkin
ebd05871d9 Bump pyright to 1.1.336 (#2952)
and fix newly reported issues
2023-11-30 09:02:16 +01:00
Alexander Kukushkin
42cd803619 Fix bug with custom bootstrap (#2948)
Patroni was incorrectly applying the `--command` argument.

Close https://github.com/zalando/patroni/issues/2947
2023-11-30 09:01:47 +01:00
Alexander Kukushkin
bae72df5b1 Fix pg_rewind behavior with Postgres v16+ (#2944)
The error message format was changed in
4ac30ba4f2, which caused `pg_rewind` to be called by Patroni even when it was not necessary.
2023-11-30 09:01:41 +01:00
Alexander Kukushkin
f2a129f209 Fix Etcd v2 with Citus (#2943)
When deploying a new Citus cluster with Etcd v2 Patroni was failing to start with the following exception:
```python
2023-11-09 10:51:41,246 INFO: Selected new etcd server http://localhost:2379
Traceback (most recent call last):
  File "/home/akukushkin/git/patroni/./patroni.py", line 6, in <module>
    main()
  File "/home/akukushkin/git/patroni/patroni/__main__.py", line 343, in main
    return patroni_main(args.configfile)
  File "/home/akukushkin/git/patroni/patroni/__main__.py", line 237, in patroni_main
    abstract_main(Patroni, configfile)
  File "/home/akukushkin/git/patroni/patroni/daemon.py", line 172, in abstract_main
    controller = cls(config)
  File "/home/akukushkin/git/patroni/patroni/__main__.py", line 66, in __init__
    self.ensure_unique_name()
  File "/home/akukushkin/git/patroni/patroni/__main__.py", line 112, in ensure_unique_name
    cluster = self.dcs.get_cluster()
  File "/home/akukushkin/git/patroni/patroni/dcs/__init__.py", line 1654, in get_cluster
    cluster = self._get_citus_cluster() if self.is_citus_coordinator() else self.__get_patroni_cluster()
  File "/home/akukushkin/git/patroni/patroni/dcs/__init__.py", line 1638, in _get_citus_cluster
    cluster = groups.pop(CITUS_COORDINATOR_GROUP_ID, Cluster.empty())
AttributeError: 'Cluster' object has no attribute 'pop'
```

It has been broken since #2909.

In addition, fix the `_citus_cluster_loader()` interface by allowing it to return only a dict object.
2023-11-30 09:01:19 +01:00
Alexander Kukushkin
df0fd91614 Do a real http request when performing name uniqueness check (#2942)
When running in containers it is possible that the traffic is routed using `docker-proxy`, which listens on the port and accepts incoming connections.

This commit effectively sticks to the original solution from #2878
2023-11-30 09:01:11 +01:00
Alexander Kukushkin
43f23df974 Verify that replica nodes received checkpoint LSN on shutdown (#2939)
If archiving is enabled, the `Postgresql.latest_checkpoint_location()` method returns the LSN of the previous (SWITCH) record, which points to the beginning of the WAL file. This makes it possible to safely promote a replica that recovers WAL files from the archive and wasn't streaming when the primary was stopped (the primary doesn't archive this WAL file).

However, in certain cases using the LSN pointing to the SWITCH record caused an unnecessary pg_rewind if the replica didn't manage to replay the shutdown checkpoint record before it was promoted.

To mitigate the problem we need to check that the replica received/replayed exactly the shutdown checkpoint LSN. At the same time, we will still write the LSN of the SWITCH record to the `/status` key when releasing the leader lock.
2023-11-30 09:01:05 +01:00
Alexander Kukushkin
42bf1f95a3 Limit accepted values for --format argument (#2938)
It used to accept any arbitrary string

Close https://github.com/zalando/patroni/issues/2936
2023-11-30 09:00:39 +01:00
Israel
23200daada Add a FAQ page to the docs (#2933)
This commit introduces a FAQ page to the docs. The idea is to get
the most frequently asked questions answered beforehand, so the user
is able to get them answered quickly without going into detail in
the docs or having to go to Slack/GitHub to clarify questions.

---------
Signed-off-by: Israel Barth Rubio <israel.barth@enterprisedb.com>
2023-11-30 09:00:22 +01:00
Alexander Kukushkin
ce10e5fccc Release v3.2.0 (#2930)
- bump version
- bump pyright and apply fixes
- update release notes
v3.2.0
2023-10-25 16:13:30 +02:00
Israel
bb90feb393 Add support for additional parameters on custom bootstrap (#2927)
Prior to this commit, a user who wanted to add parameters to the custom bootstrap script call had to configure Patroni like this:

```
bootstrap:
  method: custom_method_name
  custom_method_name:
    command: /path/to/my/custom_script --arg1=value1 --arg2=value2 ...
```

This commit extends that to behave similarly to `create_replica_methods`, i.e., we also allow the following syntax:

```
bootstrap:
  method: custom_method_name
  custom_method_name:
    command: /path/to/my/custom_script
    arg1: value1
    arg2: value2
```

All keys in the mapping that are not recognized by Patroni are treated as additional named arguments to be passed to the `command` call.
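
A minimal sketch of how such a mapping could be turned into a command line, assuming the extra keys are rendered in `--name=value` form. The function name `build_bootstrap_command` and the set of recognized keys below are illustrative assumptions, not Patroni's actual implementation.

```python
import shlex
from typing import Any, Dict, List

# Keys assumed to be interpreted by Patroni itself (illustrative, not exhaustive).
RECOGNIZED_KEYS = {'command', 'no_params', 'keep_existing_recovery_conf', 'recovery_conf'}


def build_bootstrap_command(method_config: Dict[str, Any]) -> List[str]:
    """Split `command` into argv and append every unrecognized key as --key=value."""
    argv = shlex.split(method_config['command'])
    for key, value in method_config.items():
        if key not in RECOGNIZED_KEYS:
            argv.append('--{0}={1}'.format(key, value))
    return argv


config = {'command': '/path/to/my/custom_script', 'arg1': 'value1', 'arg2': 'value2'}
print(build_bootstrap_command(config))
```

With the second configuration above, the resulting argument list is the script path followed by `--arg1=value1` and `--arg2=value2`.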

References: PAT-218.
2023-10-25 15:01:08 +02:00
Alexander Kukushkin
3d527f5728 Improve formatting of generated config and validation of ints (#2928)
- order sections similar to sample configs
- add warnings and comments to `bootstrap.dcs` section.
- add `tags` and `log` sections.
- use discovered IPs in `postgresql.connect_address` and `postgresql.listen`
- set `wal_level` to `replica` for PostgreSQL 9.6+
- make unit tests pass with python 3.6
- improve config validator so it doesn't complain when some ints are strings in YAML file.
2023-10-25 14:23:57 +02:00
Polina Bungina
6c06f5cc96 Add initial docs for patroni --validate/generate config (#2929)
For now it will sit in the section about the Patroni configuration. We can later move it to (or reference from) a new section where all the functionality of the `patroni` executable will be described.
2023-10-25 14:20:17 +02:00
Mark Pekala
f5ee67fa1c Feature: failover priority (#2780)
The priority is configured with the `failover_priority` tag. Possible values range from `0` to infinity, where `0` means that the node will never become the leader, which is the same as the `nofailover` tag set to `true`. As a result, only one of the `failover_priority` or `nofailover` tags should be set in the configuration file.

The failover priority kicks in only when more than one node has the same receive/replay LSN and these nodes are ahead of the other nodes in the cluster. In this case the node with the higher value of `failover_priority` is preferred. If there is a node with a higher receive/replay LSN, it will become the new leader even if it has a lower value of `failover_priority` (except when the priority is set to 0).
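
The ordering described above can be sketched as a single comparison. This is a simplified, centralized illustration: the `Candidate` type and `pick_leader()` are hypothetical, and Patroni itself makes this decision on each node during the leader race.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    wal_position: int       # receive/replay LSN as an integer
    failover_priority: int  # 0 means "never promote"


def pick_leader(candidates: List[Candidate]) -> Optional[Candidate]:
    """Prefer the highest LSN; use failover_priority only to break ties."""
    eligible = [c for c in candidates if c.failover_priority > 0]
    if not eligible:
        return None
    return max(eligible, key=lambda c: (c.wal_position, c.failover_priority))
```

Sorting on the tuple `(wal_position, failover_priority)` keeps the LSN as the primary criterion, so a node with a lower priority still wins if it is ahead.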

Close https://github.com/zalando/patroni/issues/2759
2023-10-24 12:22:48 +02:00
Israel
65030c56ee Add capability of specifying namespace through --dcs argument (#2926)
This commit changes the `patronictl` application so that its
`--dcs` argument can now receive a namespace.

Prior to this commit, this was the format of that argument's value:
`DCS://HOST:PORT`.

From now on it accepts this format: `DCS://HOST:PORT/NAMESPACE`. Like all
previous parts of the argument value, `NAMESPACE` is optional, and if
not given `patronictl` will fall back to the value from the configuration
file, if any, or to `service`.

This change is specifically useful when you are running a cluster in a
custom namespace, and from a machine where you don't have a configuration
file for Patroni or `patronictl`. It saves you from having to create a
configuration file containing only the `namespace` field in that case.
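
A rough sketch of how a value in the `DCS://HOST:PORT/NAMESPACE` format could be split into its optional parts using the standard library. `DcsUrl` and `parse_dcs_url()` are hypothetical names, and the real `patronictl` parser may treat edge cases differently.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class DcsUrl(NamedTuple):
    dcs: Optional[str]
    host: Optional[str]
    port: Optional[int]
    namespace: Optional[str]


def parse_dcs_url(value: str) -> DcsUrl:
    """Split DCS://HOST:PORT/NAMESPACE; every part may be absent."""
    # urlsplit only recognizes the host when a "//" prefix is present.
    parts = urlsplit(value if '//' in value else '//' + value)
    return DcsUrl(parts.scheme or None, parts.hostname, parts.port,
                  parts.path.strip('/') or None)


print(parse_dcs_url('etcd3://localhost:2379/custom'))
```

When `namespace` comes back as `None`, the caller would fall back to the configuration file value or to `service`, as described above.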

Issue reported by: Shaun Thomas <shaun@bonesmoses.org>

Signed-off-by: Israel Barth Rubio <israel.barth@enterprisedb.com>
2023-10-24 12:09:44 +02:00
Alexander Kukushkin
d471f1156d Handle AuthOldRevision error (#2913)
The error is raised when Etcd is configured to use JWT auth tokens and the user database in Etcd is updated, because the update invalidates all tokens.

If retries are requested, try to get a new token and repeat the request. Repeat it in a loop until the request is successfully executed or until `retry_timeout` is exhausted. This is the only way of solving a race condition, because between authentication and executing the request yet another modification of the user database in Etcd might happen.

If the request doesn't have to be immediately retried, set a flag that the next API request should perform the authentication first and let Patroni naturally repeat the request on the next heartbeat loop.
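
The retry branch can be sketched like this. The exception class, `call_with_reauth()`, and the two callables are stand-ins invented for illustration; they are not the Etcd client's or Patroni's real interfaces.

```python
import time
from typing import Callable, TypeVar

T = TypeVar('T')


class AuthOldRevision(Exception):
    """Stand-in for the Etcd error raised when the auth token was invalidated."""


def call_with_reauth(request: Callable[[], T], authenticate: Callable[[], None],
                     retry_timeout: float) -> T:
    """Repeat `request` with a fresh token until it succeeds or the deadline passes."""
    deadline = time.monotonic() + retry_timeout
    while True:
        try:
            return request()
        except AuthOldRevision:
            if time.monotonic() >= deadline:
                raise
            # The user database may change again before the retry, hence the loop.
            authenticate()
```

A single re-authentication is not enough because the token can be invalidated again between `authenticate()` and the next `request()` call.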

Co-authored-by: Kenny Do <kedo@render.com>
Ref: https://github.com/zalando/patroni/pull/2911
2023-10-23 14:00:37 +02:00
Alexander Kukushkin
6d98944e73 Add warning to the sample config about bootstrap section (#2925)
People often try to change it and then ask why it doesn't work.
2023-10-23 10:03:18 +02:00
zhjwpku
6cfd90401e get rid of stale comment of get_cluster (#2922)
PR #2909 removed the cache in the Zookeeper implementation of DCS, so
the comment of get_cluster should be changed to 'Retrieve a fresh
view of DCS' since every implementation does so.

Signed-off-by: Zhao Junwang <zhjwpku@gmail.com>
2023-10-23 08:30:13 +02:00
GuanqunYang193
ce187bec38 Remove user creation related docs (#2920)
* Remove user creation related docs
* remove template
2023-10-23 08:29:09 +02:00
Alexander Kukushkin
c5fffb3c97 Further work on permanent physical slots (#2891)
- Fixed issues with the has_permanent_slots() method. It didn't take into account the case of permanent physical slots for members, falsely concluding that there are no permanent slots.
- Write to the status key only LSNs for permanent slots (not just for slots that exist on the primary).
  - Include pg_current_wal_flush_lsn() in the slots feedback, so that slots on standby nodes could be advanced
- Improved behave tests:
  - Verify that permanent slots are properly created on standby nodes
  - Verify that permanent slots are properly advanced, including DCS failsafe mode
  - Verify that only permanent slots are written to the `/status`
2023-10-23 08:24:28 +02:00
zhjwpku
cb5f34b721 add some guide to run tests in different scopes (#2921)
Introduce ways to run tests in different scopes which should be helpful for beginners.
2023-10-23 08:17:53 +02:00
zhjwpku
260ab36f2e mock getaddrinfo in case test failure (#2918)
Close #2915
2023-10-17 19:53:19 +02:00
Alexander Kukushkin
fc67ba73f0 Allow to specify psycopg* in extras and switch to build (#2907)
* remove check_psycopg() call from the setup.py, when installing from wheel it doesn't work anyway.
* call check_psycopg() function before process_arguments(), because the latter tries to import psycopg and fails with a stack trace, while the former shows a nice human-readable error message.
* add psycopg2, psycopg2-binary, and psycopg3 extras, that will install psycopg2>=2.5.4, psycopg2-binary, or psycopg[binary]>=3.0.0 modules respectively.
* move check_psycopg() function to the __main__.py.
* introduce a new extra called `all`, which allows installing all dependencies at once (except psycopg related).
* use the `build` module in order to create sdist bdist_wheel packages.
* update the documentation regarding psycopg and extras (dependencies).
2023-10-17 14:46:15 +02:00
GuanqunYang193
60d8bc3a70 Add warning of removing user creation (#2893)
2023-10-17 13:04:59 +02:00
Alexander Kukushkin
e513f7f127 Attempt to reduce flakiness for recovery behave test on K8s (#2917)
wait until Postgres is properly started after the first crash before changing `primary_start_timeout` and killing it once again.
2023-10-17 11:27:41 +02:00
Alexander Kukushkin
aa3ebe0af8 Don't cache anything in Zookeeper implementation (#2909)
The cache creates a lot of problems and prevents implementing automatic retention of physical replication slots for members with a configurable retention policy.

Just read the entire cluster from Zookeeper instead and use watchers only for the `/leader` and `/config` keys.
2023-10-17 08:56:31 +02:00
Alexander Kukushkin
c96e35c807 Enable Citus behave tests for Postgres v16 (#2914)
and reduce flakiness
2023-10-16 16:05:27 +02:00
André Litfin
88b35252c3 Update README.md to reflect changes in etcd v3 (#2912)
In etcdctl v3 the `ls` command isn't present anymore; it has to be changed to `etcdctl get --keys-only --prefix`.
2023-10-16 15:18:25 +02:00
Alexander Kukushkin
d93db20baa Set citus.local_hostname (#2903)
There are cases when Citus wants to have a connection to the local postgres. By default it uses `localhost` for that, which is not always available. To solve this we will set the `citus.local_hostname` GUC to a custom value, which is the same as the one Patroni uses to connect to Postgres.
2023-10-16 10:21:50 +02:00
Alexander Kukushkin
42976df86f Make it easier to debug callbacks (#2902)
1. Introduce DEBUG logs for callbacks
2. Configure log format in behave tests to include filename, line, and method name that triggered the callback and enable DEBUG logs for `patroni.postgresql.callback_executor` module.

P.S. Unfortunately it works only starting from python 3.8, but it should be good enough for debugging purposes because 3.7 is already EOL.
2023-10-16 08:55:07 +02:00
zhjwpku
6f4c2fe132 %s/iter_dcs_modules/iter_dcs_classes/g (#2905)
2023-10-11 13:17:18 +02:00
Chris Bandy
588df5da05 Refine the documentation about custom_conf (#2901)
Some backticks in this section needed to be balanced.
2023-10-11 08:41:11 +02:00
Polina Bungina
fb367cd73e Change cb checks in standby cluster behave test (#2899)
fix and extend callback content checks
2023-10-10 13:49:52 +02:00
Alexander Kukushkin
535dc631ec Bugfix: standby cluster switchover (#2900)
1.  Enforce `_load_cluster()` after acquisition of the leader lock in ZooKeeper. Sometimes the notification from ZooKeeper was arriving too late and Patroni wasn't setting the `role=standby_leader`.

2. The `_get_node_to_follow()` method was incorrectly assuming that we still own the leader lock and returning the remote node instead of the new standby leader. While not a big issue per se, because the next HA loop usually fixes it, such behavior was causing flakiness of behave tests with Postgres 12 and older, where a restart is required to update the `primary_conninfo` GUC.
2023-10-10 12:21:19 +02:00
Alexander Kukushkin
9b8c40a6e1 Start thread that will handle SIGCHLD for on_reload callback (#2898)
Close #2897
2023-10-10 09:54:24 +02:00
Alexander Kukushkin
e19a8730ea Take IP from the pod if kubernetes.pod_ip is missing (#2895)
It used to work before #2652

Besides that, fix a couple more problems:
- make sure `_patch_or_create()` method isn't instantiating the `k8s_client.V1ConfigMap` object instead of `k8s_client.V1Endpoints` for non leader endpoints. The only reason it worked is that the JSON serialization for both object types is the same and doesn't include the object type name.
- `attempt_to_acquire_leader()` should immediately put the IP address of the primary to the leader endpoint. It didn't happen because of an oversight in https://github.com/zalando/patroni/pull/1820.
2023-10-09 10:43:43 +02:00
Israel
28a604983b Enhancement to tox behave tests (#2889)
* Add `etcd3` as a DCS option for behave tests in `tox.ini`

Currently behave tests run through `tox` accept only `etcd` as a DCS.

This commit adds the option of using `etcd3` too.

* Add JSON report to `tox` behave tests

This commit adds a JSON report when running behave tests through
`tox`.

That makes it easier to parse the results.

---------

Signed-off-by: Israel Barth Rubio <israel.barth@enterprisedb.com>
2023-10-06 10:48:55 +02:00
Polina Bungina
efacc6c16b Ignore synchronous_mode setting in a standby cluster (#2896)
is_synchronous_mode() should always return False in standby clusters
2023-10-06 10:21:37 +02:00
Alexander Kukushkin
9283ebda64 Enforce loop_wait/retry_timeout/ttl rule (#2869)
* hard-code minimal possible values
* make adjustments if values are lower or if the rule is violated and show warnings
* update documentation
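
The rule in question is `loop_wait + 2 * retry_timeout <= ttl`. The sketch below shows one possible way to enforce it; the minimal values and the adjustment strategy (keep `ttl`, shrink the other two) are assumptions made for illustration and may differ from what Patroni actually does.

```python
import logging
from typing import Tuple

logger = logging.getLogger(__name__)

# Assumed minimal values; consult the Patroni documentation for the real ones.
MIN_LOOP_WAIT, MIN_RETRY_TIMEOUT, MIN_TTL = 1, 3, 20


def adjust_timings(loop_wait: int, retry_timeout: int, ttl: int) -> Tuple[int, int, int]:
    """Raise values below the minimum, then enforce loop_wait + 2*retry_timeout <= ttl."""
    loop_wait = max(loop_wait, MIN_LOOP_WAIT)
    retry_timeout = max(retry_timeout, MIN_RETRY_TIMEOUT)
    ttl = max(ttl, MIN_TTL)
    if loop_wait + 2 * retry_timeout > ttl:
        logger.warning('loop_wait + 2*retry_timeout > ttl, adjusting values')
        # Hypothetical strategy: keep ttl, shrink loop_wait first, then retry_timeout.
        loop_wait = min(loop_wait, ttl - 2 * MIN_RETRY_TIMEOUT)
        retry_timeout = (ttl - loop_wait) // 2
    return loop_wait, retry_timeout, ttl
```

For example, `adjust_timings(10, 15, 30)` leaves `ttl` untouched and reduces `retry_timeout` so that the rule holds again.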
2023-10-04 11:44:57 +02:00
Israel
a329a9d320 Add a documentation page for patronictl (#2874)
This PR introduces a documentation page for `patronictl` application.

We adopted a top-down approach when writing this document. We start by describing the outermost parts, and then keep writing new sections that specialize the knowledge.

We basically added a section called `patronictl` to the left menu. Inside that section we created a page with this structure:

- `patronictl`: describes what it is
    - `Configuration`: how to configure `patronictl`
    - `Usage`: how to use the CLI. Inside this section, there are subsections for each of the subcommands exposed by `patronictl`, and each of them is described using the following subsubsections:
        - `Synopsis`: syntax of the command and its positional and optional arguments
        - `Description`: a description of what the command does
        - `Parameters`: a detailed description of the arguments and how to use them
        - `Examples`: one or more examples of execution of the command

References: PAT-200.
2023-10-04 11:43:38 +02:00
Alexander Kukushkin
f77073c8e1 Speed up dcs failsafe behave tests (#2890)
- get rid of sleeps
- reduce retry_timeout
- avoid graceful Patroni shut down while DCS is "paused", just kill
  Patroni and after that gracefully stop postgres
- don't try to delete Pod when Patroni is killed. If K8s API is paused it takes ages

The run time on my laptop is reduced from 2m to 1m28s.
2023-09-28 10:44:11 +02:00
Polina Bungina
aaac6f6fb0 Don't fail if pg_hba/pg_ident contain comment lines (#2888)
The YAML parser interprets such lines as null and stores them as None in the
array of parsed values, which cannot be handled by the write() function
and crashes the whole bootstrap process.
Even though it is not a proper value, it won't hurt if we just ignore it instead of failing completely.
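
A small sketch of the "ignore instead of failing" approach. The helper name `clean_hba_lines()` is hypothetical; it only illustrates filtering out the `None` entries that a YAML list item consisting solely of a comment produces.

```python
from typing import Iterable, List, Optional


def clean_hba_lines(lines: Iterable[Optional[str]]) -> List[str]:
    """Drop entries that the YAML parser turned into None (comment-only items)."""
    return [line for line in lines if line is not None]


# A YAML list such as ["host all all 0.0.0.0/0 md5", <comment-only item>] is parsed
# with None in place of the comment-only item.
parsed = ['host all all 0.0.0.0/0 md5', None, 'local all all trust']
print('\n'.join(clean_hba_lines(parsed)))
```

Writing only the surviving strings avoids passing `None` to the file writer.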
2023-09-27 15:57:09 +02:00
Polina Bungina
27915984b4 Add contrib requirement for tests, small docs refactoring (#2887)
2023-09-27 12:19:58 +02:00
Polina Bungina
220cacd95f Don't call socket functions from tests (#2886)
We used to call the `socket` module's functions from the config_generator tests
to later compare with the output produced by --generate-config. That,
however, sometimes caused the whole test module to fail if gethostname()
returned None.
Also includes a little code deduplication (NO_VALUE_MSG imported directly from the config_generator module)
and removes the debug maxDiff option.
2023-09-26 15:52:00 +02:00
Alexander Kukushkin
a3b3e1bc1c Release v3.1.2 (#2885)
- bump version
- update release notes
2023-09-26 12:30:27 +02:00
Alexander Kukushkin
c855b0bff9 Detect and solve inconsistency between /sync and actual sync nodes (#2877)
Patroni changes `synchronous_standby_names` and the `/sync` key in a very specific order: first we add nodes to `synchronous_standby_names`, and only after they are recognized as synchronous are they added to the `/sync` key. When removing nodes the order is different: they are first removed from the `/sync` key and only after that from `synchronous_standby_names`.

As a result Patroni expects that either actual synchronous nodes will match the nodes listed in the `/sync` key or that new candidates to synchronous nodes will not match the nodes listed in the `/sync` key. If `synchronous_standby_names` was removed from `postgresql.conf`, manually or due to the bug (#2876), the state becomes inconsistent because of the wrong order of updates.

To solve inconsistent state we introduce additional checks and will update the `/sync` key with actual names of synchronous nodes (usually empty set).
2023-09-26 11:14:20 +02:00
Alexander Kukushkin
4c1c804cfd Read GUC's values when joining running Postgres (#2876)
If restarted in pause, Patroni was discarding `synchronous_standby_names` from `postgresql.conf` because in the internal cache this value was set to `None`. As a result synchronous replication transitioned to a broken state, with no synchronous replicas according to `synchronous_standby_names` and Patroni not selecting/setting the new synchronous replicas (another bug).

To solve the problem of broken initial state and to avoid similar issues with other GUC's we will read GUC's value if Patroni is joining running Postgres.
2023-09-26 10:40:51 +02:00