When joining an already running Postgres, Patroni ensures that config files are set according to expectations.
With recovery parameters converted to GUCs in Postgres v12 this became a bit of a problem, because when the `Postgresql` object is being created it is not yet known where the given replica is supposed to stream from.
As a result, postgresql.conf was first written without recovery parameters, and on the next run of the HA loop Patroni noticed the inconsistency and updated the config one more time.
For Postgres v12 this is not a big issue, but for v13+ it resulted in an interruption of streaming replication.
PostgreSQL 14 changed the behavior of replicas when certain parameters (for example `max_connections`) are changed (increased): https://github.com/postgres/postgres/commit/15251c0a.
Instead of immediately exiting, Postgres 14 pauses replication and waits for action from the operator.
Since `pg_is_wal_replay_paused()` returning `True` is the only indicator of such a change, Patroni on the replica will call `pg_wal_replay_resume()`, which will either let replication continue or cause a shutdown (like previously).
So far Patroni has never called `pg_wal_replay_resume()` on its own; therefore, to remain backward compatible, it will call it only for PostgreSQL 14+.
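Below is a minimal sketch of this logic, assuming a psycopg2 connection to the local replica; function and parameter names are illustrative, not Patroni's actual API:
```python
# Sketch: on PostgreSQL 14+ a paused WAL replay is the only sign that a
# parameter such as max_connections was increased on the primary, so we
# resume replay and let Postgres either continue or shut down as before.
def resume_wal_replay_if_paused(conn, server_version_num):
    if server_version_num < 140000:
        return  # keep the old behavior for backward compatibility
    with conn.cursor() as cur:
        cur.execute("SELECT pg_is_wal_replay_paused()")
        if cur.fetchone()[0]:
            cur.execute("SELECT pg_wal_replay_resume()")
```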
1. When everything is normal, only one line will be written for every run of the HA loop (see examples):
```
INFO: no action. I am (postgresql0) the leader with the lock
INFO: no action. I am a secondary (postgresql1) and following a leader (postgresql0)
```
2. The `does not have lock` message became a debug message.
3. The `Lock owner: postgresql0; I am postgresql1` line will be shown only when the situation doesn't look normal.
Promoting the standby cluster requires updating load-balancer health checks, which is not very convenient and is easy to forget.
In order to solve this, we change the behavior of the `/leader` health-check endpoint: it will return 200 regardless of whether PostgreSQL is running as the primary or the standby_leader.
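A small usage sketch of the new behavior (host and port are assumptions, 8008 being the usual REST API default); the check now succeeds for a regular primary as well as for the standby_leader of a standby cluster:
```python
import requests

def is_leader(host='localhost', port=8008):
    # /leader returns 200 for the primary and for the standby_leader alike
    try:
        return requests.get(f'http://{host}:{port}/leader', timeout=2).status_code == 200
    except requests.RequestException:
        return False
```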
It could happen that the replica is, for some reason, missing the WAL file required by the replication slot.
The nature of this phenomenon is a bit unclear; it might be that the WAL was recycled shortly before we copied the slot file, but we still need a solution to this problem. If `pg_replication_slot_advance()` fails with the `UndefinedFile` exception (requested WAL segment pg_wal/... has already been removed), the logical slot on the replica must be recreated.
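A minimal sketch of this handling, assuming psycopg2 (the helper name and return convention are illustrative, not Patroni's real API):
```python
import psycopg2

def advance_copied_slot(replica_conn, slot_name, target_lsn):
    """Returns False when the copied slot must be recreated from the primary."""
    with replica_conn.cursor() as cur:
        try:
            cur.execute("SELECT pg_replication_slot_advance(%s, %s)",
                        (slot_name, target_lsn))
            return True
        except psycopg2.errors.UndefinedFile:
            # "requested WAL segment pg_wal/... has already been removed"
            replica_conn.rollback()
            return False
```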
When unix_socket_directories is not known, Patroni was immediately falling back to a TCP connection via localhost.
The bug was introduced in https://github.com/zalando/patroni/pull/1865
Update `pysyncobj` and run the Raft behave tests with encryption enabled.
Using the new `pysyncobj` release allowed us to get rid of a lot of hacks with accessing private properties and methods of the parent class and to reduce the size of `raft.py`.
Close https://github.com/zalando/patroni/issues/1746
This adds the default Postgres settings enforced by Patroni to the `postgres{n}.yml` files provided in the repo. The documentation does call out the defaults that Patroni will set, but they can be missed if you download postgres0.yml and use it as a starting point. Hopefully the extra commented-out configs serve as a visual cue to save the next person from the same mistake :)
Although all recovery parameters became GUCs in PostgreSQL 12, there are very good reasons to keep them separated in the Patroni internals.
While implementing PostgreSQL parameter validation in #1674 one little oversight occurred: the parameter validation happens before the recovery parameters are removed from the list, which produces a false warning.
Close https://github.com/zalando/patroni/issues/1907
1. Commit 04b9fb9dd4 introduced additional conditions for updating the cached version of the leader optime. This was required for implementing health checks based on replication lag in https://github.com/zalando/patroni/pull/1599.
What was in fact forgotten is that the event should be cleared after the new value of the optime is fetched. Not doing so results in running the HA loop more frequently than required.
2. Don't watch for sync members.
The watch for sync member(s) was introduced in order to give a signal to the leader that one of the members set the `nosync` tag to true.
Since then a few more conditions that should be notified about have appeared; therefore, instead of watching all members of the cluster, every cluster member checks whether the condition is met and, instead of updating its ZNode, performs delete+create.
Since every member is already watching for new ZNodes being created inside $scope/members/, they automatically get notified about important changes, and therefore watching for sync members is redundant.
3. In addition to that, slightly increase the watch timeout; it keeps HA loops in sync across all nodes in the cluster.
Close https://github.com/zalando/patroni/pull/1873
This PR fixes #1886.
We get the certificate serial number on server startup and store it in `api.__ssl_serial_number`.
On reload, we again read the serial number from disk and compare it to the one stored in `api.__ssl_serial_number`: if it differs, the API will be reloaded (even if the config file didn't change).
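A minimal sketch of the comparison, using the `cryptography` package (the actual implementation may read the serial differently; names are illustrative):
```python
from cryptography import x509

def certificate_changed(certfile, stored_serial_number):
    # re-read the certificate from disk and compare its serial number with
    # the one remembered at startup to decide whether to reload the REST API
    with open(certfile, 'rb') as f:
        cert = x509.load_pem_x509_certificate(f.read())
    return cert.serial_number != stored_serial_number
```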
Effectively, this PR consists of a few changes:
1. The easy part:
If permanent logical slots are defined in the global configuration, Patroni on the primary will not only create them, but also periodically update DCS with the current values of `confirmed_flush_lsn` for all these slots.
In order to reduce the number of interactions with DCS, the new `/status` key was introduced. It will contain a JSON object with `optime` and `slots` keys. For backward compatibility the `/optime/leader` key will be updated if there are members with old Patroni versions in the cluster.
2. The tricky part:
On replicas that are eligible for a failover, Patroni creates the logical replication slot by copying the slot file from the primary and restarting the replica. In order to copy the slot file, Patroni opens a connection to the primary with `rewind` or `superuser` credentials and calls the `pg_read_binary_file()` function.
When the logical slot already exists on the replica, Patroni periodically calls the `pg_replication_slot_advance()` function, which allows moving the slot forward (a rough sketch of both steps is shown after this entry).
3. Additional requirements:
In order to ensure that the primary doesn't clean up tuples from pg_catalog that are required for logical decoding, Patroni enables `hot_standby_feedback` on replicas with logical slots and on cascading replicas if they are used for streaming by replicas with logical slots.
4. When logical slots are copied to the replica there is a timeframe during which it might not be safe to use them after promotion. Right now there is no protection against promoting such a replica, but Patroni will show a warning with the names of the slots that might not be safe to use.
Compatibility.
The `pg_replication_slot_advance()` function is only available starting from PostgreSQL 11; for older Postgres versions Patroni will refuse to create the logical slot on the primary.
The old "permanent slots" feature, which created logical slots right after promotion and before allowing connections, was removed.
Close: https://github.com/zalando/patroni/issues/1749
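The rough sketch below (not Patroni's actual implementation; connection details, paths, and helper names are assumptions) illustrates the two replica-side steps described above: copying the slot's state file from the primary with `pg_read_binary_file()` and periodically advancing the slot with `pg_replication_slot_advance()`:
```python
import os
import psycopg2

def copy_slot_file(primary_conninfo, slot_name, replica_data_dir):
    # read the slot's state file from the primary (requires rewind/superuser)
    with psycopg2.connect(primary_conninfo) as conn, conn.cursor() as cur:
        cur.execute("SELECT pg_read_binary_file('pg_replslot/' || %s || '/state')",
                    (slot_name,))
        state = bytes(cur.fetchone()[0])
    slot_dir = os.path.join(replica_data_dir, 'pg_replslot', slot_name)
    os.makedirs(slot_dir, exist_ok=True)
    with open(os.path.join(slot_dir, 'state'), 'wb') as f:
        f.write(state)
    # the replica has to be restarted for PostgreSQL to pick the slot up

def advance_slot(replica_conn, slot_name, confirmed_flush_lsn):
    # periodically move the existing slot forward to the LSN published in DCS
    with replica_conn.cursor() as cur:
        cur.execute("SELECT pg_replication_slot_advance(%s, %s)",
                    (slot_name, confirmed_flush_lsn))
```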
The old strategy was to wait for 1 second and hope that we would get an update event from the WATCH connection.
Unfortunately, it didn't work well in practice. Instead, we now get the current value from the API by performing an explicit read request.
Close https://github.com/zalando/patroni/issues/1767
This commit makes it possible to configure the maximum lag (`maximum_lag_on_syncnode`) after which Patroni will "demote" the node from synchronous and replace it with another node.
The previous implementation always tried to stick to the same synchronous nodes (even if they were not the optimal ones).
If an existing instance was configured with WAL residing outside of PGDATA then currently a 'reinit' would lose such symlinks. So add some bits of information on that to draw attention to this corner-case issue and also add the --waldir option to the sample `postgresql.basebackup` configuration sections to increase visibility.
Discussion: https://github.com/zalando/patroni/issues/1817
It could happen that the cluster role is either not configured or doesn't provide enough permissions. In this case bypass_api_service is ignored, but a warning is logged, which is rather annoying when patronictl is used.
Since bypass_api_service is most useful for Patroni itself, we will simply ignore it when patronictl is used.
The Python SSL library allows a password to be passed to its "load_cert_chain" function when setting up an SSLContext[1].
This allows an encrypted key file in PEM representation to be loaded into the certificate chain.
This commit adds the optional "keyfile_password" parameter to the REST API block of the configuration so that Patroni can load encrypted private keys when establishing its TLS socket.
It also adds the corresponding "PATRONI_RESTAPI_KEYFILE_PASSWORD" environment variable, which has the same effect.
[1] https://docs.python.org/3/library/ssl.html#ssl.SSLContext.load_cert_chain
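A minimal sketch of what the new setting enables (file names and the password value are placeholders): the configured keyfile_password is simply forwarded to load_cert_chain() so the encrypted PEM key can be decrypted:
```python
import ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile='restapi.pem', keyfile='restapi.key',
                    password='keyfile_password_from_config')
```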
It could happen that one of the etcd servers is not accessible on Patroni start.
In this case Patroni was trying to perform authentication and exiting, while it should exit only if Etcd explicitly responded with the `AuthFailed` error.
Close https://github.com/zalando/patroni/issues/1805
The good old python-consul has not been maintained for a few years, therefore someone forked it under a different name, but the package files are installed into the same location as the old one's.
The API of both modules is mostly compatible, therefore it wasn't hard to add support for both modules in Patroni.
Taking into account that python-consul is not a direct requirement for Patroni but an extra, the end user now has a choice of what to install.
Close https://github.com/zalando/patroni/issues/1810
1. Fix flaky behave tests with ZooKeeper. First install/start the binaries (zookeeper/localkube) and only after that continue with installing requirements and running behave. Previously ZooKeeper didn't have enough time to start and tests sometimes failed.
2. Fix flaky Raft tests. Despite observations of macOS slowness, for some unknown reason the delete test with a very small timeout was not timing out but succeeding, causing unit tests to fail. The solution: do not rely on the actual timeout, but mock it.
1. If the superuser name is different from postgres, pg_rewind in the standby cluster was failing because the connection string didn't contain the database name (see the short illustration after this list).
2. Provide output if the single-user mode recovery failed.
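A tiny illustration of the first fix (host, user, and paths are placeholders): the connection string passed to pg_rewind now carries an explicit dbname, since libpq would otherwise default the database name to the user name and fail for non-postgres superusers:
```python
import subprocess

conninfo = 'host=primary.example.com port=5432 user=admin dbname=postgres'
cmd = ['pg_rewind', '--target-pgdata', '/pgdata/data', '--source-server', conninfo]
# subprocess.check_call(cmd)  # illustration only
```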
Close https://github.com/zalando/patroni/pull/1736
Python 3.8 changed the way exceptions raised from the Thread.run() method are handled.
This resulted in unit tests showing a couple of warnings. They are not important and we just silence them.
When the __del__() method is executed the Python interpreter has already unloaded some of the modules that are still used by the http.clear() method.
The only thing we can do in this case is silence exceptions like ReferenceError, TypeError, and AttributeError.
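A minimal sketch of the workaround (class and attribute names are illustrative): the cleanup in __del__ swallows the exceptions that can appear during interpreter shutdown:
```python
class PoolManagerHolder:
    def __init__(self, http):
        self.http = http  # e.g. a urllib3.PoolManager

    def __del__(self):
        try:
            self.http.clear()
        except (ReferenceError, TypeError, AttributeError):
            # modules used by clear() may already be unloaded at shutdown
            pass
```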
Close https://github.com/zalando/patroni/issues/1785
When there is no config key in DCS, Patroni shouldn't try accessing ignore_slots, otherwise an exception is raised.
In addition to that, implement missing unit tests and fix linting issues in behave tests.
When the watchdog device is not writable or is missing in `required` mode, the member cannot be promoted; it only keeps logging that it is not the healthiest member.
Add a warning to show the user where to look for this misconfiguration.
When doing a custom bootstrap, the configured value of max_connections could be smaller than the actual value stored in pg_control. As a workaround, Patroni already adjusted such parameters. Unfortunately, that doesn't cover the case when the parameter value was increased even further during WAL replay, which was causing Postgres to stop.
Setting hot_standby=off disables connections to the instance, but allows Postgres to run with parameter values smaller than required.
hot_standby=off doesn't affect the running primary: when recovery finishes and Postgres promotes, it starts accepting connections as usual. If the node is demoted, Patroni will stop Postgres and start it up as a replica with hot_standby=on.
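A rough sketch of the check behind this change (not Patroni's actual code; parsing and names are assumptions): compare the configured value with the one recorded in pg_control, and if the configured one is lower, recovery is started with hot_standby=off so replay can proceed even if the value grows further:
```python
import subprocess

def max_connections_too_low(data_dir, configured_value):
    out = subprocess.check_output(['pg_controldata', data_dir]).decode()
    for line in out.splitlines():
        if line.startswith('max_connections setting:'):
            return int(line.split(':')[1]) > configured_value
    return False
```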
Fixes #1780
Specifically fixes the calls to `os.symlink` within the `move_data_directory` function, as used to update the symlink(s) for any WAL or tablespace directories following an init failure. The new name for a WAL and/or tablespace directory is now passed in as the first argument when calling `os.symlink`, while the second argument is now the name of the symlink that is being recreated.
This aligns with the Python documentation for `os.symlink`, which states the following regarding the first two arguments for the `os.symlink` function (arguments `src` and `dst`):
> Create a symbolic link pointing to src named dst (https://docs.python.org/3/library/os.html#os.symlink)
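A short illustration of the corrected call order (the paths in the usage comment are examples taken from the log output below):
```python
import os

def recreate_symlink(renamed_target, link_path):
    # os.symlink(src, dst): create a link named dst (second argument)
    # pointing to src (first argument) - the order this fix restores
    if os.path.islink(link_path):
        os.unlink(link_path)
    os.symlink(renamed_target, link_path)

# e.g. recreate_symlink('/tablespaces/ts1/ts1_2020-12-05-20-35-43',
#                       '/pgdata/mycluster1_2020-12-05-20-35-43/pg_tblspc/16414')
```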
With this change, all WAL and/or tablespace directories will be properly renamed following an init failure, and the `FileExistsError` that would otherwise occur when attempting to recreate the associated symlinks is no longer thrown, e.g.:
```bash
2020-12-05 20:35:42.219 GMT [262] LOG: unrecognized configuration parameter "invalid" in file "/pgdata/mycluster1/postgresql.auto.conf" line 5
2020-12-05 20:35:42.219 GMT [262] FATAL: configuration file "/pgdata/mycluster1/postgresql.auto.conf" contains errors
2020-12-05 20:35:43,220 ERROR: postmaster is not running
2020-12-05 20:35:43,223 INFO: removing initialize key after failed attempt to bootstrap the cluster
2020-12-05 20:35:43,292 INFO: renaming user defined tablespace directory and updating symlink: /tablespaces/ts1/ts1
2020-12-05 20:35:43,314 INFO: renaming data directory to /pgdata/mycluster1_2020-12-05-20-35-43
Traceback (most recent call last):
```
Additionally, any/all WAL and/or tablespace symlinks are properly recreated, e.g.:
```bash
$ ls -l pg_tblspc
total 0
lrwxrwxrwx. 1 postgres postgres 40 Dec 5 20:35 16414 -> /tablespaces/ts1/ts1_2020-12-05-20-35-43
```