Commit Graph

120 Commits

Author SHA1 Message Date
Alexander Kukushkin
59ecfb1799 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2024-01-05 10:22:35 +01:00
Alexander Kukushkin
bcfd8438a5 Abstract CitusHandler and decouple it from configuration (#2950)
The main issue was that the configuration for the Citus handler and for the DCS existed in two places, while ideally AbstractDCS should not know many details about which kind of MPP is in use.

To solve the problem we first dynamically create an object implementing the AbstractMPP interface, which serves as the configuration for the DCS. Later this object is used to instantiate the class implementing the AbstractMPPHandler interface.

This is just a starting point, which does some heavy lifting. As a next step, all kinds of variables named after Citus in files other than patroni/postgres/mpp/citus.py should be renamed.

In other words this commit takes over the most complex part of #2940, which was never implemented.
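A rough sketch of the described arrangement, with illustrative names and heavily simplified interfaces (the real classes in patroni/postgres/mpp/ are richer):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class AbstractMPP(ABC):
    """Configuration-only view of an MPP setup; all that the DCS needs to know."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self._config = config

    @property
    @abstractmethod
    def group(self) -> int:
        """The group this Patroni cluster belongs to."""


class AbstractMPPHandler(AbstractMPP):
    """Full handler interface, instantiated only where MPP-specific logic is needed."""

    @abstractmethod
    def sync_meta_data(self, cluster: Any) -> None:
        """Keep the MPP metadata in sync with the cluster view."""


class Null(AbstractMPP):
    """No-op implementation used when no MPP is configured (illustrative)."""

    @property
    def group(self) -> int:
        return 0


class Citus(AbstractMPP):
    @property
    def group(self) -> int:
        return int(self._config.get('group', 0))


class CitusHandler(Citus, AbstractMPPHandler):
    def sync_meta_data(self, cluster: Any) -> None:
        pass  # would keep pg_dist_node in sync here


def get_mpp(config: Dict[str, Any]) -> AbstractMPP:
    # AbstractDCS only ever receives the lightweight configuration object;
    # the handler is created from it later, where it is actually needed.
    return Citus(config['citus']) if 'citus' in config else Null({})
```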

Co-authored-by: zhjwpku <zhjwpku@gmail.com>
2023-12-21 08:58:26 +01:00
Alexander Kukushkin
13cc86f851 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-11-24 14:42:15 +01:00
Alexander Kukushkin
1b96ae9c0a Fix Etcd v2 with Citus (#2943)
When deploying a new Citus cluster with Etcd v2, Patroni was failing to start with the following exception:
```python
2023-11-09 10:51:41,246 INFO: Selected new etcd server http://localhost:2379
Traceback (most recent call last):
  File "/home/akukushkin/git/patroni/./patroni.py", line 6, in <module>
    main()
  File "/home/akukushkin/git/patroni/patroni/__main__.py", line 343, in main
    return patroni_main(args.configfile)
  File "/home/akukushkin/git/patroni/patroni/__main__.py", line 237, in patroni_main
    abstract_main(Patroni, configfile)
  File "/home/akukushkin/git/patroni/patroni/daemon.py", line 172, in abstract_main
    controller = cls(config)
  File "/home/akukushkin/git/patroni/patroni/__main__.py", line 66, in __init__
    self.ensure_unique_name()
  File "/home/akukushkin/git/patroni/patroni/__main__.py", line 112, in ensure_unique_name
    cluster = self.dcs.get_cluster()
  File "/home/akukushkin/git/patroni/patroni/dcs/__init__.py", line 1654, in get_cluster
    cluster = self._get_citus_cluster() if self.is_citus_coordinator() else self.__get_patroni_cluster()
  File "/home/akukushkin/git/patroni/patroni/dcs/__init__.py", line 1638, in _get_citus_cluster
    cluster = groups.pop(CITUS_COORDINATOR_GROUP_ID, Cluster.empty())
AttributeError: 'Cluster' object has no attribute 'pop'
```

It has been broken since #2909.

In addition to that, fix the `_citus_cluster_loader()` interface by allowing it to return only a dict object.
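A minimal, self-contained sketch of the intended contract (names follow the traceback above; `Cluster` here is only a tiny stand-in): the Citus cluster loader always returns a dict keyed by group, so popping the coordinator group is safe:

```python
from dataclasses import dataclass, field
from typing import Dict

CITUS_COORDINATOR_GROUP_ID = 0  # assumed value, for illustration only


@dataclass
class Cluster:
    """Tiny stand-in for patroni.dcs.Cluster."""
    workers: Dict[int, 'Cluster'] = field(default_factory=dict)

    @staticmethod
    def empty() -> 'Cluster':
        return Cluster()


def _get_citus_cluster(groups: Dict[int, Cluster]) -> Cluster:
    # With the loader guaranteed to return a dict, popping the coordinator group
    # can no longer raise "'Cluster' object has no attribute 'pop'".
    cluster = groups.pop(CITUS_COORDINATOR_GROUP_ID, Cluster.empty())
    cluster.workers.update(groups)
    return cluster
```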
2023-11-09 11:09:38 +01:00
Alexander Kukushkin
f5cb888f80 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-08-14 17:06:42 +02:00
Alexander Kukushkin
9209a5a133 Refactor delete_leader interface (#2810)
Similar to https://github.com/zalando/patroni/pull/2690, but it mostly helps the Consul implementation.
2023-08-11 10:19:29 +02:00
Alexander Kukushkin
1a0549d540 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-07-31 16:26:10 +02:00
Alexander Kukushkin
7e89583ec7 Please new flake8 (#2789)
It stopped accepting the lack of a space character between `,` and `\`:
```python
foo,\
    bar
```
2023-07-31 09:08:46 +02:00
Alexander Kukushkin
db8061ad29 Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit 2023-05-30 14:20:08 +02:00
Alexander Kukushkin
af8e5f0d0f Refactor update_leader interface (#2690)
Pass a reference to the last known leader object in order to avoid obtaining it from the `AbstractDCS.cluster` cache.

This change is useful for Consul, Etcd3 and Zookeeper implementations.
2023-05-25 14:21:05 +02:00
Alexander Kukushkin
2223553fe5 Introduce quorum field in the /sync key
Old keys remain the same for backward compatibility.
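A purely illustrative example of what the /sync key payload might look like with the new field (member names and the exact semantics of the value are made up for this sketch):

```python
# Hypothetical /sync key content: "leader" and "sync_standby" are the old fields
# that stay unchanged; "quorum" is the newly introduced one.
sync_state = {
    'leader': 'postgresql0',
    'sync_standby': 'postgresql1,postgresql2',
    'quorum': 1,  # illustrative: how many of the listed standbys must confirm a commit
}
```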
2023-05-11 11:15:07 +02:00
Alexander Kukushkin
7941c86775 Refactor write_sync_state() (#2669)
Make it return the new `SyncState` object in order to avoid reading the new cluster state in `Ha.process_sync_replication()`.

Now it is a small optimization, but it will become very handy in the quorum commit feature.
2023-05-11 09:58:15 +02:00
Alexander Kukushkin
76b3b99de2 Enable pyright strict mode (#2652)
- added pyrightconfig.json with typeCheckingMode=strict
- added type hints to all files except api.py
- added type stubs for dns, etcd, consul, kazoo, pysyncobj and other modules
- added type stubs for psycopg2 and urllib3 with some little fixes
- fixed most of the issues reported by pyright
- remaining issues will be addressed later, along with enabling CI linting task
2023-05-09 09:38:00 +02:00
Polina Bungina
3fe2a7868a Ignore D401 in flake8-docstrings (#2627)
* Ignore D401 in flake8-docstrings
* Fix newly reported flake8 issues, ignore the old W503 rule
* Rely on concatenation of adjacent strings
* Format behave scripts
* Reformat ha.py according to new rules

Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
2023-04-03 09:52:22 +02:00
Alexander Kukushkin
6f357a4e17 Factor out global configuration into a dedicated class (#2628)
It will help to avoid code duplication.
2023-04-03 08:09:29 +02:00
Alexander Kukushkin
ddac8683e6 Use config file as a fallback when all current etcd nodes failed (#2599)
If communication with etcd nodes failed, it is logical to start from scratch with the nodes that are listed in the config. But it could happen that the config is in fact outdated and all nodes in the real cluster were replaced.

Previously we used to track whether the config file was changed, which turned out not to work in all possible cases.
The new strategy is a bit different: if communication with all nodes failed, we keep the last known topology and at the same time try to figure out the new one by merging two lists together, the cached list and the list from the config file.
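A minimal sketch of the described merge strategy (hypothetical helper name):

```python
from typing import List


def merge_topology(cached_hosts: List[str], config_hosts: List[str]) -> List[str]:
    """Keep the last known topology and extend it with hosts from the config file.

    Deduplicate while preserving order: cached hosts first, config hosts after.
    """
    merged = list(cached_hosts)
    for host in config_hosts:
        if host not in merged:
            merged.append(host)
    return merged


# Example: the cached topology is stale, but the config file still lists a live node.
print(merge_topology(['10.0.0.1:2379', '10.0.0.2:2379'],
                     ['etcd.internal:2379', '10.0.0.2:2379']))
```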
2023-03-14 15:54:17 +01:00
Alexander Kukushkin
45e5ac2baf Remove patronictl scaffold (#2544)
The only reason for having it was a hacky way of running standby clusters.
2023-01-27 08:52:59 +01:00
Alexander Kukushkin
4872ac51e0 Citus integration (#2504)
A Citus cluster (coordinator and workers) will be stored in DCS as a fleet of Patroni clusters logically grouped together:
```
/service/batman/
/service/batman/0/
/service/batman/0/initialize
/service/batman/0/leader
/service/batman/0/members/
/service/batman/0/members/m1
/service/batman/0/members/m2
/service/batman/
/service/batman/1/
/service/batman/1/initialize
/service/batman/1/leader
/service/batman/1/members/
/service/batman/1/members/m1
/service/batman/1/members/m2
...
```

Where 0 is the Citus group of the coordinator and 1, 2, etc. are worker groups.

Such a hierarchy allows reading the entire Citus cluster with a single call to the DCS (except for ZooKeeper).

The get_cluster() method will read the entire Citus cluster on the coordinator because it needs to discover workers. For a worker cluster it will read only the subtree of its own group.

Besides that, we introduce a new method, get_citus_coordinator(). It will be used only by worker clusters.

Since there are no hierarchical structures on K8s, we will use the Citus group suffix on all objects that Patroni creates.
E.g.
```
batman-0-leader  # the leader config map for the coordinator
batman-0-config  # the config map holding initialize, config, and history "keys"
...
batman-1-leader  # the leader config map for worker group 1
batman-1-config
...
```

Citus integration is enabled from patroni.yaml:
```yaml
citus:
  database: citus
  group: 0  # 0 is for coordinator, 1, 2, etc are for workers
```

If enabled, Patroni will create the database and the citus extension in it, and INSERT INTO `pg_dist_authinfo` the information required for Citus nodes to communicate with each other, i.e. 'password', 'sslcert', and 'sslkey' for the superuser if they are defined in the Patroni configuration file.

When the new Citus coordinator/worker is bootstrapped, Patroni adds `synchronous_mode: on` to the `bootstrap.dcs` section.

Besides that, Patroni takes over management of some Postgres GUCs (a rough sketch follows the list):
- `shared_preload_libraries` - Patroni ensures that "citus" is added at the first position
- `max_prepared_transactions` - if not set or set to 0, Patroni changes the value to `max_connections*2`
- `wal_level` - automatically set to `logical`. It is used by Citus to move/split shards. Under the hood Citus creates/removes replication slots, and they are automatically added by Patroni to the `ignore_slots` configuration to avoid accidental removal.
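A simplified sketch of those adjustments (illustrative helper; the real logic lives in the Citus-specific configuration code):

```python
from typing import Any, Dict


def adjust_citus_gucs(parameters: Dict[str, Any]) -> None:
    """Sketch of the GUC adjustments Patroni performs when Citus is enabled."""
    # make sure "citus" is the first entry of shared_preload_libraries
    libraries = [v.strip() for v in str(parameters.get('shared_preload_libraries', '')).split(',') if v.strip()]
    if 'citus' in libraries:
        libraries.remove('citus')
    parameters['shared_preload_libraries'] = ','.join(['citus'] + libraries)

    # if max_prepared_transactions is unset or 0, derive it from max_connections
    if not int(parameters.get('max_prepared_transactions', 0) or 0):
        parameters['max_prepared_transactions'] = int(parameters.get('max_connections', 100)) * 2

    # Citus shard move/split relies on logical decoding
    parameters['wal_level'] = 'logical'
```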

The coordinator primary actively discovers worker primary nodes and registers/updates them in the `pg_dist_node` table using the citus_add_node() and citus_update_node() functions.

Patroni running on the coordinator provides a new REST API endpoint: `POST /citus`. It is used by workers to facilitate controlled switchovers and restarts of worker primaries.
When the worker primary needs to shut down Postgres because of a restart or switchover, it calls the `POST /citus` endpoint on the coordinator, and Patroni on the coordinator starts a transaction and calls `citus_update_node(nodeid, 'host-demoted', port)` in order to pause client connections that work with the given worker.
Once the new leader is elected or Postgres is started back, it performs another call to the `POST /citus` endpoint, which does another `citus_update_node()` call with the actual hostname and port and commits the transaction. After the transaction is committed, the coordinator reestablishes connections to the worker node and client connections are unblocked.
If clients don't run long transactions, the operation finishes without client-visible errors, only a short latency spike.

All operations on `pg_dist_node` are serialized by Patroni on the coordinator. That gives more control and allows rolling back a transaction in progress if its lifetime exceeds a certain threshold while other worker nodes still need to be updated.
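A rough, illustrative sketch of the coordinator-side handling (greatly simplified; in reality the open transaction is kept on a dedicated connection between the two REST calls, and all requests are serialized):

```python
def handle_citus_request(conn, nodeid: int, host: str, port: int, promoted: bool) -> None:
    """Pause or resume client traffic to a worker around a switchover/restart.

    `conn` is a DB-API connection to the coordinator that stays open between the
    two calls; names and the flow are illustrative only.
    """
    cur = conn.cursor()
    if not promoted:
        # worker primary is about to shut down: park its record on a non-existing
        # host inside an open transaction, which pauses client connections
        cur.execute("SELECT citus_update_node(%s, %s, %s)", (nodeid, host + '-demoted', port))
    else:
        # new leader elected or postgres started back: restore the real host/port
        # and commit, which unblocks the paused client connections
        cur.execute("SELECT citus_update_node(%s, %s, %s)", (nodeid, host, port))
        conn.commit()
```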
2023-01-24 16:14:58 +01:00
Alexander Kukushkin
2ea0357854 DCS failsafe mode (#2379)
If enabled, it will allow Patroni to cope with DCS outages.
In case of a DCS outage the leader tries to call all remaining members of the cluster via the API, and if all of them respond with success the leader will not be demoted.
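Conceptually the check looks roughly like this (hypothetical helper names, not the actual implementation):

```python
from typing import Callable, List


def failsafe_is_active(members: List[str], call_member: Callable[[str], bool]) -> bool:
    """True if every remaining member acknowledged the leader via its REST API.

    `call_member` is a stand-in for the REST API request described above and
    returns True on a successful response; an empty list trivially succeeds.
    """
    return all(call_member(member) for member in members)


# During a DCS outage the leader keeps its role only while failsafe_is_active(...)
# returns True; otherwise it demotes as before.
```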

The failsafe_mode can be enabled by running
```sh
patronictl edit-config -s failsafe_mode=true
```

or by calling the `/config` REST API endpoint.

Co-authored-by: Polina Bungina <bungina@gmail.com>
2023-01-13 13:35:05 +01:00
Alexander Kukushkin
92d3e1c167 Introduce the failsafe key in DCS (#2485)
Extracted from #2379
2022-12-13 11:35:06 +01:00
Alexander Kukushkin
6ad5fee99d Raise DCSError when communication with DCS fails (#2484)
Previously such an exception was raised only from the `get_cluster()` method, and now we do the same from the `update_leader()` and `attempt_to_acquire_leader()` methods.

These methods influence Postgres promotion and demotion, and we want to distinguish between different types of failures, specifically whether calls have failed because the DCS isn't accessible or due to a timeout.

This commit is extracted from #2379.
2022-12-13 11:06:55 +01:00
Alexander Kukushkin
dc9ff4cb8a Release 2.1.2 (#2136)
* Implement missing unit-tests
* Bump version
* Update release notes
2021-12-03 15:49:57 +01:00
DavidPavlicek
195b8bf049 Support for ETCD SRV name suffix (#2029)
Add support for ETCD SRV name suffix as per the description in the etcd docs:

> The -discovery-srv-name flag additionally configures a suffix to the SRV name that is queried during discovery. Use this flag to differentiate between multiple etcd clusters under the same domain. For example, if discovery-srv=example.com and -discovery-srv-name=foo are set, the following DNS SRV queries are made:
> 
> _etcd-server-ssl-foo._tcp.example.com
> _etcd-server-foo._tcp.example.com

All tests pass, but it has not been tested on a live ETCD system yet... Please take a look and send feedback.
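The queried names can be derived roughly like this (a small sketch following the quoted docs):

```python
from typing import List


def srv_query_names(domain: str, suffix: str = '') -> List[str]:
    """Build the SRV record names to query, optionally with a -discovery-srv-name suffix."""
    suffix = '-' + suffix if suffix else ''
    return ['_etcd-server-ssl%s._tcp.%s' % (suffix, domain),
            '_etcd-server%s._tcp.%s' % (suffix, domain)]


print(srv_query_names('example.com', 'foo'))
# ['_etcd-server-ssl-foo._tcp.example.com', '_etcd-server-foo._tcp.example.com']
```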

Resolves #2028
2021-08-13 15:49:01 +02:00
Alexander Kukushkin
c7173aadd7 Failover logical slots (#1820)
Effectively, this PR consists of a few changes:

1. The easy part:
  In case permanent logical slots are defined in the global configuration, Patroni on the primary will not only create them, but also periodically update the DCS with the current values of `confirmed_flush_lsn` for all these slots.
  In order to reduce the number of interactions with the DCS, the new `/status` key was introduced. It will contain a JSON object with `optime` and `slots` keys. For backward compatibility the `/optime/leader` will be updated if there are members with old Patroni in the cluster.

2. The tricky part:
  On replicas that are eligible for a failover, Patroni creates the logical replication slot by copying the slot file from the primary and restarting the replica. In order to copy the slot file, Patroni opens a connection to the primary with `rewind` or `superuser` credentials and calls the `pg_read_binary_file()` function.
  When the logical slot already exists on the replica, Patroni periodically calls the `pg_replication_slot_advance()` function, which allows moving the slot forward (a rough sketch follows this list).

3. Additional requirements:
  In order to ensure that the primary doesn't clean up tuples from pg_catalog that are required for logical decoding, Patroni enables `hot_standby_feedback` on replicas with logical slots and on cascading replicas if they are used for streaming by replicas with logical slots.

4. When logical slots are copied to the replica, there is a timeframe when it could be unsafe to use them after promotion. Right now there is no protection against promoting such a replica, but Patroni will show a warning with the names of the slots that might not be safe to use.
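A rough sketch of the replica-side slot maintenance from point 2 (simplified; the real code also handles errors, version checks and file placement):

```python
def copy_slot_file_from_primary(primary_cursor, slot_name: str) -> bytes:
    """Read the on-disk state of a logical slot from the primary.

    Executed over a superuser/rewind connection; the returned bytes are written
    to pg_replslot/<slot_name>/state on the replica, followed by a restart.
    """
    primary_cursor.execute("SELECT pg_read_binary_file(%s)",
                           ('pg_replslot/' + slot_name + '/state',))
    return primary_cursor.fetchone()[0]


def advance_local_slot(local_cursor, slot_name: str, target_lsn: str) -> None:
    """Move an already existing logical slot forward to the LSN published in DCS."""
    local_cursor.execute("SELECT pg_replication_slot_advance(%s, %s)",
                         (slot_name, target_lsn))
```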

Compatibility.
The `pg_replication_slot_advance()` function is only available starting from PostgreSQL 11. For older Postgres versions Patroni will refuse to create the logical slot on the primary.

The old "permanent slots" feature, which creates logical slots right after promotion and before allowing connections, was removed.

Close: https://github.com/zalando/patroni/issues/1749
2021-03-25 16:18:23 +01:00
Alexander Kukushkin
94b9f8fae6 Silence unhandled exceptions in Thread.run() during unit-tests (#1802)
Python 3.8 changed the way exceptions raised from the Thread.run() method are handled.
This resulted in unit tests showing a couple of warnings. They are not important and we just silence them.
2020-12-16 19:37:51 +01:00
Alexander Kukushkin
3341c898ff Add Etcd v3 protocol support via api gRPC-gateway (#1162)
The only python-etcd3 client working directly via gRPC still supports only a single endpoint, which is not very nice for high availability.

Since Patroni is already using a heavily hacked version of python-etcd with smart retries and auto-discovery out-of-the-box, I decided to enhance the existing code with limited support of v3 protocol via gRPC-gateway.

Unfortunately, watches via gRPC-gateway require us to open and keep a second connection to etcd.

Known limitations:
* The minimal supported version is 3.0.4. On earlier versions transactions don't work due to bugs in grpc-gateway. Without transactions we can't do atomic operations, i.e. leader locks.
* Watches work only starting from 3.1.0
* Authentication works only starting from 3.3.0
* gRPC-gateway does not support authentication using TLS Common Name. This is because gRPC-proxy terminates TLS from its client so all the clients share a cert of the proxy: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/authentication.md#using-tls-common-name
2020-07-31 14:33:40 +02:00
Alexander Kukushkin
04b9fb9dd4 Make sure cached last_leader_operation is up-to-date on replicas (#1600)
Patroni is caching the cluster view in the DCS object because not all operations require the most up-to-date values. The cached version is valid for TTL seconds. So far it worked quite well, the only known problem was that the `last_leader_operation` for some DCS implementations was not very up-to-date:

* Etcd: since the `/optime/leader` key is updated right after the `/leader` key, usually all replicas get the value from the previous HA loop. Therefore the value is somewhere between `loop_wait` and `loop_wait*2` old. We improve it by using the 10ms artificial sleep after receiving watch notification from `compareAndSwap` operation on the leader key. It usually gives enough time for the primary to update the `/optime/leader`. On average that makes the cached version `loop_wait/2` old.

* ZooKeeper: Patroni itself is not so much interested in most up-to-date values of member and leader/optime ZNodes. In case of the leader race it just reads everything from ZooKeeper, but during normal operation it is relying on cache. In order to see the recent value on replicas they are doing watch on the `leader/optime` Znode and will re-read it after it was updated by the primary. On average that makes the cached version `loop_wait/2` old.

* Kubernetes: last_leader_operation is stored in the same object as the leader key itself and therefore update is atomic and we always see the latest version. That makes the cached version `loop_wait/2` old on avg.

* Consul: HA loops on the primary and replicas are not synchronized, therefore at the moment when we read the cluster state from the Consul KV we see the last_leader_operation value that is between 0 and loop_wait old. On average that makes the cached version `loop_wait` old. Unfortunately we can't make it much better without performing periodic updates from Consul, which might have negative side effects.

Since the `optime/leader` is only updated at most once per HA loop cycle, the value stored in the DCS is usually `loop_wait/2` old on average. For the majority of DCS implementations we could promise that the cached version in Patroni will match the value in the DCS most of the time, therefore there is no need to make additional requests. The only exception is Consul, but probably we could just document it, so that someone relying on the last_leader_operation value to check the replication lag can adjust thresholds correspondingly.

Will help to implement #1599
2020-07-15 10:31:32 +02:00
Alexander Kukushkin
be4c078d95 Etcd smart refresh members (#1499)
In dynamic environments it is common that during a rolling upgrade etcd nodes change their IP addresses. If the etcd node Patroni is currently connected to is upgraded last, it could happen that the cached topology doesn't contain any live node anymore and therefore the request can't be retried and totally fails, usually resulting in demoting the primary.

In order to partially overcome the problem, Patroni is already doing a periodic (every 5 minutes) rediscovery of the etcd cluster topology, but in case of very fast node rotation there was still a possibility to hit the issue.

This PR is an attempt to address the problem. If the list of nodes is exhausted, Patroni will try to perform initial discovery via an external mechanism, like resolving A or SRV DNS records, and if the new list is different from the original, Patroni will use it as the new etcd cluster topology.

In order to deal with TCP issues the connect_timeout is set to max(read_timeout/2, 1). It will make the list of members exhaust faster, but leaves time to perform topology rediscovery and another attempt.

The third issue addressed by this PR: it could happen that the DNS names of etcd nodes didn't change but the IP addresses are new, therefore we clean up the internal DNS cache when doing topology rediscovery.

Besides that, this commit makes the `_machines_cache` property pretty much static; it will be updated only when the topology has changed, which helps to avoid concurrency issues.
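A simplified sketch of re-running the initial discovery when the cached member list is exhausted (hypothetical helper; SRV records and per-host schemes are omitted):

```python
import socket
from typing import List


def rediscover_hosts(initial_hosts: List[str]) -> List[str]:
    """Re-resolve the initially configured host names and return fresh host:port pairs."""
    resolved = []
    for host in initial_hosts:
        name, _, port = host.partition(':')  # good enough for a sketch (no IPv6 literals)
        for info in socket.getaddrinfo(name, port or '2379',
                                       socket.AF_UNSPEC, socket.SOCK_STREAM):
            addr = '%s:%s' % (info[4][0], info[4][1])
            if addr not in resolved:
                resolved.append(addr)
    return resolved
```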
2020-04-23 12:51:05 +02:00
Alexander Kukushkin
90a4208390 Get rid from requests module (#1296)
It wasn't used for anything critical anyway, so it doesn't make a lot of sense to keep it as an explicit dependency.
2019-11-22 15:31:55 +01:00
Alexander Kukushkin
94b7ff656e Don't give up on retry too early (#1245)
Fixes https://github.com/zalando/patroni/issues/1195
2019-10-31 09:33:16 +01:00
Alexander Kukushkin
a4bd6a9b4b Refactor postgresql class (#1060)
* Convert postgresql.py into a package
* Factor out cancellable process into a separate class
* Factor out connection handler into a separate class
* Move postmaster into postgresql package
* Factor out pg_rewind into a separate class
* Factor out bootstrap into a separate class
* Factor out slots handler into a separate class
* Factor out postgresql config handler into a separate class
* Move callback_executor into postgresql package

This is just a careful refactoring, without code changes.
2019-05-21 16:02:47 +02:00
Alexander Kukushkin
4a4258fc3f Mock external resources (#995)
Unit tests should not accidentally hit a running Postgres, the DCS, or the filesystem unless we explicitly want them to.
2019-03-12 10:39:42 +01:00
Alexander Kukushkin
c64d51f79c Better support for static etcd cluster (#986)
If `etcd.use_proxies` is set to true, Patroni will stick to the list of hosts specified in `etcd.hosts` and avoid doing topology discovery. Such a mode might be useful when you know that you connect to the etcd cluster via a set of proxies or when the etcd cluster has a static topology.
2019-03-07 11:36:36 +01:00
Alexander Kukushkin
76d1b4cfd8 Minor fixes (#808)
* Use `shutil.move` instead of `os.replace`, which is available only from 3.3
* Introduce standby-leader health-check and consul service
* Improve unit tests, some lines were not covered
* Rename `assertEquals` -> `assertEqual`, due to deprecation warning
2018-09-19 16:32:33 +02:00
Alexander Kukushkin
5ce18a8045 Improve protection of DCS being accidentally wiped (#680)
We already have a lot of logic in place to prevent failover in such a case and restore all keys, but an accidental removal of the `/config` key was effectively switching off pause mode for one cycle of the HA loop.
2018-05-18 11:18:58 +02:00
Alexander Kukushkin
89a11fed07 Don't rediscover etcd cluster topology when watch timed out (#630)
but switch to the next node if it is possible.

Fixes https://github.com/zalando/patroni/issues/628
2018-02-26 18:48:30 +01:00
Alexander Kukushkin
03c2a85d23 Expose current timeline in DCS and via API (#591)
It is very easy to get the current timeline on the master by executing
```sql
SELECT ('x' || SUBSTR(pg_walfile_name(pg_current_wal_lsn()), 1, 8))::bit(32)::int
```

Unfortunately the same method doesn't work when Postgres is_in_recovery. Therefore we will use a replication connection for that on the replicas. In order to avoid opening and closing a replication connection on every HA loop, we will cache the result if its value matches the timeline of the master.

Also this PR introduces a new key in DCS: `/history`. It will contain a JSON-serialized object with the timeline history in a format similar to the usual history files (an illustrative example follows the list). The differences are:
* The second column is the absolute WAL position in bytes, instead of an LSN
* Optionally there might be a fourth column: a timestamp (mtime of the history file)
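An illustrative example of what the `/history` key could contain (all values made up):

```python
# each entry: [timeline, absolute WAL position in bytes of the switch,
#              reason, optional timestamp taken from the history file mtime]
history = [
    [1, 25165824, "no recovery target specified", "2018-01-05 12:00:00+00:00"],
    [2, 50331648, "no recovery target specified"],
]
```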
2018-01-05 15:25:56 +01:00
Alexander Kukushkin
0e01bb33bb Improve patronictl reinit (#576)
Make it possible to cancel a running task if you want to reinitialize a replica.
There are two possible ways to trigger it:
1. patronictl will ask whether you want to cancel the already running task if an attempt to trigger reinitialize has failed
2. use the `--force` argument with `patronictl reinit`
2018-01-04 10:31:44 +01:00
Alexander Kukushkin
b6425cab85 Allow to specify multiple hosts for etcd (#589)
This list will be used for the initial discovery of etcd cluster members.
If for some reason this list of hosts has been exhausted during operation, Patroni will return to the initial list.

In addition to that, improve IPv6 compatibility by using a special function for splitting host and port.

Fixes https://github.com/zalando/patroni/issues/523
2018-01-04 10:25:06 +01:00
Alexander Kukushkin
4328c15010 Make Patroni Kubernetes native (#500)
* Use ConfigMaps or Endpoints for leader elections and to keep the cluster state
* Label pods with a postgres role
* Change the behavior of pip install. From now on it will not install all dependencies; you have to explicitly specify the DCS you want to use Patroni with: `pip install patroni[etcd,zookeeper,kubernetes]`
2017-12-08 16:55:00 +01:00
Ants Aasma
856a13e24c Remove error spinning on etcd failure and reduce log spam (#429)
When all etcd servers refuse connections during a watch, the call will fail with an exception and will be immediately retried. This creates a huge amount of log spam, potentially creating additional issues on top of losing the DCS. This patch takes note of repeating etcd failures and, starting from the second failure, will sleep for a second before retrying. It additionally omits the stack trace after the first failure in a streak of failures.
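In pseudocode the new behaviour is roughly (illustrative structure, not the actual code):

```python
import logging
import time

logger = logging.getLogger(__name__)


def watch_with_backoff(do_watch) -> None:
    """Retry a failing etcd watch without spamming the log."""
    failures = 0
    while True:
        try:
            do_watch()
            failures = 0  # streak is over
        except Exception:
            failures += 1
            if failures == 1:
                logger.exception('watch failed')  # full stack trace only once per streak
            else:
                logger.error('watch failed again')  # no stack trace for repeats
                time.sleep(1)  # back off starting from the second failure
```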
2017-04-20 12:40:15 +02:00
Alexander Kukushkin
1ed91a93c6 Handle EtcdEventIndexCleared and EtcdWatcherCleared exceptions (#387)
In this case it doesn't make sense to retry, because it brings nothing
but produces a lot of exceptions in the log...
2017-02-16 17:07:09 +01:00
Alexander Kukushkin
c6252bc004 Don't resolve url hostnames manually but monkey patch urllib3 (#385)
Replacing hostnames with IP addresses was causing certificate verification to
fail. Instead of doing that, we will rather monkey patch the urllib3
functionality which does name resolution. It should work without problems
even for HTTPS connections.
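The general idea looks roughly like this (a sketch of the monkey-patching approach; Patroni's actual resolver and caching are more involved):

```python
import socket

from urllib3.util import connection

_original_create_connection = connection.create_connection


def _patched_create_connection(address, *args, **kwargs):
    # Resolve the name ourselves at the socket level, so the higher layers still
    # see the original hostname and certificate verification keeps working.
    host, port = address
    resolved_ip = socket.gethostbyname(host)  # stand-in for a smarter resolver/cache
    return _original_create_connection((resolved_ip, port), *args, **kwargs)


connection.create_connection = _patched_create_connection
```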
2017-01-18 13:46:02 +01:00
Alexander Kukushkin
711d53980f Call self._load_machines_cache() method on timeout is causing switch to
a new server every 5 minutes
2017-01-12 17:30:18 +01:00
Alexander Kukushkin
d138a8db17 AT for master_start_timeout + minor fixes (#361) 2016-12-09 12:02:41 +01:00
Ants Aasma
1290b30b84 Introduce starting state and master start timeout. (#295)
Previously pg_ctl waited for a timeout and then happily carried on, considering PostgreSQL to be running. This caused PostgreSQL to show up in listings as running when it actually was not, and caused a race condition that resulted in either a failover, or a crash recovery, or a crash recovery interrupted by a failover and a missed rewind.

This change adds a master_start_timeout parameter and introduces a new state for the main run_cycle loop: starting. When master_start_timeout is zero we will fail over as soon as there is a failover candidate. Otherwise PostgreSQL will be started, but once master_start_timeout expires we will stop it and release the leader lock if failover is possible. Once failover succeeds or fails (no leader and no one to take the role) we continue with normal processing. While we are waiting for the master start timeout we handle manual failover requests.

* Introduce timeout parameter to restart.

When the restart timeout is set, the master becomes eligible for failover after that timeout expires, regardless of master_start_time. Immediate restart calls will wait for this timeout to pass, even when the node is a standby.
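A condensed sketch of the decision logic while the master is starting (illustrative names only):

```python
import time


def starting_state_action(start_time: float, master_start_timeout: float,
                          postgres_running: bool, failover_candidate: bool) -> str:
    """Return what the HA loop should do while PostgreSQL is still starting."""
    if postgres_running:
        return 'continue normal processing'
    if master_start_timeout == 0 and failover_candidate:
        return 'fail over immediately'
    if time.time() - start_time > master_start_timeout and failover_candidate:
        return 'stop postgres and release the leader lock'
    return 'keep waiting; manual failover requests are still handled'
```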
2016-12-08 14:44:27 +01:00
Alexander Kukushkin
ec78777778 Implement simple asynchronous dns-resolve cache (#360) 2016-12-07 13:16:26 +01:00
Alexander Kukushkin
b299b12f58 Various configuration parameters for etcd (#358)
* Add https and auth support for etcd

Also implement support for the PATRONI_ETCD_URL and PATRONI_ETCD_SRV
environment variables

* Implement etcd.proxy, etcd.cacert, etcd.cert and etcd.key support

Now it should be possible to set up a fully encrypted connection to etcd
with authorization.
2016-12-06 16:40:21 +01:00
Alexander Kukushkin
038b5aed72 Improve leader watch functionality (#356)
Previously replicas were always watching the leader key (even if Postgres was
not running there). It was not a big issue, but it was not possible to
interrupt such a watch in case Postgres started up or stopped successfully. It
also delayed the update_member call, so we had somewhat stale information in
DCS for up to `loop_wait` seconds. This commit changes that behavior. If the
async_executor is busy starting, stopping or restarting Postgres, we will not
watch the leader key but wait for an event from the async_executor for up to
`loop_wait` seconds. The async executor will fire such an event only if the
function it was calling returned something that evaluates to boolean True.

Such functionality is really needed to change the way we decide about the
necessity of pg_rewind. That decision will require a local Postgres to be
running, and it is really important for us to get such a notification as soon
as possible.
2016-11-22 16:22:30 +01:00
Ants Aasma
7e53a604d4 Add synchronous replication support. (#314)
Adds a new configuration variable synchronous_mode. When enabled, Patroni will manage synchronous_standby_names to enable synchronous replication whenever there are healthy standbys available. With synchronous mode enabled, Patroni will automatically fail over only to a standby that was synchronously replicating at the time of master failure. This effectively means zero lost user-visible transactions.

To enforce the synchronous failover guarantee, Patroni stores the current synchronous replication state in the DCS using strict ordering: first enable synchronous replication, then publish the information. A standby can use this to verify that it was indeed a synchronous standby before the master failed and is therefore allowed to fail over.

We can't enable multiple standbys as synchronous, allowing PostgreSQL to pick one, because we can't know which one was actually set to be synchronous on the master when it failed. This means that on standby failure commits will be blocked on the master until the next run_cycle iteration. TODO: figure out a way to poke Patroni to run sooner or allow PostgreSQL to pick one without the possibility of lost transactions.

On graceful shutdown standbys will disable themselves by setting a nosync tag and waiting for the master to notice and pick another standby. This adds a new mechanism for Ha to publish dynamic tags to the DCS.

When the synchronous standby goes away or disconnects, a new one is picked and Patroni switches master over to the new one. If no synchronous standby exists, Patroni disables synchronous replication (synchronous_standby_names=''), but not synchronous_mode. In this case, only the node that was previously master is allowed to acquire the leader lock.
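The strict ordering described above, sketched with hypothetical helpers (not the real method names):

```python
def enable_sync_replication(postgresql, dcs, standby_name: str) -> None:
    """First make replication synchronous, then publish the fact to the DCS.

    The order matters: a standby may claim the leader lock only if the DCS says
    it was synchronous, and the DCS may say so only after PostgreSQL was really
    configured that way.
    """
    postgresql.set_synchronous_standby_names(standby_name)  # step 1: enable on the master
    postgresql.reload()
    dcs.write_sync_state(leader=postgresql.name, sync_standby=standby_name)  # step 2: publish
```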

Added acceptance tests and documentation.

Implementation by @ants with extensive review by @CyberDem0n.
2016-10-19 16:12:51 +02:00