# patroni/features/citus.feature

Feature: citus
  We should check that the coordinator discovers and registers workers, and that clients don't get errors when a worker cluster switches over

  Scenario: check that worker cluster is registered in the coordinator
    Given I start postgres0 in citus group 0
    And I start postgres2 in citus group 1
    Then postgres0 is a leader in a group 0 after 10 seconds
    And postgres2 is a leader in a group 1 after 10 seconds
    When I start postgres1 in citus group 0
    And I start postgres3 in citus group 1
    Then replication works from postgres0 to postgres1 after 15 seconds
    And replication works from postgres2 to postgres3 after 15 seconds
    And postgres0 is registered in the postgres0 as the primary in group 0 after 5 seconds
    And postgres1 is registered in the postgres0 as the secondary in group 0 after 5 seconds
    And postgres2 is registered in the postgres0 as the primary in group 1 after 5 seconds
    And postgres3 is registered in the postgres0 as the secondary in group 1 after 5 seconds

  Scenario: coordinator failover updates pg_dist_node
    Given I run patronictl.py failover batman --group 0 --candidate postgres1 --force
    Then postgres1 role is the primary after 10 seconds
    And "members/postgres0" key in a group 0 in DCS has state=running after 15 seconds
    And replication works from postgres1 to postgres0 after 15 seconds
    And postgres1 is registered in the postgres2 as the primary in group 0 after 5 seconds
    And postgres0 is registered in the postgres2 as the secondary in group 0 after 15 seconds
    And "sync" key in a group 0 in DCS has sync_standby=postgres0 after 15 seconds
    When I run patronictl.py switchover batman --group 0 --candidate postgres0 --force
    Then postgres0 role is the primary after 10 seconds
    And replication works from postgres0 to postgres1 after 15 seconds
    And postgres0 is registered in the postgres2 as the primary in group 0 after 5 seconds
    And postgres1 is registered in the postgres2 as the secondary in group 0 after 15 seconds
    And "sync" key in a group 0 in DCS has sync_standby=postgres1 after 15 seconds

  Scenario: worker switchover doesn't break client queries on the coordinator
    Given I create a distributed table on postgres0
    And I start a thread inserting data on postgres0
    When I run patronictl.py switchover batman --group 1 --force
    Then I receive a response returncode 0
    And postgres3 role is the primary after 10 seconds
    And "members/postgres2" key in a group 1 in DCS has state=running after 15 seconds
    And replication works from postgres3 to postgres2 after 15 seconds
    And postgres3 is registered in the postgres0 as the primary in group 1 after 5 seconds
    And postgres2 is registered in the postgres0 as the secondary in group 1 after 15 seconds
    And "sync" key in a group 1 in DCS has sync_standby=postgres2 after 15 seconds
    And a thread is still alive
    When I run patronictl.py switchover batman --group 1 --force
    Then I receive a response returncode 0
    And postgres2 role is the primary after 10 seconds
    And replication works from postgres2 to postgres3 after 15 seconds
    And postgres2 is registered in the postgres0 as the primary in group 1 after 5 seconds
    And postgres3 is registered in the postgres0 as the secondary in group 1 after 15 seconds
    And "sync" key in a group 1 in DCS has sync_standby=postgres3 after 15 seconds
    And a thread is still alive
    When I stop a thread
    Then a distributed table on postgres0 has expected rows

  Scenario: worker primary restart doesn't break client queries on the coordinator
    Given I cleanup a distributed table on postgres0
    And I start a thread inserting data on postgres0
    When I run patronictl.py restart batman postgres2 --group 1 --force
    Then I receive a response returncode 0
    And postgres2 role is the primary after 10 seconds
    And replication works from postgres2 to postgres3 after 15 seconds
    And postgres2 is registered in the postgres0 as the primary in group 1 after 5 seconds
    And postgres3 is registered in the postgres0 as the secondary in group 1 after 15 seconds
    And a thread is still alive
    When I stop a thread
    Then a distributed table on postgres0 has expected rows

  Scenario: check that in-flight transaction is rolled back after timeout when other workers need to change pg_dist_node
    Given I start postgres4 in citus group 2
    Then postgres4 is a leader in a group 2 after 10 seconds
    And "members/postgres4" key in a group 2 in DCS has role=primary after 3 seconds
    When I run patronictl.py edit-config batman --group 2 -s ttl=20 --force
    Then I receive a response returncode 0
    And I receive a response output "+ttl: 20"
    And postgres4 is registered in the postgres2 as the primary in group 2 after 5 seconds
    When I shut down postgres4
    Then there is a transaction in progress on postgres0 changing pg_dist_node after 5 seconds
    When I run patronictl.py restart batman postgres2 --group 1 --force
    Then a transaction finishes in 20 seconds