Re: [BDR] Node Join Question

From: "Wayne E(dot) Seguin" <wayneeseguin(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: [BDR] Node Join Question
Date: 2015-05-12 06:36:09
Message-ID: CANf8RLu1AZoemPrRVShQ29yJt1Wzq9WY-PEP-LPBx3Th_PqE+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Also,

Is there a way to remove these things from the init target node easier?

d= p=504 a=ERROR: 55000: previous init failed, manual cleanup is required
d= p=504 a=DETAIL: Found bdr.bdr_nodes entry for bdr
(6147869128174526660,1,16908,) with state=i in remote bdr.bdr_nodes
d= p=504 a=HINT: Remove all replication identifiers and slots
corresponding to this node from the init target node then drop and recreate
this database and try again

On Mon, May 11, 2015 at 11:33 PM, Wayne E. Seguin <wayneeseguin(at)gmail(dot)com>
wrote:

> To the wonderful BDR folks,
>
> I have a 5 node pg+bdr cluster that I've spun up. I have verified that
> each node can connect via psql to each other node. I am having an issue
> creating the bdr group and having the nodes join each other.
>
> The steps I am taking are as follows (I used pg compiled using getting
> started guide instructions):
> 1. initdb
> 2. start postgres
> 3. create database pgbdr
> 4. create extension btree_gist (actually I'm doing this on template1 for
> my purposes)
> 5. create extension bdr
> 6. on node 0 I am doing:
> SELECT bdr.bdr_group_create(
> local_node_name := 'pgbdr0',
> node_external_dsn := 'host=10.244.2.2 port=5432 user=postgres
> dbname=pgbdr'
> );
> 7. on nodes 1-4 I am doing (adjusted for the nodes IP):
> SELECT bdr.bdr_group_join(
> local_node_name := 'pgbdr1',
> node_external_dsn := 'host=10.244.2.6 port=5432 user=postgres
> dbname=pgbdr',
> join_using_dsn := 'host=10.244.2.2 port=5432 user=postgres
> dbname=pgbdr'
> );
>
> Sometimes this works and all nodes are joined, but it isn't consistent and
> I'm not sure at the moment how to debug why.
>
> From the most recent attempt on node 0:
>
> pgbdr=# SELECT * FROM bdr.bdr_nodes;
> node_sysid | node_timeline | node_dboid | node_status |
> node_name | node_local_dsn |
> node_init_from_dsn
>
> ---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+--------------------
> 6147857338297172162 | 1 | 16908 | r | rdpg0
> | host=10.244.2.2 port=5432 user=postgres dbname=pgbdr |
> (1 row)
>
> From the most recent attempt on node 1:
>
> pgbdr=# SELECT * FROM bdr.bdr_nodes;
> node_sysid | node_timeline | node_dboid | node_status |
> node_name | node_local_dsn |
> node_init_from_d
> sn
>
> ---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+----------------------------------
> --------------------
> 6147857345042854082 | 1 | 16908 | i | rdpg1
> | host=10.244.2.6 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2
> port=5432 user=po
> stgres dbname=pgbdr
> (1 row)
>
> I have pasted some relevant logs from node 0 further below.
>
> Even if I use 'SELECT bdr.bdr_node_join_wait_for_ready();' it will
> usually just hang.
>
> How can I debug why the nodes are not connecting to each other find out
> where they getting stuck trying to join?
>
> ~Wayne
>
> Wayne E. Seguin
> wayneeseguin(at)gmail(dot)com
> wayneeseguin on irc.freenode.net
> http://twitter.com/wayneeseguin/
> https://github.com/wayneeseguin/
>
>
>
> d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000:
> received replication command: IDENTIFY_SYSTEM
> [171/773]
> d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
> exec_replication_command, walsender.c:1291
> d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000:
> received replication command: CREATE_REPLICATION_SLOT
> "bdr_16908_6147857345042854082_1_16908__"
> LOGICAL bdr
> d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
> exec_replication_command, walsender.c:1291
> d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000:
> searching for logical decoding starting point, starting at 0/18E87A0
> d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
> DecodingContextFindStartpoint, logical.c:469
> d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 00000:
> logical decoding found consistent point at 0/18E87A0
> d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DETAIL: There are
> no running transactions.
> d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
> SnapBuildFindSnapshot, snapbuild.c:1271
> d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 00000:
> exported logical decoding snapshot: "0000071E-1" with 0 transaction IDs
> d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
> SnapBuildExportSnapshot, snapbuild.c:604
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000:
> received replication command: IDENTIFY_SYSTEM
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
> exec_replication_command, walsender.c:1291
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000:
> received replication command: CREATE_REPLICATION_SLOT
> "bdr_16908_6147857344392790209_1_16908__"
> LOGICAL bdr
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
> exec_replication_command, walsender.c:1291
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000:
> searching for logical decoding starting point, starting at 0/18EC970
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
> DecodingContextFindStartpoint, logical.c:469
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000:
> logical decoding found initial starting point at 0/18EC970
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL: 1
> transaction needs to finish.
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
> SnapBuildFindSnapshot, snapbuild.c:1338
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000:
> received replication command: IDENTIFY_SYSTEM
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
> exec_replication_command, walsender.c:1291
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000:
> received replication command: CREATE_REPLICATION_SLOT
> "bdr_16908_6147857340000055490_1_16908__"
> LOGICAL bdr
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
> exec_replication_command, walsender.c:1291
> d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 08006:
> unexpected EOF on client connection with an open transaction
> d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:
> SocketBackend, postgres.c:344
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000:
> searching for logical decoding starting point, starting at 0/18EC9F0
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
> DecodingContextFindStartpoint, logical.c:469
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000:
> logical decoding found initial starting point at 0/18EC9F0
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL: 1
> transaction needs to finish.
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
> SnapBuildFindSnapshot, snapbuild.c:1338
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000:
> logical decoding found consistent point at 0/18ECA30
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL:
> Transaction ID 1822 finished; no more running transactions.
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
> SnapBuildEndTxn, snapbuild.c:907
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000:
> logical decoding found consistent point at 0/18ECA30
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL:
> Transaction ID 1822 finished; no more running transactions.
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
> SnapBuildEndTxn, snapbuild.c:907
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000:
> exported logical decoding snapshot: "00000720-1" with 0 transaction IDs
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
> SnapBuildExportSnapshot, snapbuild.c:604
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000:
> exported logical decoding snapshot: "0000071F-1" with 1 transaction ID
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
> SnapBuildExportSnapshot, snapbuild.c:604
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000:
> received replication command: IDENTIFY_SYSTEM
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
> exec_replication_command, walsender.c:1291
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000:
> received replication command: CREATE_REPLICATION_SLOT
> "bdr_16908_6147857340923527359_1_16908__"
> LOGICAL bdr
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
> exec_replication_command, walsender.c:1291
> d= p=392 a=DEBUG: 00000: started 0 elections
> d= p=392 a=LOCATION: bdr_sequencer_start_elections, bdr_seq.c:688
> d= p=392 a=DEBUG: 00000: started 0 votes
> d= p=392 a=LOCATION: bdr_sequencer_vote, bdr_seq.c:629
> d= p=392 a=DEBUG: 00000: tallied 0 elections
> d= p=392 a=LOCATION: bdr_sequencer_tally, bdr_seq.c:755
> d= p=392 a=DEBUG: 00000: checked 0 sequences for filling
> d= p=392 a=LOCATION: bdr_sequencer_fill_sequences, bdr_seq.c:1034
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 08006:
> unexpected EOF on client connection with an open transaction
> d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:
> SocketBackend, postgres.c:344
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 08006:
> unexpected EOF on client connection with an open transaction
> d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:
> SocketBackend, postgres.c:344
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000:
> searching for logical decoding starting point, starting at 0/18ECA60
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
> DecodingContextFindStartpoint, logical.c:469
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000:
> logical decoding found initial starting point at 0/18ECA60
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL: 2
> transactions need to finish.
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
> SnapBuildFindSnapshot, snapbuild.c:1338
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000:
> logical decoding found consistent point at 0/18ECAD0
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL:
> Transaction ID 1823 finished; no more running transactions.
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
> SnapBuildEndTxn, snapbuild.c:907
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000:
> exported logical decoding snapshot: "00000721-1" with 0 transaction IDs
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
> SnapBuildExportSnapshot, snapbuild.c:604
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 08006:
> unexpected EOF on client connection with an open transaction
> d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:
> SocketBackend, postgres.c:344
>

--
~Wayne

Wayne E. Seguin
wayneeseguin(at)gmail(dot)com
wayneeseguin on irc.freenode.net
http://twitter.com/wayneeseguin/
https://github.com/wayneeseguin/

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Wayne E. Seguin 2015-05-12 06:37:23 Re: [BDR] Node Join Question
Previous Message Wayne E. Seguin 2015-05-12 06:33:59 [BDR] Node Join Question