BDR problem

From: Charles Lynch <charleslynchpostgresql(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: BDR problem
Date: 2015-09-11 21:21:41
Message-ID: CAEoYqXBH1yLBH=Fzux4TC6SKjEqcDnRBYAvaznmGy7gE0C9SCQ@mail.gmail.com
Lists: pgsql-general

So for about a month now, we've been getting things prepared to use a BDR
cluster in a production, multi-region setup on AWS. Our initial testing
produced some absolutely fantastic results, with replication delays of less
than 150ms between Singapore, Ireland, and North Virginia, and this is with
SSL encryption.

We have, just recently, run into a problem. I created a test cluster only
within NV and, after about a week of working without any problems, we got an
error: Unexpected EOF on SSL connection. I had seen something like this
before, but on initial cluster join, and chalked it up to me doing something
wrong. This time it was after a week of working without issue. I wasn't sure
what to do next. Restarting the database started producing errors like this:

LOG: starting background worker process "bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
FATAL: mismatch in worker state, got 3, expected 1
LOG: starting background worker process "bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
FATAL: mismatch in worker state, got 3, expected 1
FATAL: mismatch in worker state, got 3, expected 1
LOG: starting background worker process "bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
LOG: worker process: bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1, (PID 20300) exited with exit code 1

This would repeat. So I removed this node from the cluster using the proper
bdr commands and tried re-joining, but that just resulted in the value
reported in the error changing from 3 to 0 and the same errors repeating. I
have BDR completely automated and orchestrated using Chef, so I simply fired
up a new cluster and started over.
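
In case it helps anyone triaging a similar log, here is a minimal sketch for
tallying the repeating FATAL lines from a saved log file. Only the message
format comes from the excerpt above; the function name summarize_mismatches
and the way the sample is fed in are just illustrative choices.

```python
import re
from collections import Counter

# Message format taken from the FATAL lines in the log excerpt.
MISMATCH_RE = re.compile(
    r"FATAL:\s+mismatch in worker state, got (\d+), expected (\d+)"
)


def summarize_mismatches(log_text):
    """Count each (got, expected) pair found in the given log text."""
    counts = Counter()
    for match in MISMATCH_RE.finditer(log_text):
        got, expected = int(match.group(1)), int(match.group(2))
        counts[(got, expected)] += 1
    return counts


if __name__ == "__main__":
    # The three FATAL lines from the excerpt, used here as sample input.
    sample = "FATAL: mismatch in worker state, got 3, expected 1\n" * 3
    for (got, expected), n in sorted(summarize_mismatches(sample).items()):
        print(f"got {got}, expected {expected}: {n} occurrence(s)")
```

Running the same function over logs captured before and after a re-join would
show whether the "got" value really moved from 3 to 0 for every worker or
only for some of them.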

My problem is I don't know what caused this and, more importantly, I'm not
sure how to fix it / prevent it and I can't launch this into production
without figuring this out.

One other thing: I've seen a lot of conflicting information on how to set up
BDR on Ubuntu (using PPAs, what package to install, and where to get the
source). I'm now wondering whether I'm running an outdated version and whether
this issue has already been fixed in a newer one. Here are my build steps; if
anyone has any comments on how to set up BDR better, please let me know.

I grab postgres 9.4.4 from here:
https://github.com/2ndQuadrant/bdr/archive/bdr-pg/REL9_4_4-1.tar.gz
and compile it with "./configure --prefix=/opt/psql --with-openssl && make
-j4 -s install"

then I compile and install the btree_gist module

then I get the BDR plugin from here:
https://github.com/2ndQuadrant/bdr/archive/bdr-plugin/0.9.2.tar.gz
and compile it with "./configure && make -j4 -s all && make install"

then I init the db and set everything up: config, SSL certs, and cluster
creation and joining.
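
The build steps above can be written down as one ordered list, roughly like
the sketch below. The configure and make command lines are the ones quoted
above; everything else is an assumption on my part: that btree_gist is built
from contrib/ in the same PostgreSQL source tree, that the plugin's configure
finds PostgreSQL through pg_config on PATH, and the two source directory
names, which are placeholders for wherever the tarballs were unpacked. It
defaults to a dry run that only prints the commands.

```python
import subprocess

PREFIX = "/opt/psql"  # install prefix from the configure line quoted above


def build_steps(pg_src, bdr_src, prefix=PREFIX, jobs=4):
    """Return (working_dir, shell_command) pairs in build order."""
    return [
        # PostgreSQL 9.4.4 (BDR-patched tree), as quoted in the post.
        (pg_src,
         f"./configure --prefix={prefix} --with-openssl"
         f" && make -j{jobs} -s install"),
        # Assumption: btree_gist lives under contrib/ in the same tree.
        (pg_src, f"make -C contrib/btree_gist -j{jobs} -s install"),
        # Assumption: the plugin build locates PostgreSQL via pg_config on PATH.
        (bdr_src,
         f"export PATH={prefix}/bin:$PATH"
         f" && ./configure && make -j{jobs} -s all && make install"),
    ]


def run_steps(steps, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for cwd, cmd in steps:
        print(f"[{cwd}] {cmd}")
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(cmd, shell=True, cwd=cwd, check=True)


if __name__ == "__main__":
    # Placeholder directory names; substitute the real unpacked paths.
    run_steps(build_steps("postgres-bdr-src", "bdr-plugin-src"))
```

Having the steps in one place makes it easier to compare them against
whatever the currently recommended install procedure turns out to be.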

Any help on this would be really appreciated.

Thanks guys

Charles
