BDR: SSL error: bad write retry

From: Florin Andrei <florin(at)andrei(dot)myip(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: BDR: SSL error: bad write retry
Date: 2015-11-03 22:04:59
Message-ID: d3da40a0ff19413ffecf7ec482b7c4c6@andrei.myip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

BDR-0.9.3 and PG-9.4.4 on Ubuntu 14.04

Two nodes, BDR replication. Cluster is newly created, no nodes have been
removed from it. Making/deleting small tables works well across the
cluster.

Now I'm trying to pg_restore a larger database from another system
(pg_dump output file is 3.1 GB compressed). I am running pg_restore on
one node in the cluster, and hoping that BDR would replicate changes to
the other node.

It went well through creating schemas, extensions, comments, views,
functions, sequences and tables. It imported data into tables. Now it's
trying to create indexes - but it takes a very long time and I see
errors in the logs.

I see periodic data transfers between instances. CPU usage shoots up,
then back down. There's a cycle of a few minutes where these things
occur.

On the instance where I'm running pg_restore, I see this in the logs:

###############################################
2015-11-03 21:45:35.328
UTC,"postgres","pgmirror",4918,"10.1.1.169:36334",56392a26.1336,5,"idle",2015-11-03 21:41:58 UTC,5/146,0,LOG,08P01,"SSL error: bad write retry",,,,,"slot ""bdr_16387_6212727469166484615_1_16387__"", output plugin ""bdr"", in the change callback, associated LSN 0/2B29E50",,,,"bdr (6212727469166484615,1,16387,):receive"
2015-11-03 21:45:35.328
UTC,"postgres","pgmirror",4918,"10.1.1.169:36334",56392a26.1336,6,"idle",2015-11-03 21:41:58 UTC,5/146,0,LOG,08006,"could not receive data from client: Connection reset by peer",,,,,"slot ""bdr_16387_6212727469166484615_1_16387__"", output plugin ""bdr"", in the change callback, associated LSN 0/2B29E50",,,,"bdr (6212727469166484615,1,16387,):receive"
2015-11-03 21:45:35.328
UTC,"postgres","pgmirror",4918,"10.1.1.169:36334",56392a26.1336,7,"idle",2015-11-03 21:41:58 UTC,5/146,0,LOG,08P01,"unexpected EOF on standby connection",,,,,"slot ""bdr_16387_6212727469166484615_1_16387__"", output plugin ""bdr"", in the change callback, associated LSN 0/2B29E50",,,,"bdr (6212727469166484615,1,16387,):receive"
2015-11-03 21:45:35.330
UTC,"postgres","pgmirror",4918,"10.1.1.169:36334",56392a26.1336,8,"idle",2015-11-03 21:41:58 UTC,,0,LOG,00000,"disconnection: session time: 0:03:37.234 user=postgres database=pgmirror host=10.1.1.169 port=36334",,,,,,,,,"bdr (6212727469166484615,1,16387,):receive"
2015-11-03 21:45:40.421 UTC,,,5066,"",56392b04.13ca,1,"",2015-11-03
21:45:40 UTC,,0,LOG,00000,"connection received: host=10.1.1.169
port=36335",,,,,,,,,""
2015-11-03 21:45:40.427
UTC,"postgres","pgmirror",5066,"10.1.1.169:36335",56392b04.13ca,2,"authentication",2015-11-03 21:45:40 UTC,5/149,0,LOG,00000,"replication connection authorized: user=postgres SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)",,,,,,,,,""
2015-11-03 21:45:40.431
UTC,"postgres","pgmirror",5066,"10.1.1.169:36335",56392b04.13ca,3,"idle",2015-11-03 21:45:40 UTC,5/0,0,LOG,00000,"starting logical decoding for slot ""bdr_16387_6212727469166484615_1_16387__""","streaming transactions committing after 1/14FEDFE0, reading WAL from 0/187AFA0",,,,,,,,"bdr (6212727469166484615,1,16387,):receive"
2015-11-03 21:45:40.435
UTC,"postgres","pgmirror",5066,"10.1.1.169:36335",56392b04.13ca,4,"idle",2015-11-03 21:45:40 UTC,5/0,0,LOG,00000,"logical decoding found consistent point at 0/187AFA0","Logical decoding will begin using saved snapshot.",,,,,,,,"bdr (6212727469166484615,1,16387,):receive"
###############################################

On the other node in the cluster (supposed to receive the data via BDR)
I see this in the logs:

###############################################
2015-11-03 21:45:31.885 UTC,,,3391,,56391e57.d3f,19,,2015-11-03 20:51:35
UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2015-11-03 21:45:35.400 UTC,,,4773,,56392a26.12a5,1,,2015-11-03 21:41:58
UTC,3/201,876,ERROR,XX000,"connection to other side has
died",,,,,,,,,"bdr (6212727469166484615,1,16387,): apply"
2015-11-03 21:45:35.408 UTC,,,3388,,56391e56.d3c,34,,2015-11-03 20:51:34
UTC,,0,LOG,00000,"worker process: bdr
(6212727469166484615,1,16387,)->bdr (6212727476664052871,1, (PID 4773)
exited with exit code 1",,,,,,,,,""
2015-11-03 21:45:40.412 UTC,,,3388,,56391e56.d3c,35,,2015-11-03 20:51:34
UTC,,0,LOG,00000,"starting background worker process ""bdr
(6212727469166484615,1,16387,)->bdr
(6212727476664052871,1,""",,,,,,,,,""
###############################################

--
Florin Andrei
http://florin.myip.org/

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2015-11-03 22:57:44 Re: Taking lot time
Previous Message Florin Andrei 2015-11-03 20:35:12 Re: BDR: name conflict when joining a rebuilt node