Re: logical replication - still unstable after all these months

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: logical replication - still unstable after all these months
Date: 2017-05-26 08:45:33
Message-ID: 752c2572afefa737c49d707f5109cbf1@xs4all.nl
Lists: pgsql-hackers

On 2017-05-26 10:29, Mark Kirkwood wrote:
> On 26/05/17 20:09, Erik Rijkers wrote:
>
>> On 2017-05-26 09:40, Simon Riggs wrote:
>>>
>>> If we can find out what the bug is with a repeatable test case we can
>>> fix it.
>>>
>>> Could you provide more details? Thanks
>>
>> I will, just need some time to clean things up a bit.
>>
>>
>> But what I would like is for someone else to repeat my 100x1-minute
>> tests, taking as core that snippet I posted in my previous email. I
>> built bash-stuff around that core (to take md5's, shut-down/start-up
>> the two instances between runs, write info to log-files, etc). But it
>> would be good if someone else made that separately because if that
>> then does not fail, it would prove that my test-harness is at fault
>> (and not logical replication).
>>
>
> Will do - what I had been doing was running pgbench, waiting until the

Great!

You'll have to decide whether to test instances built from plain
master or from master plus those 4 patches. I guess either choice makes sense.

> row counts on the replica pgbench_history were the same as the
> primary, then summing the %balance and delta fields from the primary
> and replica dbs and comparing. So far - all match up ok. However I'd

I did number-summing for a while as well (because it's a lot faster than
taking md5's over the full content).
But the problem with summing is that (I think) you can never be really
sure that the result is correct: different contents can produce the same
sums (false positives), although I don't know how likely that is.
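To illustrate the false-positive concern, here is a minimal Python sketch (not the actual test harness; the function names and the (aid, abalance) rows are hypothetical). Two tables whose contents differ, with the balances swapped between two accounts, give identical column sums, while an md5 taken over the rows in a fixed order tells them apart:

```python
import hashlib


def column_sums(rows):
    """Sum each column, as a sum-based primary/replica check would."""
    return tuple(sum(col) for col in zip(*rows))


def table_md5(rows):
    """md5 over every row, in sorted order so row order does not matter."""
    h = hashlib.md5()
    for row in sorted(rows):
        h.update((",".join(map(str, row)) + "\n").encode())
    return h.hexdigest()


# Hypothetical (aid, abalance) rows: the replica has the balances swapped.
primary = [(1, 100), (2, -50)]
replica = [(1, -50), (2, 100)]

print(column_sums(primary) == column_sums(replica))  # True: sums cannot tell
print(table_md5(primary) == table_md5(replica))      # False: md5 differs
```

The sorting stands in for the fixed ordering a real comparison would need. The cost is the one mentioned above: every row has to be read and hashed, which is much slower than summing a few columns.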

> been running longer time frames (5 minutes), so not the same number
> of repetitions as yet.

I've done 3600-, 30- and 15-minute runs too, but in this case (these 100x
tests) I especially wanted to test the area around the startup/initialisation
of logical replication. Also the increasing quality of logical replication
(once it runs with the correct

thanks,

Erik Rijkers
