Re: Quorum commit for multiple synchronous replication.

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Vik Fearing <vik(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Quorum commit for multiple synchronous replication.
Date: 2016-12-12 16:24:06
Message-ID: CAD21AoBqgOvUX_mdKu6GNDkB57Ch4E1SBO-b-3YEhErw0iVrug@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Dec 12, 2016 at 9:52 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Mon, Dec 12, 2016 at 9:31 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>> On Sat, Dec 10, 2016 at 5:17 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>> On Thu, Dec 8, 2016 at 3:02 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>>> On Thu, Dec 8, 2016 at 4:39 PM, Michael Paquier
>>>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>>>> On Thu, Dec 8, 2016 at 9:07 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>>>> You could do that, but first I would code up the simplest, cleanest
>>>>>> algorithm you can think of and see if it even shows up in a 'perf'
>>>>>> profile. Microbenchmarking is probably overkill here unless a problem
>>>>>> is visible on macrobenchmarks.
>>>>>
>>>>> This is what I would go for! The current code is doing a simple thing:
>>>>> select the Nth element using qsort() after scanning each WAL sender's
>>>>> values. And I think that Sawada-san got it right. Even running on my
>>>>> laptop a pgbench run with 10 sync standbys using a data set that fits
>>>>> into memory, SyncRepGetOldestSyncRecPtr gets at most 0.04% of overhead
>>>>> using perf top on a non-assert, non-debug build. Hash tables and
>>>>> allocations get a far larger share. Using the patch,
>>>>> SyncRepGetSyncRecPtr is at the same level with a quorum set of 10
>>>>> nodes. Let's kick the ball for now. An extra patch could make things
>>>>> better later on if that's worth it.
>>>>
>>>> Yeah, since the both K and N could be not large these algorithm takes
>>>> almost the same time. And current patch does simple thing. When we
>>>> need over 100 or 1000 replication node the optimization could be
>>>> required.
>>>> Attached latest v9 patch.
>>>>
>>>
>>> Few comments:
>>
>> Thank you for reviewing.
>>
>>> + * In 10.0 we support two synchronization methods, priority and
>>> + * quorum. The number of synchronous standbys that transactions
>>> + * must wait for replies from and synchronization method are specified
>>> + * in synchronous_standby_names. This parameter also specifies a list
>>> + * of standby names, which determines the priority of each standby for
>>> + * being chosen as a synchronous standby. In priority method, the standbys
>>> + * whose names appear earlier in the list are given higher priority
>>> + * and will be considered as synchronous. Other standby servers appearing
>>> + * later in this list represent potential synchronous standbys. If any of
>>> + * the current synchronous standbys disconnects for whatever reason,
>>> + * it will be replaced immediately with the next-highest-priority standby.
>>> + * In quorum method, the all standbys appearing in the list are
>>> + * considered as a candidate for quorum commit.
>>>
>>> In the above description, is priority method represented by FIRST and
>>> quorum method by ANY in the synchronous_standby_names syntax? If so,
>>> it might be better to write about it explicitly.
>>
>> Added description.
>>
>>>
>>> 2.
>>> --- a/src/backend/utils/sort/tuplesort.c
>>> +++ b/src/backend/utils/sort/tuplesort.c
>>> @@ -2637,16 +2637,8 @@ mergeruns(Tuplesortstate *state)
>>> }
>>>
>>> /*
>>> - * Allocate a new 'memtuples' array, for the heap. It will hold one tuple
>>> - * from each input tape.
>>> - */
>>> - state->memtupsize = numInputTapes;
>>> - state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
>>> - USEMEM(state, GetMemoryChunkSpace(state->memtuples));
>>> -
>>> - /*
>>> - * Use all the remaining memory we have available for read buffers among
>>> - * the input tapes.
>>> + * Use all the spare memory we have available for read buffers among the
>>> + * input tapes.
>>>
>>> This doesn't belong to this patch.
>>
>> Oops, fixed.
>>
>>> 3.
>>> + * Return the list of sync standbys using according to synchronous method,
>>>
>>> In above sentence, "using according" seems to either incomplete or wrong usage.
>>
>> Fixed.
>>
>>> 4.
>>> + else
>>> + ereport(ERROR,
>>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>>> + "invalid synchronization method is specified \"%d\"",
>>> + SyncRepConfig->sync_method));
>>>
>>> Here, the error message doesn't seem to aligned and you might want to
>>> use errmsg for the same.
>>>
>>
>> Fixed.
>>
>>> 5.
>>> + * In priority method, we need the oldest these positions among sync
>>> + * standbys. In quorum method, we need the newest these positions
>>> + * specified by SyncRepConfig->num_sync.
>>>
>>> /oldest these/oldest of these
>>> /newest these positions specified/newest of these positions as specified
>>
>> Fixed.
>>
>>> Instead of newest, can we consider to use latest?
>>
>> Yeah, I changed it so.
>>
>>
>>> 6.
>>> + int sync_method; /* synchronization method */
>>> /* member_names contains nmembers consecutive nul-terminated C strings */
>>> char member_names[FLEXIBLE_ARRAY_MEMBER];
>>> } SyncRepConfigData;
>>>
>>> Can't we use 1 or 2 bytes to store sync_method information?
>>
>> I changed it to uint8.
>>
>> Attached latest v10 patch incorporated the review comments so far.
>> Please review it.
>
> Thanks for updating the patch!
>
> Do we need to update postgresql.conf.sample?

Added description to postgresql.conf.sample.

> +{any_ident} {
> + yylval.str = pstrdup(yytext);
> + return ANY;
> + }
> +{first_ident} {
> + yylval.str = pstrdup(yytext);
> + return FIRST;
> + }
>
> Why is pstrdup(yytext) necessary here?

The first whole line was unnecessary actually. Removed.

> + standby_list { $$ =
> create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
> + | NUM '(' standby_list ')' { $$ =
> create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
> + | ANY NUM '(' standby_list ')' { $$ =
> create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
> + | FIRST NUM '(' standby_list ')' { $$ =
> create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
>
> Isn't this "partial" backward-compatibility (i.e., "NUM (list)" works
> differently from curent version while "list" works in the same way as
> current one) very confusing?
>
> I prefer to either of
>
> 1. break the backward-compatibility, i.e., treat the first syntax of
> "standby_list" as quorum commit
> 2. keep the backward-compatibility, i.e., treat the second syntax of
> "NUM (standby_list)" as sync rep with the priority

There were some comments when I proposed the quorum commit. If we do
#1 it breaks the backward-compatibility with 9.5 or before as well. I
don't think it's a good idea. On the other hand, if we do #2 then the
behaviour of s_s_name is 'NUM (standby_list)' == 'FIRST NUM
(standby_list)''. But it would not what most of user will want and
would confuse the users of future version who will want to use the
quorum commit. Since many hackers thought that the sensible default
behaviour is 'NUM (standby_list)' == 'ANY NUM (standby_list)' the
current patch chose to changes the behaviour of s_s_names and document
that changes thoroughly.

> + <literal>pg_stat_replication</></link> view). If the keyword
> + <literal>FIRST</> is specified, other standby servers appearing
> + later in this list represent potential synchronous standbys.
> + If any of the current synchronous standbys disconnects for
> + whatever reason, it will be replaced immediately with the
> + next-highest-priority standby. Specifying more than one standby
> + name can allow very high availability.
>
> It seems strange to explain the behavior of FIRST before explaining
> the syntax of synchronous_standby_names and FIRST.
>

Updated document.

Attached latest v11 patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
000_quorum_commit_v11.patch application/octet-stream 27.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2016-12-12 16:26:32 Re: exposing wait events for non-backends (was: Tracking wait event for latches)
Previous Message Robert Haas 2016-12-12 16:19:21 Re: exposing wait events for non-backends (was: Tracking wait event for latches)