Re: Interval for launching the table sync worker

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Interval for launching the table sync worker
Date: 2017-04-14 15:35:56
Message-ID: CAD21AoAC44RCid7gnXdYWrHHDYLUbnQSWxk8HDjaeROEVGqeKg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Apr 14, 2017 at 9:14 PM, Petr Jelinek
<petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
> On 14/04/17 12:18, Masahiko Sawada wrote:
>> On Fri, Apr 14, 2017 at 7:09 AM, Petr Jelinek
>> <petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
>>> On 13/04/17 12:23, Masahiko Sawada wrote:
>>>> On Thu, Apr 13, 2017 at 11:53 AM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>>>> On Wed, Apr 12, 2017 at 11:46 PM, Peter Eisentraut
>>>>> <peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
>>>>>> On 4/12/17 00:48, Masahiko Sawada wrote:
>>>>>>> On Wed, Apr 12, 2017 at 1:28 PM, Peter Eisentraut
>>>>>>>> Perhaps instead of a global last_start_time, we store a per relation
>>>>>>>> last_start_time in SubscriptionRelState?
>>>>>>>
>>>>>>> I was thinking the same. But a problem is that the list of
>>>>>>> SubscriptionRelState is refreshed whenever the syncing table state
>>>>>>> becomes invalid (table_state_valid = false). I guess we need to
>>>>>>> improve these logic including GetSubscriptionNotReadyRelations().
>>>>>>
>>>>>> The table states are invalidated on a syscache callback from
>>>>>> pg_subscription_rel, which happens roughly speaking when a table
>>>>>> finishes the initial sync. So if we're worried about failing tablesync
>>>>>> workers relaunching to quickly, this would only be a problem if a
>>>>>> tablesync of another table finishes right in that restart window. That
>>>>>> doesn't seem a terrible issue to me.
>>>>>>
>>>>>
>>>>> I think the table states are invalidated whenever the table sync
>>>>> worker starts, because the table sync worker updates its status of
>>>>> pg_subscription_rel and commits it before starting actual copy. So we
>>>>> cannot rely on that. I thought we can store last_start_time into
>>>>> pg_subscription_rel but it might be overkill. I'm now thinking to
>>>>> change GetSubscriptionNotReadyRealtions so that last_start_time in
>>>>> SubscriptionRelState is taken over to new list.
>>>>>
>>>>
>>>> Attached the latest patch. It didn't actually necessary to change
>>>> GetSubscriptionNotReadyRelations. I just changed the logic refreshing
>>>> the sync table state list.
>>>> Please give me feedback.
>>>>
>>>
>>> Hmm this might work. Although I was actually wondering if we could store
>>> the last start timestamp in the worker shared memory and do some magic
>>> with that (ie, not clearing subid and relid and try to then do rate
>>> limiting based on subid+relid+timestamp stored in shmem). That would
>>> then work same way for the main apply workers as well. It would have the
>>> disadvantage that if some tables were consistently failing, no other
>>> tables could get synchronized as the amount of workers is limited.
>>
>> Hmm I guess that it's not a good design that a table sync worker and a
>> apply worker for a relation takes sole possession of a worker slot
>> until it successes. It would bother each other.
>>
>
> Not sure what you mean by apply worker for relation, I meant the main
> subscription apply worker, the "per relation apply worker" is same as
> tablesync worker.

Oops, I made a mistake. I meant just a apply worker.

>
> Thinking about the possible failure scenarios more, I guess you are
> right. Because if there is "global" event for the publisher/subscriber
> connection (ie network goes down or something) it will affect the main
> worker as well so it won't launch anything. If there is table specific
> issue it will affect only that table.
>
> The corner-cases with doing it per table where we launch tablesync
> needlessly are pretty much just a) connection limit on publisher
> reached, b) slot limit on publisher reached, that's probably okay. We
> might be able to catch these two specifically and do some global
> limiting based on those (something like your original patch except the
> start-time global variable would only be set on specific error and
> stored in shmem), but that does not need to be solved immediately.
>
> --
> Petr Jelinek http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2017-04-14 15:36:43 Re: Should pg_current_wal_location() become pg_current_wal_lsn()
Previous Message Bruce Momjian 2017-04-14 15:35:25 Re: [pgsql-www] Small issue in online devel documentation build