Re: Some problems of recovery conflict wait events

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Some problems of recovery conflict wait events
Date: 2020-03-09 15:57:52
Message-ID: 54d5571a-e15f-6360-2762-fde7c5c126c8@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020/03/09 14:10, Masahiko Sawada wrote:
> On Mon, 9 Mar 2020 at 13:24, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>
>>
>>
>> On 2020/03/08 13:52, Masahiko Sawada wrote:
>>> On Thu, 5 Mar 2020 at 20:16, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>>>
>>>>
>>>>
>>>> On 2020/03/05 16:58, Masahiko Sawada wrote:
>>>>> On Wed, 4 Mar 2020 at 15:21, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2020/03/04 14:31, Masahiko Sawada wrote:
>>>>>>> On Wed, 4 Mar 2020 at 13:48, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2020/03/04 13:27, Michael Paquier wrote:
>>>>>>>>> On Wed, Mar 04, 2020 at 01:13:19PM +0900, Masahiko Sawada wrote:
>>>>>>>>>> Yeah, so 0001 patch sets existing wait events to recovery conflict
>>>>>>>>>> resolution. For instance, it sets (PG_WAIT_LOCK | LOCKTAG_TRANSACTION)
>>>>>>>>>> to the recovery conflict on a snapshot. 0003 patch improves these wait
>>>>>>>>>> events by adding the new type of wait event such as
>>>>>>>>>> WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT. Therefore 0001 (and 0002) patch
>>>>>>>>>> is the fix for existing versions and 0003 patch is an improvement for
>>>>>>>>>> only PG13. Did you mean even 0001 patch doesn't fit for back-patching?
>>>>>>>>
>>>>>>>> Yes, it looks like a improvement rather than bug fix.
>>>>>>>>
>>>>>>>
>>>>>>> Okay, understand.
>>>>>>>
>>>>>>>>> I got my eyes on this patch set. The full patch set is in my opinion
>>>>>>>>> just a set of improvements, and not bug fixes, so I would refrain from
>>>>>>>>> back-backpatching.
>>>>>>>>
>>>>>>>> I think that the issue (i.e., "waiting" is reported twice needlessly
>>>>>>>> in PS display) that 0002 patch tries to fix is a bug. So it should be
>>>>>>>> fixed even in the back branches.
>>>>>>>
>>>>>>> So we need only two patches: one fixes process title issue and another
>>>>>>> improve wait event. I've attached updated patches.
>>>>>>
>>>>>> Thanks for updating the patches! I started reading 0001 patch.
>>>>>
>>>>> Thank you for reviewing this patch.
>>>>>
>>>>>>
>>>>>> - /*
>>>>>> - * Report via ps if we have been waiting for more than 500 msec
>>>>>> - * (should that be configurable?)
>>>>>> - */
>>>>>> - if (update_process_title && new_status == NULL &&
>>>>>> - TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
>>>>>> - 500))
>>>>>>
>>>>>> The patch changes ResolveRecoveryConflictWithSnapshot() and
>>>>>> ResolveRecoveryConflictWithTablespace() so that they always add
>>>>>> "waiting" into the PS display, whether wait is really necessary or not.
>>>>>> But isn't it better to display "waiting" in PS basically when wait is
>>>>>> necessary, like originally ResolveRecoveryConflictWithVirtualXIDs()
>>>>>> does as the above?
>>>>>
>>>>> You're right. Will fix it.
>>>>>
>>>>>>
>>>>>> ResolveRecoveryConflictWithDatabase(Oid dbid)
>>>>>> {
>>>>>> + char *new_status = NULL;
>>>>>> +
>>>>>> + /* Report via ps we are waiting */
>>>>>> + new_status = set_process_title_waiting();
>>>>>>
>>>>>> In ResolveRecoveryConflictWithDatabase(), there seems no need to
>>>>>> display "waiting" in PS because no wait occurs when recovery conflict
>>>>>> with database happens.
>>>>>
>>>>> Isn't the startup process waiting for other backend to terminate?
>>>>
>>>> Yeah, you're right. I agree that "waiting" should be reported in this case.
>>>>
>>>> Currently ResolveRecoveryConflictWithLock() and
>>>> ResolveRecoveryConflictWithBufferPin() don't call
>>>> ResolveRecoveryConflictWithVirtualXIDs and don't report "waiting"
>>>> in PS display. You changed them so that they report "waiting". I agree
>>>> to have this change. But this change is an improvement rather than
>>>> a bug fix, i.e., we should apply this change only in v13?
>>>>
>>>
>>> Did you mean ResolveRecoveryConflictWithDatabase and
>>> ResolveRecoveryConflictWithBufferPin?
>>
>> Yes! Sorry for my typo.
>>
>>> In the current code as far as I
>>> researched there are two cases where we don't add "waiting" and one
>>> case where we doubly add "waiting".
>>>
>>> ResolveRecoveryConflictWithDatabase and
>>> ResolveRecoveryConflictWithBufferPin don't update the ps title.
>>> Although the path where GetCurrentTimestamp() >= ltime is false in
>>> ResolveRecoveryConflictWithLock also doesn't update the ps title, it's
>>> already updated in WaitOnLock. On the other hand, the path where
>>> GetCurrentTimestamp() >= ltime is true in
>>> ResolveRecoveryConflictWithLock updates the ps title but it's wrong as
>>> I reported.
>>>
>>> I've split the patch into two patches: 0001 patch fixes the issue
>>> about doubly updating ps title, 0002 patch makes the recovery conflict
>>> resolution on database and buffer pin update the ps title.
>>
>> Thanks for splitting the patches. I think that 0001 patch can be back-patched.
>>
>> - /*
>> - * Report via ps if we have been waiting for more than 500 msec
>> - * (should that be configurable?)
>> - */
>> - if (update_process_title && new_status == NULL &&
>> - TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
>> - 500))
>>
>> Originally, "waiting" is reported in PS if we've been waiting for more than
>> 500 msec, as the above does. But you got rid of those codes in the patch.
>> Did you confirm that it's safe to do that? If not, isn't it better to apply
>> the attached patch?
>
> In WaitOnLock() we update the ps title regardless of waiting time. So
> I thought we can change it to make these behavior consistent. But
> considering back-patch, your patch looks better than mine.

Yeah, so I pushed the 0001 patch at first!
I will review the other patches later.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2020-03-09 15:59:28 Re: Atomics in localbuf.c
Previous Message Alvaro Herrera 2020-03-09 15:55:24 Re: pgsql: pageinspect: Fix types used for bt_metap() columns.