From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
Date: 2018-06-05 04:22:58
Message-ID: CAD21AoA7rvsxLuWD47m7647G6ie+SDpJY0kHeNqv+w1dnV1bzw@mail.gmail.com
Lists: pgsql-hackers
On Mon, Jun 4, 2018 at 10:47 PM, Konstantin Knizhnik
<k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
>
>
> On 26.04.2018 09:10, Masahiko Sawada wrote:
>>
>> On Thu, Apr 26, 2018 at 3:30 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
>> wrote:
>>>
>>> On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
>>> wrote:
>>>>
>>>> Never mind. There were a lot of items, especially at the last CommitFest.
>>>>
>>>>> In terms of moving forward, I'd still like to hear what
>>>>> Andres has to say about the comments I made on March 1st.
>>>>
>>>> Yeah, agreed.
>>>
>>> $ ping -n andres.freund
>>> Request timeout for icmp_seq 0
>>> Request timeout for icmp_seq 1
>>> Request timeout for icmp_seq 2
>>> Request timeout for icmp_seq 3
>>> Request timeout for icmp_seq 4
>>> ^C
>>> --- andres.freund ping statistics ---
>>> 6 packets transmitted, 0 packets received, 100.0% packet loss
>>>
>>> Meanwhile,
>>> https://www.postgresql.org/message-id/4c171ffe-e3ee-acc5-9066-a40d52bc5ae9@postgrespro.ru
>>> shows that this patch has some benefits for other cases, which is a
>>> point in favor IMHO.
>>
>> Thank you for sharing. That's good to know.
>>
>> Andres pointed out the performance degradation due to hash collisions
>> when multiple relations are being loaded concurrently. I think the point
>> is that it happens where users cannot see it. Therefore, even if we make
>> N_RELEXTLOCK_ENTS a configurable parameter, users cannot tell when hash
>> collisions are occurring, so they don't know when they should tune it.
>>
>> So it's just an idea, but how about adding an SQL-callable function
>> that returns the estimated number of lock waiters for a given
>> relation? Since the user knows how many processes are loading into the
>> relation, if the value returned by the function is greater than the
>> expected one, the user can detect a hash collision and start
>> considering an increase of N_RELEXTLOCK_ENTS.
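To make the idea above concrete, such a check could be exposed as an
ordinary C-language SQL-callable function that reads the waiter count from
the relext lock slot the relation hashes to. The sketch below is only an
illustration; the function name pg_relext_lock_waiters(), the slot hashing,
and the state-word layout are assumptions made for the example, not code
taken from the v13 patch.

#include "postgres.h"
#include "fmgr.h"
#include "port/atomics.h"

#define N_RELEXTLOCK_ENTS      1024
#define RELEXT_WAIT_COUNT_MASK ((uint32) ((1 << 24) - 1))

typedef struct RelExtLock
{
    pg_atomic_uint32 state;        /* lock bit | waiter count */
} RelExtLock;

/* shared-memory array of relext lock slots, assumed created by the patch */
extern RelExtLock *RelExtLockArray;

PG_FUNCTION_INFO_V1(pg_relext_lock_waiters);

/* pg_relext_lock_waiters(relid oid) returns integer */
Datum
pg_relext_lock_waiters(PG_FUNCTION_ARGS)
{
    Oid         relid = PG_GETARG_OID(0);
    uint32      slot = relid % N_RELEXTLOCK_ENTS;   /* simplistic slot hash */
    uint32      state;

    state = pg_atomic_read_u32(&RelExtLockArray[slot].state);

    /* in this sketch the low bits of the state word count the waiters */
    PG_RETURN_INT32((int32) (state & RELEXT_WAIT_COUNT_MASK));
}

Usage would then be something like SELECT pg_relext_lock_waiters('my_table'::regclass);
if the result stays clearly above the number of sessions known to be
inserting into the table, that suggests a collision and a reason to raise
N_RELEXTLOCK_ENTS.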
>>
>> Regards,
>>
>> --
>> Masahiko Sawada
>> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
>> NTT Open Source Software Center
>>
> We at PostgresPro were faced with the relation extension lock contention
> problem at two more customers and tried to use this patch (v13) to address
> the issue. Unfortunately, replacing the heavyweight lock with an lwlock
> could not completely eliminate the contention; now most backends are
> blocked on a condition variable:
>
> 0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #0 0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1 0x00000000007024ee in WaitEventSetWait ()
> #2 0x0000000000718fa6 in ConditionVariableSleep ()
> #3 0x000000000071954d in RelExtLockAcquire ()
> #4 0x00000000004ba99d in RelationGetBufferForTuple ()
> #5 0x00000000004b3f18 in heap_insert ()
> #6 0x00000000006109c8 in ExecInsert ()
> #7 0x0000000000611a49 in ExecModifyTable ()
> #8 0x00000000005ef97a in standard_ExecutorRun ()
> #9 0x000000000072440a in ProcessQuery ()
> #10 0x0000000000724631 in PortalRunMulti ()
> #11 0x00000000007250ec in PortalRun ()
> #12 0x0000000000721287 in exec_simple_query ()
> #13 0x0000000000722532 in PostgresMain ()
> #14 0x000000000047a9eb in ServerLoop ()
> #15 0x00000000006b9fe9 in PostmasterMain ()
> #16 0x000000000047b431 in main ()
>
> Obviously there is nothing surprising here: if a lot of processes try to
> acquire the same exclusive lock, then high contention is expected.
> I just want to note that this patch is not able to completely eliminate
> the problem with a large number of concurrent inserts into the same table.
>
> The second problem we observed was even more critical: if a backend is
> granted the relation extension lock and then hits an error before
> releasing it, the abort of the current transaction does not release the
> lock (unlike a heavyweight lock) and the relation stays locked.
> So the database is effectively stalled and the server has to be restarted.
>
Thank you for reporting.

Regarding the second problem, I tried to reproduce that bug with the
latest version of the patch (v13) but could not. When a transaction
aborts, we call
ResourceOwnerRelease()->ResourceOwnerReleaseInternal()->ProcReleaseLocks()->RelExtLockCleanup()
and clear the relext lock bits we are holding or waiting for. If we
raised an error after adding a relext lock bit but before incrementing
its holding count, the relext lock would remain, but I couldn't find any
code that raises an error between those two points. Could you please
share concrete reproduction steps for the database stall, if possible?
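
To make that description concrete, below is a minimal sketch of the cleanup
path as I understand it. The state-word layout, the held_relextlock
bookkeeping, and constants such as RELEXT_LOCK_BIT are simplified
placeholders for illustration, not the exact v13 code:

#include "postgres.h"
#include "port/atomics.h"
#include "storage/condition_variable.h"

#define RELEXT_LOCK_BIT        ((uint32) 1 << 25)
#define RELEXT_WAIT_COUNT_ONE  ((uint32) 1)

typedef struct RelExtLock
{
    pg_atomic_uint32  state;       /* lock bit plus waiter count */
    ConditionVariable cv;          /* waiters sleep on this */
} RelExtLock;

/* per-backend record of the relext lock slot we hold or are waiting for */
static struct
{
    RelExtLock *lock;              /* NULL when we are not using any slot */
    int         nLocks;            /* > 0 once the lock bit is held by us */
    bool        waiting;           /* true while sleeping on the CV */
} held_relextlock;

/*
 * Called from ProcReleaseLocks() at transaction abort: release any relext
 * lock bit we still hold, or back out of a pending wait, so an error raised
 * between acquisition and release cannot leave the slot locked.
 */
void
RelExtLockCleanup(void)
{
    if (held_relextlock.lock == NULL)
        return;

    if (held_relextlock.nLocks > 0)
    {
        /* we errored out while holding the lock: release and wake waiters */
        pg_atomic_fetch_and_u32(&held_relextlock.lock->state,
                                ~RELEXT_LOCK_BIT);
        ConditionVariableBroadcast(&held_relextlock.lock->cv);
        held_relextlock.nLocks = 0;
    }
    else if (held_relextlock.waiting)
    {
        /* we errored out while waiting: just drop our waiter count */
        pg_atomic_fetch_sub_u32(&held_relextlock.lock->state,
                                RELEXT_WAIT_COUNT_ONE);
        held_relextlock.waiting = false;
    }

    held_relextlock.lock = NULL;
}

The point is that RelExtLockCleanup() is reached from the resource-owner
release path on every abort, so as far as I can see an error raised while
holding or waiting for the lock should not leave the slot locked.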
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center