From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: bossartn(at)amazon(dot)com
Cc: x4mmm(at)yandex-team(dot)ru, a(dot)lubennikova(at)postgrespro(dot)ru, hlinnaka(at)iki(dot)fi, matsumura(dot)ryo(at)fujitsu(dot)com, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: archive status ".ready" files may be created too early
Date: 2021-01-27 02:35:00
Message-ID: 20210127.113500.1720002622159938202.horikyota.ntt@gmail.com
Lists: pgsql-hackers
At Tue, 26 Jan 2021 19:13:57 +0000, "Bossart, Nathan" <bossartn(at)amazon(dot)com> wrote in
> On 12/17/20, 9:15 PM, "Kyotaro Horiguchi" <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > At Thu, 17 Dec 2020 22:20:35 +0000, "Bossart, Nathan" <bossartn(at)amazon(dot)com> wrote in
> >> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> >> > You're right in that regard. There's a window where partial record is
> >> > written when write location passes F0 after insertion location passes
> >> > F1. However, remembering all spanning records seems overkilling to me.
> >>
> >> I'm curious why you feel that recording all cross-segment records is
> >> overkill. IMO it seems far simpler to just do that rather than try to
> >
> > Sorry, my words are not enough. Remembering all spanning records in
> > *shared memory* seems to be overkilling. Much more if it is stored in
> > shared hash table. Even though it rarely the case, it can fail hard
> > way when reaching the limit. If we could do well by remembering just
> > two locations, we wouldn't need to worry about such a limitation.
>
> I don't think it will fail if we reach max_size for the hash table.
> The comment above ShmemInitHash() has this note:
>
> * max_size is the estimated maximum number of hashtable entries. This is
> * not a hard limit, but the access efficiency will degrade if it is
> * exceeded substantially (since it's used to compute directory size and
> * the hash table buckets will get overfull).
That description means that a shared hash table has a fixed-size directory, so there can be synonyms, which causes the degradation. Even though the specified number of buckets is preallocated, since the minimum directory size is 256, at least 256 buckets end up being allocated in the long run. The minimum on-the-fly allocation size is 32. I haven't calculated it precisely, but I'm worried about the amount of spare shared memory the hash table can allocate.
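For what it's worth, a minimal sketch of how such a shared hash would typically be sized and created with the standard dynahash API, which is where the directory and bucket allocations above come from. This is not taken from the patch; the entry struct, table name, and the entry count of 16 are made up for illustration.

#include "postgres.h"

#include "access/xlogdefs.h"
#include "storage/shmem.h"
#include "utils/hsearch.h"

/* Hypothetical entry: one per segment crossed by an in-progress record. */
typedef struct RecordBoundaryEntry
{
	XLogSegNo	seg;			/* hash key */
	XLogRecPtr	endptr;			/* end of the record spanning this segment */
} RecordBoundaryEntry;

static HTAB *RecordBoundaryHash = NULL;

Size
RecordBoundaryShmemSize(void)
{
	/* dynahash reserves shared memory based on the estimated entry count */
	return hash_estimate_size(16, sizeof(RecordBoundaryEntry));
}

void
RecordBoundaryShmemInit(void)
{
	HASHCTL		info;

	memset(&info, 0, sizeof(info));
	info.keysize = sizeof(XLogSegNo);
	info.entrysize = sizeof(RecordBoundaryEntry);

	/*
	 * init_size == max_size == 16, but dynahash still rounds the directory
	 * and bucket allocations up internally, which is the concern above.
	 */
	RecordBoundaryHash = ShmemInitHash("Record Boundary Hash",
									   16, 16,
									   &info,
									   HASH_ELEM | HASH_BLOBS);
}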
> > Another concern about the concrete patch:
> >
> > NotifySegmentsReadyForArchive() searches the shared hash, acquiring an
> > LWLock, every time XLogWrite is called while segment archiving is being
> > held off. I don't think that is acceptable, and I think it could be a
> > problem when many backends are competing on WAL.
>
> This is a fair point. I did some benchmarking with a few hundred
> connections all doing writes, and I was not able to discern any
> noticeable performance impact. My guess is that contention on this
> new lock is unlikely because callers of XLogWrite() must already hold
> WALWriteLock. Otherwise, I believe we only acquire ArchNotifyLock no
> more than once per segment to record new record boundaries.
Thanks. I agree that reader-reader contention is not a problem due to the existing serialization by WALWriteLock. Adding an entry happens only at a segment boundary, so ArchNotifyLock doesn't seem to be a problem either.

However, the function prolongs the WALWriteLock section. Couldn't we somehow move the call to NotifySegmentsReadyForArchive() in XLogWrite() out of the WALWriteLock section?
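Something like the following, just to illustrate the idea (untested; the signature of NotifySegmentsReadyForArchive() is assumed from the patch, and the surrounding logic is heavily simplified):

#include "postgres.h"

#include "access/xlogdefs.h"
#include "storage/lwlock.h"

/* From the patch under discussion; the exact signature is assumed here. */
extern void NotifySegmentsReadyForArchive(XLogRecPtr flushed);

/*
 * Sketch: the caller remembers that a notification may be needed and does
 * the shared-hash scan only after WALWriteLock has been released, so the
 * ArchNotifyLock acquisition no longer lengthens the critical section.
 */
void
FlushAndNotifySketch(XLogRecPtr flushTo)
{
	XLogRecPtr	flushed;

	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);

	/* ... the existing XLogWrite() work would happen here ... */
	flushed = flushTo;			/* pretend everything up to flushTo is synced */

	LWLockRelease(WALWriteLock);

	/* Notification happens outside the WALWriteLock section. */
	NotifySegmentsReadyForArchive(flushed);
}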
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center