| From: | Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> |
|---|---|
| To: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
| Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org, michail(dot)nikolaev(at)gmail(dot)com, x4mmm(at)yandex-team(dot)ru, andres(at)anarazel(dot)de, simon(dot)riggs(at)enterprisedb(dot)com |
| Subject: | Re: BufferAlloc: don't take two simultaneous locks |
| Date: | 2022-03-14 11:57:38 |
| Message-ID: | 8a8c141222e372358a601dd4e304a96f206b92fb.camel@postgrespro.ru |
| Lists: | pgsql-hackers |

On Mon, 14/03/2022 at 17:12 +0900, Kyotaro Horiguchi wrote:
> At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote in
> > On Mon, 14/03/2022 at 14:31 +0900, Kyotaro Horiguchi wrote:
> > > I'd like to ask you to remove nalloced from the partitions and add a
> > > global atomic for the same purpose?
> >
> > I really believe it should be global. I made it per-partition so as not
> > to overcomplicate the first versions. Glad you brought it up.
> >
> > I thought of protecting it with freeList[0].mutex, but an atomic is
> > probably a better idea here. But which atomic to choose: uint64 or
> > uint32? Based on sizeof(long)?
> > Ok, I'll do it in the next version.
>
> The current nentries is a long (= int64 on CentOS). And a uint32 could
> support roughly 2^32 * 8192 = 32TB of shared buffers, which doesn't seem
> safe enough. So it would be uint64.
>
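
For the record, what I have in mind for the global counter is roughly the
following (a sketch only; the field name and exact call sites are mine, not
the final patch):

```c
/* dynahash.c sketch -- one global counter instead of per-partition ones */

/* in HASHHDR, replacing the per-partition nalloced: */
pg_atomic_uint64 nalloced;		/* total number of allocated elements */

/* in init_htab(), once the header is set up: */
pg_atomic_init_u64(&hctl->nalloced, 0);

/* in element_alloc(), after carving out nelem new elements: */
pg_atomic_fetch_add_u64(&hctl->nalloced, nelem);
```
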
> > The whole of get_hash_entry looks strange.
> > Wouldn't it be better to cycle through the partitions and only then go
> > to get_hash_entry?
> > Maybe there should be a bitmap of non-empty free lists? 32 bits for
> > 32 partitions. But wouldn't the bitmap become a contention point itself?
>
> The code puts significance on avoiding the contention caused by visiting
> the freelists of other partitions, and perhaps assumes that a freelist
> shortage rarely happens.
>
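
Just to illustrate the bitmap I meant, it would be as small as this (an
illustration only, not something I have measured; as I said, it may well
become a contention point of its own):

```c
/* sketch of a "bitmap of non-empty freelists", nothing more */

/* in HASHHDR: */
pg_atomic_uint32 nonempty_freelists;	/* bit i set => freeList[i] has entries */

/* after pushing an element onto freeList[idx]: */
pg_atomic_fetch_or_u32(&hctl->nonempty_freelists, (uint32) 1 << idx);

/* when a pop leaves freeList[idx] empty (this races with concurrent
 * pushes, which is part of the contention concern): */
pg_atomic_fetch_and_u32(&hctl->nonempty_freelists, ~((uint32) 1 << idx));

/* in get_hash_entry(), picking a freelist to borrow from: */
uint32		mask = pg_atomic_read_u32(&hctl->nonempty_freelists);

if (mask != 0)
	borrow_from_idx = pg_rightmost_one_pos32(mask);	/* pg_bitutils.h */
```
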
> I tried pgbench runs at scale 100 (with 10 threads, 10 clients) on
> 128kB shared buffers, and I saw that get_hash_entry never takes the
> !element_alloc() path and always allocates a fresh entry, then
> saturates at 30 new elements allocated around the middle of a
> 100-second run.
>
> Then I tried the same with the patch, and I was surprised to see that
> the rise in the number of newly allocated elements didn't stop and
> went up to 511 elements after the 100-second run. So I found that my
> concern was valid: the change in dynahash actually causes a continual,
> repeated shortage of free list entries. I'm not sure how much impact
> there would be on performance if we changed get_hash_entry to prefer
> other freelists, though.

Well, it is quite strange that SharedBufHash is not allocated with
HASH_FIXED_SIZE. Could you check what happens with this flag set?
I'll try as well.
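
Concretely, I mean something like this in InitBufTable() (only the extra
flag; the rest is what buf_table.c already does):

```c
/* buf_table.c, InitBufTable() -- with the fixed-size flag added */
SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
							  size, size,
							  &info,
							  HASH_ELEM | HASH_BLOBS | HASH_PARTITION |
							  HASH_FIXED_SIZE);
```

As far as I can tell, with HASH_FIXED_SIZE set dynahash refuses to allocate
new elements after startup, so everything has to come out of the
preallocated freelists.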

Another way to reduce the observed effect is to remember the freelist_idx
of a reused entry. I didn't believe it would matter much, since entries
migrate regardless, but probably due to some hot buffers there is a
tendency to crowd a particular freelist.
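
To illustrate what I mean (a hypothetical sketch, not code from the patch):
the element could remember which freelist it was taken from, and the reuse
path would put it back there instead of into the freelist implied by the
new hashvalue.

```c
/* hypothetical sketch only -- names and layout are not from the patch */

/* hsearch.h: HASHELEMENT grows one field (it should fit into existing
 * padding on 64-bit builds) */
typedef struct HASHELEMENT
{
	struct HASHELEMENT *link;	/* link to next entry in same bucket */
	uint32		hashvalue;		/* hash function result for this entry */
	uint8		freelist_idx;	/* freelist this element was taken from */
} HASHELEMENT;

/* dynahash.c: when a reused element is put back, honor its origin
 * freelist instead of FREELIST_IDX(hctl, hashvalue) of the new tag
 * (the nentries accounting is left out of this sketch) */
static void
put_element_back(HASHHDR *hctl, HASHELEMENT *elem)
{
	int			idx = elem->freelist_idx;

	SpinLockAcquire(&hctl->freeList[idx].mutex);
	elem->link = hctl->freeList[idx].freeList;
	hctl->freeList[idx].freeList = elem;
	SpinLockRelease(&hctl->freeList[idx].mutex);
}
```
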
> By the way, there's the following comment in StrategyInitialize.
>
> > * Initialize the shared buffer lookup hashtable.
> > *
> > * Since we can't tolerate running out of lookup table entries, we must be
> > * sure to specify an adequate table size here. The maximum steady-state
> > * usage is of course NBuffers entries, but BufferAlloc() tries to insert
> > * a new entry before deleting the old. In principle this could be
> > * happening in each partition concurrently, so we could need as many as
> > * NBuffers + NUM_BUFFER_PARTITIONS entries.
> > */
> > InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
>
> "but BufferAlloc() tries to insert a new entry before deleting the
> old." gets false by this patch but still need that additional room for
> stashed entries. It seems like needing a fix.
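
If it helps, the comment could be reworded along these lines (a draft only;
I haven't settled on wording):

```c
/*
 * Initialize the shared buffer lookup hashtable.
 *
 * Since we can't tolerate running out of lookup table entries, we must be
 * sure to specify an adequate table size here.  The maximum steady-state
 * usage is of course NBuffers entries, but BufferAlloc() temporarily
 * stashes the entry it is about to reuse outside the table, so we still
 * need additional room: NBuffers + NUM_BUFFER_PARTITIONS entries.
 */
InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
```
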
>
>
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center