Re: BufferAlloc: don't take two simultaneous locks

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: robertmhaas(at)gmail(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, y(dot)sokolov(at)postgrespro(dot)ru, pgsql-hackers(at)lists(dot)postgresql(dot)org, michail(dot)nikolaev(at)gmail(dot)com, x4mmm(at)yandex-team(dot)ru, andres(at)anarazel(dot)de, simon(dot)riggs(at)enterprisedb(dot)com
Subject: Re: BufferAlloc: don't take two simultaneous locks
Date: 2022-04-15 08:29:13
Message-ID: 20220415.172913.833574004555290913.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Thu, 14 Apr 2022 11:02:33 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in
> It seems to me that whatever hazards exist must come from the fact
> that the operation is no longer fully atomic. The existing code
> acquires every relevant lock, then does the work, then releases locks.
> Ergo, we don't have to worry about concurrency because there basically
> can't be any. Stuff could be happening at the same time in other
> partitions that are entirely unrelated to what we're doing, but at the
> time we touch the two partitions we care about, we're the only one
> touching them. Now, if we do as proposed here, we will acquire one
> lock, release it, and then take the other lock, and that means that
> some operations could overlap that can't overlap today. Whatever gets
> broken must get broken because of that possible overlapping, because
> in the absence of concurrency, the end state is the same either way.
>
> So ... how could things get broken by having these operations overlap
> each other? The possibility that we might run out of buffer mapping
> entries is one concern. I guess there's also the question of whether
> the collision handling is adequate: if we fail due to a collision and
> handle that by putting the buffer on the free list, is that OK? And
> what if we fail midway through and the buffer doesn't end up either on
> the free list or in the buffer mapping table? I think maybe that's
> impossible, but I'm not 100% sure that it's impossible, and I'm not
> sure how bad it would be if it did happen. A permanent "leak" of a
> buffer that resulted in it becoming permanently unusable would be bad,

The patch removes the buftable entry first, and the buffer is then either
inserted again or returned to the freelist. I don't see how it could end
up in both the buftable and the freelist. What kind of trouble do you
have in mind, for example? Even if some underlying function raised an
ERROR, the result wouldn't differ from the current code (from a quick
look, only WARNING or PANIC seem possible there). Perhaps, to convince
ourselves that this works, we need to make sure the victim buffer is
truly isolated. That is described as follows:

* We are single pinner, we hold buffer header lock and exclusive
* partition lock (if tag is valid). It means no other process can inspect
* it at the moment.
*
* But we will release partition lock and buffer header lock. We must be
* sure other backend will not use this buffer until we reuse it for new
* tag. Therefore, we clear out the buffer's tag and flags and remove it
* from buffer table. Also buffer remains pinned to ensure
* StrategyGetBuffer will not try to reuse the buffer concurrently.
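
To make that ordering concrete, here is a minimal toy model of the
sequence (all names are invented for illustration; this is not the
patch's bufmgr.c code, and the locking is only noted in comments):

/*
 * Toy model of the ordering discussed above: the victim's buffer-table
 * entry is deleted first, and the buffer is then either re-inserted
 * under the new tag or pushed back to the freelist on a collision --
 * never both.
 */
#include <stdbool.h>
#include <stdio.h>

#define NTAGS 8

typedef struct ToyBuf
{
	int		tag;			/* -1 means "no valid tag" */
	int		pins;			/* we keep one pin through the window */
	bool	on_freelist;
} ToyBuf;

static int	buftable[NTAGS];	/* tag -> buffer id, -1 if absent */

static void
table_delete(int tag)
{
	buftable[tag] = -1;
}

/* returns false on collision (someone else inserted the tag first) */
static bool
table_insert(int tag, int buf_id)
{
	if (buftable[tag] != -1)
		return false;
	buftable[tag] = buf_id;
	return true;
}

/* Reuse the pinned victim (buffer id 0) for new_tag. */
static void
reuse_victim(ToyBuf *victim, int new_tag, bool collide)
{
	/* old partition lock and buffer header lock are held here */
	if (victim->tag != -1)
		table_delete(victim->tag);
	victim->tag = -1;			/* unreachable via the table; the pin
								 * keeps the clock sweep away */

	/* ... locks released, new partition lock acquired ... */

	if (!collide && table_insert(new_tag, 0))
		victim->tag = new_tag;
	else
		victim->on_freelist = true; /* collision: back to the freelist,
									 * never left in both places */
}

int
main(void)
{
	for (int i = 0; i < NTAGS; i++)
		buftable[i] = -1;

	ToyBuf		victim = {.tag = 3, .pins = 1, .on_freelist = false};

	buftable[3] = 0;
	reuse_victim(&victim, 5, false);
	printf("no collision: tag=%d on_freelist=%d\n", victim.tag, victim.on_freelist);

	ToyBuf		victim2 = {.tag = -1, .pins = 1, .on_freelist = false};

	reuse_victim(&victim2, 5, true);
	printf("collision:    tag=%d on_freelist=%d\n", victim2.tag, victim2.on_freelist);
	return 0;
}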

> for sure. But all of these issues seem relatively possible to avoid
> with sufficiently good coding. My intuition is that the buffer mapping
> table size limit is the nastiest of the problems, and if that's

I believe that no additional entries are required in the buftable even
so. The reason for the expansion is explained as follows:

At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote in
> First I found if table size is strictly limited to NBuffers and FIXED,
> then under high concurrency get_hash_entry may not find free entry
> despite it must be there. It seems while process scans free lists, other

The freelist starvation is caused by the almost single-directional
inter-freelist migration that this patch introduces. So the expansion is
not needed if we accept the slowdown (I'm not sure how large it is)
caused by walking through all the freelists. The inter-freelist
migration would stop if we pulled the HASH_REUSE feature out of dynahash.
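
For illustration, a toy sketch of the per-freelist scheme being
discussed, with the fallback scan over the other freelists (names and
structure are invented; this is not dynahash code, and the starvation
race can only be described in comments in a single-threaded demo):

/*
 * A backend tries "its" freelist first and falls back to scanning the
 * others.  The starvation case is that entries keep migrating between
 * lists while the scan is in flight, so the scan can come up empty even
 * though free entries exist in total.
 */
#include <pthread.h>
#include <stdio.h>

#define NUM_LISTS 32

typedef struct ToyFreeList
{
	pthread_mutex_t mutex;
	int		nfree;			/* stand-in for the real chain of free entries */
} ToyFreeList;

static ToyFreeList freelists[NUM_LISTS];

/* Try the caller's freelist first, then borrow from the others. */
static int
get_free_entry(int my_list)
{
	for (int i = 0; i < NUM_LISTS; i++)
	{
		int			idx = (my_list + i) % NUM_LISTS;
		int			got = 0;

		pthread_mutex_lock(&freelists[idx].mutex);
		if (freelists[idx].nfree > 0)
		{
			freelists[idx].nfree--; /* "unlink" one entry */
			got = 1;
		}
		pthread_mutex_unlock(&freelists[idx].mutex);

		if (got)
			return idx;			/* which list we borrowed from */

		/*
		 * Between iterations another backend may free an entry onto a
		 * list we have already passed; if entries keep flowing in one
		 * direction, this scan can fail even though the table is not
		 * really full -- hence the proposal to over-allocate entries.
		 */
	}
	return -1;					/* scan found nothing */
}

int
main(void)
{
	for (int i = 0; i < NUM_LISTS; i++)
	{
		pthread_mutex_init(&freelists[i].mutex, NULL);
		freelists[i].nfree = (i == 7) ? 1 : 0;	/* only one free entry left */
	}

	printf("borrowed from list %d\n", get_free_entry(0));
	printf("second attempt: %d (no free entries)\n", get_free_entry(0));
	return 0;
}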

> resolvable then I'm not sure what else could be a hard blocker. I'm
> not saying there isn't anything, just that I don't know what it might
> be.
>
> To put all this another way, suppose that we threw out the way we do
> buffer allocation today and always allocated from the freelist. If the
> freelist is found to be empty, the backend wanting a buffer has to do
> some kind of clock sweep to populate the freelist with >=1 buffers,
> and then try again. I don't think that would be performant or fair,
> because it would probably happen frequently that a buffer some backend
> had just added to the free list got stolen by some other backend, but
> I think it would be safe, because we already put buffers on the
> freelist when relations or databases are dropped, and we allocate from
> there just fine in that case. So then why isn't this safe? It's
> functionally the same thing, except we (usually) skip over the
> intermediate step of putting the buffer on the freelist and taking it
> off again.

So, would this make progress if someone (maybe Yura?) ran a benchmark of
this method?
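
For reference, here is a self-contained toy model of that allocation
loop (invented names and a trivial clock sweep, just to make the control
flow being proposed for benchmarking concrete; not bufmgr.c code):

/*
 * Allocate only from the freelist; when it is empty, run the clock
 * sweep until at least one evictable buffer lands on the freelist, then
 * retry.  Another backend could steal the buffer in between, hence the
 * outer loop.
 */
#include <stdbool.h>
#include <stdio.h>

#define TOY_NBUFFERS 16

typedef struct ToyBuf
{
	int		id;
	int		usage_count;
	bool	on_freelist;
} ToyBuf;

static ToyBuf buffers[TOY_NBUFFERS];
static int	freelist[TOY_NBUFFERS];
static int	freelist_len = 0;
static int	clock_hand = 0;

static ToyBuf *
freelist_pop(void)
{
	if (freelist_len == 0)
		return NULL;

	ToyBuf	   *buf = &buffers[freelist[--freelist_len]];

	buf->on_freelist = false;
	return buf;
}

static void
freelist_push(ToyBuf *buf)
{
	buf->on_freelist = true;
	freelist[freelist_len++] = buf->id;
}

/* Sweep until some buffer's usage count reaches zero. */
static ToyBuf *
clock_sweep_next(void)
{
	for (;;)
	{
		ToyBuf	   *buf = &buffers[clock_hand];

		clock_hand = (clock_hand + 1) % TOY_NBUFFERS;
		if (buf->on_freelist)
			continue;
		if (buf->usage_count > 0)
			buf->usage_count--;
		else
			return buf;
	}
}

/* Freelist-only allocation: pop, or sweep one buffer in and retry. */
static ToyBuf *
get_victim_buffer(void)
{
	for (;;)
	{
		ToyBuf	   *buf = freelist_pop();

		if (buf != NULL)
			return buf;
		freelist_push(clock_sweep_next());
	}
}

int
main(void)
{
	for (int i = 0; i < TOY_NBUFFERS; i++)
		buffers[i] = (ToyBuf) {.id = i, .usage_count = 2};

	printf("victim buffer id = %d\n", get_victim_buffer()->id);
	return 0;
}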

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
