Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Hannu Krosing <hannuk(at)google(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2021-07-08 23:21:25
Message-ID: CAMT0RQTNq1qdw_7xV2PJ+DbQTMxZiKZ2e-GXqDNKm+MU-34Ybg@mail.gmail.com

On Fri, Jul 9, 2021 at 12:34 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
...
>
> I would say that 200 TIDs per leaf page is common and ~1350 TIDs per
> leaf page is not uncommon (with deduplication). Seems like that might
> be enough?

Likely yes, and it would also have the nice property of not changing
the index page locking behaviour.

Are the deduplicated TIDs in the leaf page already sorted in heap
order? That could potentially simplify / speed up the sort.
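
To illustrate: if each posting list's TIDs are already in heap order,
combining the runs from one leaf page is a single merge pass instead
of a full qsort. A rough standalone sketch (simplified TID type, not
actual nbtree code):

#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t block; uint16_t offset; } tid_t;

static inline int
tid_cmp(tid_t a, tid_t b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/* merge two presorted TID runs into dst; no qsort needed */
static size_t
tid_merge(const tid_t *a, size_t na,
          const tid_t *b, size_t nb, tid_t *dst)
{
    size_t i = 0, j = 0, k = 0;

    while (i < na && j < nb)
        dst[k++] = tid_cmp(a[i], b[j]) <= 0 ? a[i++] : b[j++];
    while (i < na)
        dst[k++] = a[i++];
    while (j < nb)
        dst[k++] = b[j++];
    return k;
}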

> > I have not measured anything yet, but one of my concerns in case of
> > very large dead tuple collections searched by 8-way parallel bsearch
> > could actually get close to saturating RAM bandwidth by reading (8 x
> > 32bits x cache-line-size) bytes from main memory every few cycles, so
> > we may need some inner-loop level throttling similar to current
> > vacuum_cost_limit for data pages.
>
> If it happens then it'll be a nice problem to have, I suppose.
>
> > Maybe not unrolling the full 32 loops for 32 bit bserach, but
> > something like 8-loop unroll for getting most of the benefit.
>
> My current assumption is that we're bound by memory speed right now,

Most likely yes, and this should also be easy to check by manually
unrolling perhaps 4 loop iterations and measuring any speed increase.
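
Something along these lines is what I have in mind - a branchless
bsearch step that can be peeled 4 iterations at a time. A
hypothetical sketch (the types and the unroll cutoff are
placeholders):

#include <stdint.h>
#include <stddef.h>

/* one branchless bsearch step: the compare becomes a cmov, and the
 * fixed step sequence is what makes manual unrolling possible;
 * expects base/len/key in scope */
#define BSTEP() \
    do { \
        size_t half = len / 2; \
        base += (base[half - 1] < key) ? half : 0; \
        len -= half; \
    } while (0)

/* returns the lower-bound position of key in a sorted array */
static size_t
bsearch32(const uint32_t *arr, size_t n, uint32_t key)
{
    const uint32_t *base = arr;
    size_t len = n;

    /* 4x unrolled main loop ... */
    while (len > 16)
    {
        BSTEP(); BSTEP(); BSTEP(); BSTEP();
    }
    /* ... plus a short tail */
    while (len > 1)
        BSTEP();

    return (size_t)(base - arr);
}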

> and that that is the big bottleneck to eliminate -- we must keep the
> CPU busy with data to process first. That seems like the most
> promising thing to focus on right now.

This actually has two parts:
- trying to make sure that we can do as much as possible from cache
- when we do have to go to main memory, trying to parallelise those
accesses as much as possible

At the same time we need to watch that we are not making the index
tuple preparation work so heavy that it starts to dominate over the
memory accesses.
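
For the second part, the kind of parallelising I mean is running
several searches in lockstep and prefetching each lane's next probe
before doing any of the compares, so the cache misses overlap instead
of serialising. A hypothetical sketch (NWAY and the flat uint32 key
array are placeholder assumptions; __builtin_prefetch is the
GCC/Clang builtin):

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define NWAY 8    /* placeholder lane count */

static void
bsearch32_batch(const uint32_t *arr, size_t n,
                const uint32_t *keys, size_t *out)
{
    const uint32_t *base[NWAY];
    size_t len[NWAY];
    bool more = true;

    for (int i = 0; i < NWAY; i++)
    {
        base[i] = arr;
        len[i] = n;
    }

    while (more)
    {
        more = false;
        /* issue all prefetches before touching any lane */
        for (int i = 0; i < NWAY; i++)
            if (len[i] > 1)
                __builtin_prefetch(&base[i][len[i] / 2 - 1]);
        /* then do one bsearch step per lane */
        for (int i = 0; i < NWAY; i++)
        {
            if (len[i] <= 1)
                continue;
            size_t half = len[i] / 2;
            base[i] += (base[i][half - 1] < keys[i]) ? half : 0;
            len[i] -= half;
            if (len[i] > 1)
                more = true;
        }
    }
    for (int i = 0; i < NWAY; i++)
        out[i] = (size_t)(base[i] - arr);
}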

> > While it may make sense to have different bitmap encodings for
> > different distributions, it likely would not be good for optimisations
> > if all these are used at the same time.
>
> To some degree designs like Roaring bitmaps are just that -- a way of
> dynamically figuring out which strategy to use based on data
> characteristics.

It is, but as I am keeping one eye on vectorisation, I don't like it
when different parts of the same bitmap have radically different
encoding strategies.

> > This is why I propose the first bitmap collecting phase to collect
> > into a file and then - when reading into memory for lookups phase -
> > possibly rewrite the initial structure to something else if it sees
> > that it is more efficient. Like for example where the first half of
> > the file consists of only empty pages.
>
> Yeah, I agree that something like that could make sense. Although
> rewriting it doesn't seem particularly promising,

Yeah, I hope to prove (or verify :) ) that the structure is good
enough that it does not need the rewrite.

> since we can easily
> make it cheap to process any TID that falls into a range of blocks
> that have no dead tuples.

I actually meant the opposite case, where we could replace a full
80-byte, 291-bit "all dead" bitmap with just a range - an int4 for
the page and two int2-s for the min and max tid-in-page - for an
extra 10x reduction, on top of the original ~21x reduction from the
current 6-bytes-per-TID encoding to my page_bsearch_vector bitmaps,
which encode one page in a maximum of 80 bytes (5 x int4 sub-page
pointers + 5 x int4 bitmaps).
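
In (hypothetical, simplified) C the two shapes would look roughly
like this - the field widths are just my reading of the description
above, not actual patch code:

#include <stdint.h>

/* dense case: one heap page's dead items, page_bsearch_vector style */
typedef struct
{
    uint32_t subptr[5];     /* int4 sub-page pointers */
    uint32_t subbitmap[5];  /* int4 sub-page bitmaps */
} PageBsearchVector;

/* "all dead" case: the same page collapsed to one 8-byte range */
typedef struct
{
    uint32_t blkno;     /* int4 heap page number */
    uint16_t min_off;   /* int2 lowest dead tid-in-page */
    uint16_t max_off;   /* int2 highest dead tid-in-page */
} DeadPageRange;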

I also started out by investigating RoaringBitmaps, but when I
realized that we would likely have to rewrite it anyway, I continued
working towards a single uniform encoding which fits most use cases
Good Enough, and then on using that uniformity to let the compiler do
its optimisation and hopefully also vectorisation magic.

> We don't need to rewrite the data structure
> to make it do that well, AFAICT.
>
> When I said that I thought of this a little like a hash join, I was
> being more serious than you might imagine. Note that the number of
> index tuples that VACUUM will delete from each index can now be far
> less than the total number of TIDs stored in memory. So even when we
> have (say) 20% of all of the TIDs from the table in our in memory list
> managed by vacuumlazy.c, it's now quite possible that VACUUM will only
> actually "match"/"join" (i.e. delete) as few as 2% of the index tuples
> it finds in the index (there really is no way to predict how many).
> The opportunistic deletion stuff could easily be doing most of the
> required cleanup in an eager fashion following recent improvements --
> VACUUM need only take care of "floating garbage" these days.

Ok, this points to the need to mainly optimise for a quite sparse
population of dead tuples, which is still mostly clustered page-wise?

> In other
> words, thinking about this as something that is a little bit like a
> hash join makes sense because hash joins do very well with high join
> selectivity, and high join selectivity is common in the real world.
> The intersection of TIDs from each leaf page with the in-memory TID
> delete structure will often be very small indeed.

The hard-to-optimize case is still when we have dead tuple counts in
the hundreds of millions, or even billions, like on an HTAP database
where a few hours of a running OLAP query have let loads of dead
tuples accumulate in tables getting heavy OLTP traffic.

There we could of course do a totally different optimisation, where
we also allow reaping tuples newer than the OLAP query's snapshot if
we can prove that when the snapshot next moves forward it has to jump
over said transactions, making them indeed DEAD and not RECENTLY
DEAD. Currently we let a single OLAP query ruin everything :)

> > Then again it may be so much extra work that it starts to dominate
> > some parts of profiles.
> >
> > For example see the work that was done in improving the mini-vacuum
> > part where it was actually faster to copy data out to a separate
> > buffer and then back in than shuffle it around inside the same 8k page
>
> Some of what I'm saying is based on the experience of improving
> similar code used by index tuple deletion in Postgres 14. That did
> quite a lot of sorting of TIDs and things like that. In the end the
> sorting had no more than a negligible impact on performance.

Good to know :)

> What
> really mattered was that we efficiently coordinate with distant heap
> pages that describe which index tuples we can delete from a given leaf
> page. Sorting hundreds of TIDs is cheap. Reading hundreds of random
> locations in memory (or even far fewer) is not so cheap. It might even
> be very slow indeed. Sorting in order to batch could end up looking
> like cheap insurance that we should be glad to pay for.

If the most expensive operation is sorting a few hundred TIDs, then
this should be fast enough.

My worries were more that after the sorting we cannot do simple index
lookups for them; each one instead needs to be found via bsearch - or
maybe even a plain linear search, if that is faster under some size
limit - and these lookups could add up. Or some other thing also has
to be done, like allocating extra memory or moving other data around
in a way that the CPU does not like.
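
Roughly what I mean, as a sketch - LINEAR_CUTOFF is a placeholder
that would have to be measured:

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define LINEAR_CUTOFF 16    /* placeholder, needs benchmarking */

/* look up one TID key in a sorted run; below the cutoff a plain
 * sequential scan avoids the branch mispredictions and scattered
 * loads of the bsearch */
static bool
tid_lookup(const uint32_t *arr, size_t n, uint32_t key)
{
    if (n <= LINEAR_CUTOFF)
    {
        for (size_t i = 0; i < n && arr[i] <= key; i++)
            if (arr[i] == key)
                return true;
        return false;
    }

    /* otherwise the usual branchless binary search */
    const uint32_t *base = arr;
    size_t len = n;
    while (len > 1)
    {
        size_t half = len / 2;
        base += (base[half - 1] < key) ? half : 0;
        len -= half;
    }
    return *base == key;
}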

Cheers
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
