Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-02-16 09:22:56
Message-ID: CAFBsxsGCv8+v6d6Fgi+K0Re+CXsW_m9uTimZL5Kt7asUJKgz4A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:
>
> On Tue, Feb 14, 2023 at 8:24 PM John Naylor
> <john(dot)naylor(at)enterprisedb(dot)com> wrote:

> > > I can think that something like traversing a HOT chain could visit
> > > offsets out of order. But fortunately we prune such collected TIDs
> > > before heap vacuum in heap case.
> >
> > Further, currently we *already* assume we populate the tid array in
order (for binary search), so we can just continue assuming that (with an
assert added since it's more public in this form). I'm not sure why such
basic common sense evaded me a few versions ago...
>
> Right. TidStore is implemented not only for heap, so loading
> out-of-order TIDs might be important in the future.

That's what I was probably thinking about some weeks ago, but I'm having a
hard time imagining how it would come up, even for something like the
conveyor-belt concept.

> We have the following WIP comment in test_radixtree:
>
> // WIP: compiles with warnings because rt_attach is defined but not used
> // #define RT_SHMEM
>
> How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
> and friends?

Sounds good to me, and the other fixes make sense as well.

> FYI I've briefly tested the TidStore with blocksize = 32kb, and it
> seems to work fine.

That was on my list, so great! How about the other end -- nominally we
allow 512b. (In practice it won't matter, but this would make sure I didn't
mess anything up when forcing all MaxTuplesPerPage to encode.)

> You removed the vacuum integration patch from v27, is there any reason
for that?

Just an oversight.

Now for some general comments on the tid store...

+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call tidstore_detach to free up backend-local memory
associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)

Do we need to do anything for this todo?

It might help readability to have a concept of "off_upper/off_lower", just
so we can describe things more clearly. The key is block + off_upper, and
the value is a bitmap of all the off_lower bits. I hinted at that in my
addition of encode_key_off(). Along those lines, maybe
s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even
sure the TIDSTORE_ prefix is valuable for these local macros.

The word "value" as a variable name is pretty generic in this context, and
it might be better to call it the off_lower_bitmap, at least in some
places. The "key" doesn't have a good short term for naming, but in
comments we should make sure we're clear it's "block# + off_upper".

I'm not a fan of the name "tid_i", even as a temp variable -- maybe
"compressed_tid"?

maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/

It might be worth using typedefs for key and value type. Actually, since
key type is fixed for the foreseeable future, maybe the radix tree template
should define a key typedef?

The term "result" is probably fine within the tidstore, but as a public
name used by vacuum, it's not very descriptive. I don't have a good idea,
though.

Some files in backend/access use CamelCase for public functions, although
it's not consistent. I think doing that for tidstore would help
readability, since they would stand out from rt_* functions and vacuum
functions. It's a matter of taste, though.

I don't understand the control flow in tidstore_iterate_next(), or when
BlockNumberIsValid() is true. If this is the best way to code this, it
needs more commentary.

Some comments on vacuum:

I think we'd better get some real-world testing of this, fairly soon.

I had an idea: If it's not too much effort, it might be worth splitting it
into two parts: one that just adds the store (not caring about its memory
limits or progress reporting etc). During index scan, check both the new
store and the array and log a warning (we don't want to exit or crash,
better to try to investigate while live if possible) if the result doesn't
match. Then perhaps set up an instance and let something like TPC-C run for
a few days. The second patch would just restore the rest of the current
patch. That would help reassure us it's working as designed. Soon I plan to
do some measurements with vacuuming large tables to get some concrete
numbers that the community can get excited about.

We also want to verify that progress reporting works as designed and has no
weird corner cases.

* autovacuum_work_mem) memory space to keep track of dead TIDs. We
initially
...
+ * create a TidStore with the maximum bytes that can be used by the
TidStore.

This kind of implies that we allocate the maximum bytes upfront. I think
this sentence can be removed. We already mentioned in the previous
paragraph that we set an upper bound.

- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in
%u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));

I don't think the format string has to change, since num_tids was changed
back to int64 in an earlier patch version?

- * the memory space for storing dead items allocated in the DSM segment.
We
[a lot of whitespace adjustment]
+ * the shared TidStore. We launch parallel worker processes at the start of

The old comment still seems mostly ok? Maybe just s/DSM segment/DSA area/
or something else minor.

- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);

If we're starting from the minimum, "estimate" doesn't really describe it
anymore? Maybe "Initial size"?
What does dsa_minimum_size() work out to in practice? 1MB?
Also, I think PARALLEL_VACUUM_KEY_DSA is left over from an earlier patch.

Lastly, on the radix tree:

I find extend, set, and set_extend hard to keep straight when studying the
code. Maybe EXTEND -> EXTEND_UP , SET_EXTEND -> EXTEND_DOWN ?

RT_ITER_UPDATE_KEY is unused, but I somehow didn't notice when turning it
into a template.

+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);

+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)

These comments don't really help readers unfamiliar with the code. The
iteration coding in general needs clearer description.

In the test:

+ 4, /* RT_NODE_KIND_4 */

The small size was changed to 3 -- if this test needs to know the max size
for each kind (class?), I wonder why it didn't fail. Should it? Maybe we
need symbols for the various fanouts.

I also want to mention now that we better decide soon if we want to support
shrinking of nodes for v16, even if the tidstore never shrinks. We'll need
to do it at some point, but I'm not sure if doing it now would make more
work for future changes targeting highly concurrent workloads. If so, doing
it now would just be wasted work. On the other hand, someone might have a
use that needs deletion before someone else needs concurrency. Just in
case, I have a start of node-shrinking logic, but needs some work because
we need the (local pointer) parent to update to the new smaller node, just
like the growing case.

--
John Naylor
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melih Mutlu 2023-02-16 09:33:37 Re: Allow logical replication to copy tables in binary format
Previous Message Melih Mutlu 2023-02-16 09:18:53 Re: Allow logical replication to copy tables in binary format