From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-09-16 18:58:50
Message-ID: CAH2-Wzko0R8baJStEQWmwfUEvvS6JmozJ4X7RU8arzZdZjay5w@mail.gmail.com
Lists: pgsql-hackers
On Mon, Sep 16, 2019 at 8:48 AM Anastasia Lubennikova
<a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
> Attached is v14 based on v12 (v13 changes are not merged).
>
> In this version, I fixed the bug you mentioned and also fixed nbtinsert,
> so that it doesn't save newposting in xlog record anymore.
Cool.
> I tested patch with nbtree_wal_test, and found out that the real issue is
> not the dedup WAL records themselves, but the full page writes that they trigger.
> Here are test results (config is standard, except fsync=off to speedup tests):
>
> 'FPW on' and 'FPW off' are tests on v14.
> NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().
I think that it makes sense to focus on synthetic cases without
FPWs/FPIs from checkpoints. At least for now.
> With random insertions into btree it's highly possible that deduplication will often be
> the first write after checkpoint, and thus will trigger FPW, even if only a few tuples were compressed.
I find that hard to believe. Deduplication only occurs when we're
about to split the page. If that's almost as likely to occur as a
simple insert, then we're in big trouble (maybe it's actually true,
but if it is then that's the real problem). Also, fewer pages for the
index naturally leads to far fewer FPIs after a checkpoint.
I used "pg_waldump -z" and "pg_waldump --stats=record" to evaluate the
same case on v13. It was practically the same as the master branch,
apart from the huge difference in FPIs for the XLOG rmgr. Aside from
that one huge difference, there was a similar volume of the same types
of WAL records in each case. Mostly leaf inserts, and far fewer
internal page inserts. I suppose this isn't surprising.
It probably makes sense for the final version of the patch to increase
the volume of WAL a little overall, since the savings for internal
page stuff cannot make up for the cost of having to WAL log something
extra (deduplication operations) on leaf pages, regardless of the size
of those extra dedup WAL records (I am ignoring FPIs after a
checkpoint in this analysis). So the patch is more or less certain to
add *some* WAL overhead in cases that benefit, and that's okay. But,
it adds way too much WAL overhead today (even in v14), for reasons
that we don't understand yet, which is not okay.
I may have misunderstood your approach to WAL-logging in v12. I
thought that you were WAL-logging things that didn't change, which
doesn't seem to be the case with v14. I thought that v12 was very
similar to v11 (and my v13) in terms of how _bt_dedup_one_page() does
its WAL-logging. v14 looks good, though.
"pg_waldump -z" and "pg_waldump --stats=record" will break down the
contributing factor of FPIs, so it should be possible to account for
the overhead in the test case exactly. We can debug the problem by
using pg_waldump to count the absolute number of each type of record,
and the size of each type of record.
(Thinks some more...)
I think that the problem here is that you didn't copy this old code
from _bt_split() over to _bt_dedup_one_page():
/*
* Copy the original page's LSN into leftpage, which will become the
* updated version of the page. We need this because XLogInsert will
* examine the LSN and possibly dump it in a page image.
*/
PageSetLSN(leftpage, PageGetLSN(origpage));
isleaf = P_ISLEAF(oopaque);
Note that this happens at the start of _bt_split() -- the temp page
buffer based on origpage starts out with the same LSN as origpage.
This is an important step of the WAL volume optimization used by
_bt_split().
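For reference, here is a minimal sketch of how the same step could look
near the top of _bt_dedup_one_page(), assuming it builds the new page in
a temp copy the way _bt_split() builds leftpage (the variable names are
illustrative, not taken from the patch):

Page        page = BufferGetPage(buf);
Page        newpage = PageGetTempPageCopySpecial(page);

/*
 * As in _bt_split(): carry the original page's LSN over to the temp
 * page, because XLogInsert will examine the LSN and possibly dump it
 * in a page image.
 */
PageSetLSN(newpage, PageGetLSN(page));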
> That's why there is no significant difference with log_newpage_buffer() approach.
> And that's why "lazy" deduplication doesn't help to decrease amount of WAL.
The term "lazy deduplication" is seriously overloaded here. I think
that this could cause miscommunications. Let me list the possible
meanings of that term here:
1. First of all, the basic approach to deduplication is already lazy,
unlike GIN, in the sense that _bt_dedup_one_page() is called to avoid
a page split. I'm 100% sure that we both think that that works well
compared to an eager approach (like GIN's).
2. Second of all, there is the need to incrementally WAL log. It looks
like v14 does that well, in that it doesn't create
"xlrec_dedup.n_intervals" space when it isn't truly needed. That's
good.
3. Third, there is incremental writing of the page itself -- avoiding
using a temp buffer. Not sure where I stand on this.
4. Finally, there is the possibility that we could make deduplication
incremental, in order to avoid work that won't be needed altogether --
this would probably be combined with 3. Not sure where I stand on
this, either.
We should try to be careful when using these terms, as there is a very
real danger of talking past each other.
> Another, and more realistic approach is to make deduplication less intensive:
> if freed space is less than some threshold, fall back to not changing page at all and not generating xlog record.
I see that v14 uses the "dedupInterval" struct, which provides a
logical description of a deduplicated set of tuples. That general
approach is at least 95% of what I wanted from the
_bt_dedup_one_page() WAL-logging.
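Just so we're describing the same thing, here is a rough sketch of the
kind of logical description I have in mind -- the field names are made
up, and the v14 struct may well differ:

/*
 * Hypothetical sketch only: one deduplication interval, describing a
 * contiguous group of equal tuples that were merged into a single
 * posting tuple.  REDO can rebuild the posting list from the base
 * tuple's offset plus a count, without needing the posting list
 * itself in the WAL record.
 */
typedef struct dedupInterval
{
    OffsetNumber    baseoff;    /* page offset of the base (first) tuple */
    uint16          nitems;     /* number of tuples merged into it */
} dedupInterval;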
> Probably that was the reason, why patch became faster after I added BT_COMPRESS_THRESHOLD in early versions,
> not because deduplication itself is cpu bound or something, but because WAL load decreased.
I think so too -- BT_COMPRESS_THRESHOLD definitely makes compression
faster as things are. I am not against bringing back
BT_COMPRESS_THRESHOLD. I just don't want to do it right now because I
think that it's a distraction. It may hide problems that we want to
fix. Like the PageSetLSN() problem I mentioned just now, and maybe
others.
We will definitely need to have page space accounting that's a bit
similar to nbtsplitloc.c, to avoid the case where a leaf page is 100%
full (or has 4 bytes left, or something). That happens regularly now.
That must start with teaching _bt_dedup_one_page() about how much
space it will free. Basing it on the number of items on the page or
whatever is not going to work that well.
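Something along these lines (purely illustrative -- the helper and its
arguments are made up) is the kind of projection I mean:

/*
 * Purely illustrative: net space freed by replacing a group of "nitems"
 * equal tuples, whose combined MAXALIGN'd size is "groupsz", with a
 * single posting tuple of size "postingsz".
 */
static Size
dedup_space_freed(Size groupsz, Size postingsz, int nitems)
{
    Size        freed;

    /* the old tuples and their line pointers go away... */
    freed = groupsz + nitems * sizeof(ItemIdData);
    /* ...and the new posting tuple plus one line pointer replaces them */
    freed -= MAXALIGN(postingsz) + sizeof(ItemIdData);

    return freed;
}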
I think that it would be possible to have something like
BT_COMPRESS_THRESHOLD to prevent thrashing, and *also* make the
deduplication incremental, in the sense that it can give up on
deduplication when it frees enough space (i.e. something like v13's
0002-* patch). I said that these two things are closely related, which
is true, but it's also true that they don't overlap.
Don't forget the reason why I removed BT_COMPRESS_THRESHOLD: Doing so
made a handful of specific indexes (mostly from TPC-H) significantly
smaller. I never tried to debug the problem. It's possible that we
could bring back BT_COMPRESS_THRESHOLD (or something fillfactor-like),
but not use it on rightmost pages, and get the best of both worlds.
IIRC right-heavy low cardinality indexes (e.g. a low cardinality date
column) were improved by removing BT_COMPRESS_THRESHOLD, but we can
debug that when the time comes.
> So I propose to develop this idea. The question is how to choose threshold.
> I wouldn't like to introduce new user settings. Any ideas?
I think that there should be a target fill factor that sometimes makes
deduplication leave a small amount of free space. Maybe that means
that the last posting list on the page is made a bit smaller than the
other ones. It should be "goal orientated".
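Something like the following (purely illustrative, with made-up names --
"spacefreed" stands for a running total of bytes freed so far by the
current pass) expresses the stopping point as a goal rather than a
hard-coded threshold:

/* Made-up constant for illustration: aim to leave ~5% of the page free */
#define DEDUP_TARGET_FILLFACTOR     95

Size        targetfree = BLCKSZ - (BLCKSZ * DEDUP_TARGET_FILLFACTOR) / 100;

if (spacefreed >= targetfree)
    break;                      /* goal met; leave remaining tuples alone */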
The loop within _bt_dedup_one_page() is very confusing in both v13 and
v14 -- I couldn't figure out why the accounting worked like this:
> + /*
> + * Project size of new posting list that would result from merging
> + * current tup with pending posting list (could just be prev item
> + * that's "pending").
> + *
> + * This accounting looks odd, but it's correct because ...
> + */
> + projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
> + (dedupState->ntuples + itup_ntuples + 1) *
> + sizeof(ItemPointerData));
Why the "+1" here?
I have significantly refactored the _bt_dedup_one_page() loop in a way
that seems like a big improvement. It allowed me to remove all of the
small palloc() calls inside the loop, apart from the
BTreeFormPostingTuple() palloc()s. It's also a lot faster -- it seems
to have shaved about 2 seconds off the "land" unlogged table test,
which was originally about 1 minute 2 seconds with v13's 0001-* patch
(and without v13's 0002-* patch).
It seems like it can easily be integrated with the approach to WAL
logging taken in v14, so everything can be integrated soon. I'll work
on that.
> I also noticed that the number of checkpoints differ between tests:
> select checkpoints_req from pg_stat_bgwriter ;
> And I struggle to explain the reason of this.
> Do you understand what can cause the difference?
I imagine that the additional WAL volume triggered a checkpoint
earlier than in the more favorable test, which indirectly triggered
more FPIs, which contributed to triggering a checkpoint even
earlier...and so on. Synthetic test cases can avoid this. A useful
synthetic test should have no checkpoints at all, so that we can see
the broken down costs, without any second order effects that add more
cost in weird ways.
--
Peter Geoghegan