Re: Confine vacuum skip logic to lazy_scan_skip

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>
Subject: Re: Confine vacuum skip logic to lazy_scan_skip
Date: 2025-01-18 16:51:08
Message-ID: 85696b8e-f1bf-459e-ba97-5608c644c185@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/16/25 02:45, Melanie Plageman wrote:
> On Sun, Dec 15, 2024 at 10:10 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> I've been looking at some other vacuum-related patches, so I took a look
>> at these remaining bits too. I don't have much to say about the code
>> (seems perfectly fine to me), so I decided to do a bit of testing.
>
> Thanks for doing this!
>
>> I did a simple test (see the attached .sh script) that runs vacuum on a
>> table with different fractions of rows updated. The table has ~100 rows
>> per page, with 50% fill factor, and the updates touch ~1%, 0.1%, 0.05%
>> rows, and then even smaller fractions up to 0.0001%. This determines how
>> many pages get touched. With 1% fraction almost every page gets
>> modified, then the fraction quickly drops. With 0.0001% only about 0.01%
>> of pages gets modified.
>>
>> Attached is a CSV with raw results from two machines, and also a PDF
>> with a comparison of the two build (master vs. patched). In vast
>> majority of cases, the patched build is much faster, usually 2-3x.
>>
>> There are a couple cases where it regresses by ~30%, but only on one of
>> the machines with older storage (RAID on SATA SSDs), with 1% rows
>> updated (which means almost all pages get modified). So this is about
>> sequential access. It's a bit weird, but probably not a fatal issue.
>
> Actually, while rebasing these with the intent to start investigating
> the regressions you report, I noticed something quite wrong with my
> code. In lazy_scan_heap(), I had put read_stream_next_buffer() before
> a few expensive operations (like a round of index vacuuming and dead
> item reaping if the TID store is full). It returns the pinned buffer,
> so this could mean a buffer remaining pinned for a whole round of
> vacuuming of items from the TID store. Not good. Anyway, this version
> has fixed that. I do wonder if there is any chance that this affected
> your benchmarks.
>
> I've attached a new v13. Perhaps you could give it another go and see
> if the regressions are still there?
>

Sure. I repeated the benchmark with v13, and it seems the behavior did
change. I no longer see the "big" regression when most of the pages get
updated (and need vacuuming).

I can't be 100% sure this is due to changes in the patch, because I did
some significant upgrades to the machine since that time - it has Ryzen
9900x instead of the ancient i5-2500k, new mobo/RAM/... It's pretty
much a new machine, I only kept the "old" SATA SSD RAID storage so that
I can do some tests with non-NVMe.

So there's a (small) chance the previous runs were hitting a bottleneck
that does not exist on the new hardware.

Anyway, just to make this information more complete, the machine now has
this configuration:

* Ryzen 9 9900x (12/24C), 64GB RAM
* storage:
- data: Samsung SSD 990 PRO 4TB (NVMe)
- raid-nvme: RAID0 4x Samsung SSD 990 PRO 1TB (NVMe)
- raid-sata: RAID0 6x Intel DC3700 100GB (SATA)

Attached is the script, raw results (CSV) and two PDFs summarizing the
results as a pivot table for different test parameters. Compared to the
earlier run I tweaked the script to also vary io_combine_limit (ioc), as
I wanted to see how it interacts with effective_io_concurrency (eic).

Looking at the new results, I don't see any regressions, except for two
cases - data (single NVMe) and raid-nvme (4x NVMe). There's a small area
of regression for eic=32 and perc=0.0005, but only with WAL-logging.

I'm not sure this is worth worrying about too much. It's a heuristics
and for every heuristics there's some combination parameters where it
doesn't quite do the optimal thing. The area where the patch brings
massive improvements (or does not regress) are much more significant.

I personally am happy with this behavior, seems to be performing fine.

>> There is also a couple smaller regressions on the "xeon" machine with
>> M.2 SSD, for lower update fractions - 0.05% rows updated means 5% pages
>> need vacuum. But I think this is more a sign of us being too aggressive
>> in detecting (and forcing) sequential patterns - on master, we end up
>> scanning 50% pages, thanks to SKIP_PAGES_THRESHOLD. I'll start a new
>> thread about that ...
>
> Hmm. If only 5% of pages need vacuum, then you are right, readahead
> seems like it would often be wasted (depending on _which_ 5% needs
> vacuuming). But, the read stream API will only prefetch and build up
> larger I/Os of blocks we actually need. So, it seems like this would
> behave the same on master and with the patch. That is, both would do
> extra unneeded I/O because of SKIP_PAGES_THRESHOLD. Is the 5% of the
> table that needs vacuuming dispersed randomly throughout or
> concentrated?
>
> If it is concentrated and readahead would be useful, then maybe we
> need to increase read_ahead_kb. You mentioned off-list that the
> read_ahead_kb on this machine for this SSD was 128kB -- the same as
> io_combine_limit. If we want to read ahead, issuing 128 kB I/Os might
> be thwarting us and increasing latency.
>

I haven't ran the new benchmark on the "xeon" machine, but I believe
this is the same regression that I mentioned above. I mean, it's the
same "area" where it happens, also for NVMe storage, so I guess it's has
the same cause. But I don't have any particular insight into this.

FWIW there's one more interesting thing - entirely unrelated to this
patch, but visible in the "eic vs. ioc" results, which are meant to show
how these two GUC interact. For (eic <= 1) the ioc doesn't seem to
matter very much, but for (eic > 1) it's clear higher ioc values are
better. Except that it "breaks" at 512kB, because I didn't realize we
allow ioc only up to 256kB. So 512kB does nothing, it just defaults to
the 128kB limit.

So this brings two questions:

* Does it still make sense to default to eic=1? For this particular test
increasing eic=4 often cuts the duration in half (especially on nvme
storage).

* Why are we limiting ioc to <= 256kB? Per the benchmark it seems it
might be beneficial to set even higher values.

regards

--
Tomas Vondra

Attachment Content-Type Size
ioc-eic.pdf application/pdf 63.4 KB
perc.pdf application/pdf 55.0 KB
results.csv.gz application/gzip 293.0 KB
run-vacuum-stream.sh application/x-shellscript 3.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2025-01-18 17:00:04 Re: Statistics Import and Export
Previous Message David E. Wheeler 2025-01-18 16:47:19 Re: abi-compliance-checker