Quick Links

Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

From:	Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED
Date:	2024-03-05 11:07:01
Message-ID:	CAN55FZ3pZs77=6Na__eK23f_UvZfZAgxSUexHVrnqLvwoqNJ8w@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

Thanks for working on this!

The patches are cleanly applied on top of the current master and all
tests are passed.

On Thu, 4 Jan 2024 at 02:23, Cedric Villemain
<Cedric(dot)Villemain+pgsql(at)abcsql(dot)com> wrote:
>
> Hi,
>
> I wonder what you think of making pg_prewarm use recent addition on
> smgrprefetch and readv ?
>
>
> In order to try, I did it anyway in the attached patches. They contain
> no doc update, but I will proceed if it is of interest.
>
> In summary:
>
> 1. The first one adds a new check on parameters (checking last block is
> indeed not before first block).
> Consequence is an ERROR is raised instead of silently doing nothing.

This is a general improvement and can be committed without other patches.

> 2. The second one does implement smgrprefetch with range and loops by
> default per segment to still have a check for interrupts.

It looks good codewise but RELSEG_SIZE is too big to prefetch. Man
page of posix_fadvise [1] states that: "The amount of data read may be
decreased by the kernel depending on virtual memory load. (A few
megabytes will usually be fully satisfied, and more is rarely
useful.)". It is trying to prefetch 1GB data now. That could explain
your observation about differences between nr_cache numbers.

> 3. The third one provides smgrreadv instead of smgrread, by default on
> a range of 8 buffers. I am absolutely unsure that I used readv correctly...

Looks good to me.

> Q: posix_fadvise may not work exactly the way you think it does, or does
> it ?
>
>
> In details, and for the question:
>
> It's not so obvious that the "feature" is really required or wanted,
> depending on what are the expectations from user point of view.
>
> The kernel decides on what to do with posix_fadvise calls, and how we
> pass parameters does impact the decision.
> With the current situation where prefetch is done step by step, block by
> block, they are very probably most of the time all loaded even if those
> from the beginning of the relation can be discarded at the end of the
> prefetch.
>
> However, if instead you provide a real range, or the magic len=0 to
> posix_fadvise, then blocks are "more" loaded according to effective vm
> pressure (which is not the case on the previous example).
> As a result only a small part of the relation might be loaded, and this
> is probably not what end-users expect despite being probably a good
> choice (you can still free cache beforehand to help the kernel).
>
> An example, below I'm using vm_relation_cachestat() which provides linux
> cachestat output, and vm_relation_fadvise() to unload cache, and
> pg_prewarm for the demo:
>
> # clear cache: (nr_cache is the number of file system pages in cache,
> not postgres blocks)
>
> ```
> postgres=# select block_start, block_count, nr_pages, nr_cache from
> vm_relation_cachestat('foo',range:=1024*32);
> block_start | block_count | nr_pages | nr_cache
> -------------+-------------+----------+----------
> 0 | 32768 | 65536 | 0
> 32768 | 32768 | 65536 | 0
> 65536 | 32768 | 65536 | 0
> 98304 | 32768 | 65536 | 0
> 131072 | 1672 | 3344 | 0
> ```
>
> # load full relation with pg_prewarm (patched)
>
> ```
> postgres=# select pg_prewarm('foo','prefetch');
> pg_prewarm
> ------------
> 132744
> (1 row)
> ```
>
> # Checking results:
>
> ```
> postgres=# select block_start, block_count, nr_pages, nr_cache from
> vm_relation_cachestat('foo',range:=1024*32);
> block_start | block_count | nr_pages | nr_cache
> -------------+-------------+----------+----------
> 0 | 32768 | 65536 | 320
> 32768 | 32768 | 65536 | 0
> 65536 | 32768 | 65536 | 0
> 98304 | 32768 | 65536 | 0
> 131072 | 1672 | 3344 | 320 <-- segment 1
>
> ```
>
> # Load block by block and check:
>
> ```
> postgres=# select from generate_series(0, 132743) g(n), lateral
> pg_prewarm('foo','prefetch', 'main', n, n);
> postgres=# select block_start, block_count, nr_pages, nr_cache from
> vm_relation_cachestat('foo',range:=1024*32);
> block_start | block_count | nr_pages | nr_cache
> -------------+-------------+----------+----------
> 0 | 32768 | 65536 | 65536
> 32768 | 32768 | 65536 | 65536
> 65536 | 32768 | 65536 | 65536
> 98304 | 32768 | 65536 | 65536
> 131072 | 1672 | 3344 | 3344
>
> ```
>
> The duration of the last example is also really significant: full
> relation is 0.3ms and block by block is 1550ms!
> You might think it's because of generate_series or whatever, but I have
> the exact same behavior with pgfincore.
> I can compare loading and unloading duration for similar "async" work,
> here each call is from block 0 with len of 132744 and a range of 1 block
> (i.e. posix_fadvise on 8kB at a time).
> So they have exactly the same number of operations doing DONTNEED or
> WILLNEED, but distinct duration on the first "load":
>
> ```
>
> postgres=# select * from
> vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
> vm_relation_fadvise
> ---------------------
>
> (1 row)
>
> Time: 25.202 ms
> postgres=# select * from
> vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
> vm_relation_fadvise
> ---------------------
>
> (1 row)
>
> Time: 1523.636 ms (00:01.524) <----- not free !
> postgres=# select * from
> vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
> vm_relation_fadvise
> ---------------------
>
> (1 row)
>
> Time: 24.967 ms
> ```

I confirm that there is a time difference between calling pg_prewarm
by full relation and block by block, but IMO this is expected. When
pg_prewarm is called by full relation, it does the initialization part
just once but when it is called block by block, it does initialization
for each call, right?

I run 'select pg_prewarm('foo','prefetch', 'main', n, n) FROM
generate_series(0, 132744)n;' a couple of times consecutively but I
could not see the time difference between first run (first load) and
the consecutive runs. Am I doing something wrong?

[1] https://man7.org/linux/man-pages/man2/posix_fadvise.2.html#DESCRIPTION

--
Regards,
Nazir Bilal Yavuz
Microsoft

In response to

Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED at 2024-01-03 23:23:43 from Cedric Villemain

Responses

Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED at 2024-03-06 15:23:01 from Cédric Villemain

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Alvaro Herrera	2024-03-05 11:18:05	Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Previous Message	Daniel Gustafsson	2024-03-05 10:50:36	Re: Adding deprecation notices to pgcrypto documentation