Re: WIP: WAL prefetch (another approach)

From: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Daniel Gustafsson <daniel(at)yesql(dot)se>, Stephen Frost <sfrost(at)snowman(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2021-12-10 08:39:57
Message-ID: CAE9k0Pnw7sc9i2RvPkRqOFZz=2B6+ctahDK_SoKq0g0HYD9FMA@mail.gmail.com
Lists: pgsql-hackers

Hi Thomas,

I am unable to apply this new set of patches on HEAD. Could you please
share a rebased patch, or point me to a working branch that I can refer
to for the changes?

--
With Regards,
Ashutosh Sharma.

On Tue, Nov 23, 2021 at 3:44 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:

> On Mon, Nov 15, 2021 at 11:31 PM Daniel Gustafsson <daniel(at)yesql(dot)se>
> wrote:
> > Could you post an updated version of the patch which is for review?
>
> Sorry for taking so long to come back; I learned some new things that
> made me want to restructure this code a bit (see below). Here is an
> updated pair of patches that I'm currently testing.
>
> Old problems:
>
> 1. Last time around, an infinite loop was reported in pg_waldump. I
> believe Horiguchi-san has fixed that[1], but I'm no longer depending
> on that patch. I thought his patch set was a good idea, but it's
> complicated and there's enough going on here already... let's consider
> that independently.
>
> This version goes back to what I had earlier, though (I hope) it is
> better about how "nonblocking" states are communicated. In this
> version, XLogPageRead() has a way to give up part way through a record
> if it doesn't have enough data and there are queued up records that
> could be replayed right now. In that case, we'll go back to the
> beginning of the record (and occasionally, back a WAL page) next time
> we try. That's the cost of not maintaining intra-record decoding
> state.
>
> 2. Last time around, we could try to allocate a crazy amount of
> memory when reading garbage past the end of the WAL. Fixed, by
> validating first, like in master.
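>
> To make point 2 concrete, here is a rough standalone sketch of the
> "validate before allocating" idea (made-up names and limits, not the
> actual xlogreader code): sanity-check the length field read from disk
> before trusting it for an allocation, so garbage past the end of the
> WAL can't trigger a huge allocation.
>
> #include <stdint.h>
> #include <stdlib.h>
>
> #define MAX_SANE_RECORD_LEN (1024 * 1024 * 1024)  /* arbitrary cap */
> #define MIN_RECORD_HEADER_LEN 24                  /* illustrative only */
>
> void *
> allocate_record_buffer(uint32_t total_len_from_disk)
> {
>     /* Reject lengths that can't possibly belong to a valid record. */
>     if (total_len_from_disk < MIN_RECORD_HEADER_LEN ||
>         total_len_from_disk > MAX_SANE_RECORD_LEN)
>         return NULL;            /* treat as invalid, don't allocate */
>
>     return malloc(total_len_from_disk);
> }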
>
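> And a rough sketch of the "give up partway through a record" behaviour
> described in point 1, with hypothetical names and types rather than
> the patch's real interfaces:
>
> #include <stdbool.h>
> #include <stdint.h>
>
> typedef uint64_t LSN;
>
> typedef enum
> {
>     DECODE_OK,          /* one more record decoded and queued */
>     DECODE_WOULD_BLOCK, /* gave up; replay what's queued, retry later */
>     DECODE_NEED_DATA    /* nothing queued either; must wait for WAL */
> } DecodeResult;
>
> typedef struct Decoder
> {
>     LSN  record_start;   /* where the current record begins */
>     LSN  available_upto; /* how much WAL we can read right now */
>     int  queued_records; /* decoded records not yet replayed */
> } Decoder;
>
> DecodeResult
> try_decode_one(Decoder *d, LSN record_end, bool nonblocking)
> {
>     if (record_end <= d->available_upto)
>     {
>         d->queued_records++;
>         d->record_start = record_end;   /* next record starts here */
>         return DECODE_OK;
>     }
>
>     /*
>      * Not enough data for the whole record.  In nonblocking mode, if
>      * there is queued work to replay, give up now; no intra-record
>      * state is kept, so the next attempt restarts from record_start
>      * (possibly re-reading a WAL page).
>      */
>     if (nonblocking && d->queued_records > 0)
>         return DECODE_WOULD_BLOCK;
>
>     return DECODE_NEED_DATA;
> }
>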
> New work:
>
> Since last time, I went away and worked on a "real" AIO version of
> this feature. That's ongoing experimental work for a future proposal,
> but I have a working prototype and I aim to share that soon, when that
> branch is rebased to catch up with recent changes. In that version,
> the prefetcher starts actual reads into the buffer pool, and recovery
> receives already pinned buffers attached to the stream of records it's
> replaying.
>
> That inspired a couple of refactoring changes to this non-AIO version,
> to minimise the difference and anticipate the future work better:
>
> 1. The logic for deciding which block to start prefetching next is
> moved into a new callback function in a sort of standard form (this is
> approximately how all/most prefetching code looks in the AIO project,
> ie sequential scans, bitmap heap scan, etc).
>
> 2. The logic for controlling how many IOs are running and deciding
> when to call the above is in a separate component. In this non-AIO
> version, it works using a simple ring buffer of LSNs to estimate the
> number of in-flight I/Os, just like before. This part would be thrown
> away and replaced with the AIO branch's centralised "streaming read"
> mechanism which tracks I/O completions based on a stream of completion
> events from the kernel (or I/O worker processes).
>
> 3. In this version, the prefetcher still doesn't pin buffers, for
> simplicity. That work did force me to study places where WAL streams
> need prefetching "barriers", though, so in this patch you can
> see that it's now a little more careful than it probably needs to be.
> (It doesn't really matter much if you call posix_fadvise() on a
> non-existent file region, or the wrong file after OID wraparound and
> reuse, but it would matter if you actually read it into a buffer, and
> if an intervening record might be trying to drop something you have
> pinned).
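>
> The hint in question is just plain posix_fadvise(). A minimal
> standalone example of issuing one (not taken from the patch; the path
> and sizes are made up):
>
> /*
>  * posix_fadvise() is only advice, so hinting a region that doesn't
>  * exist (or the wrong file) is harmless; actually reading it into a
>  * shared buffer would not be.
>  */
> #define _POSIX_C_SOURCE 200112L
> #include <fcntl.h>
> #include <stdio.h>
> #include <string.h>
>
> int
> main(void)
> {
>     int fd = open("/tmp/some-wal-segment", O_RDONLY);  /* example path */
>
>     if (fd >= 0)
>     {
>         /* Ask the kernel to start reading 8kB at offset 0 in the background. */
>         int rc = posix_fadvise(fd, 0, 8192, POSIX_FADV_WILLNEED);
>
>         /* Returns an errno value directly; failure just drops the hint. */
>         if (rc != 0)
>             fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));
>     }
>     return 0;
> }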
>
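> To sketch the shape described in points 1 and 2 (illustrative names
> only, not the real code): a callback supplies the next block reference
> to prefetch, and a small ring of LSNs estimates how many hints are
> still in flight, i.e. issued for records that replay hasn't reached
> yet.
>
> #include <stdbool.h>
> #include <stdint.h>
>
> typedef uint64_t LSN;
>
> typedef struct BlockRef
> {
>     uint32_t relnode;    /* which relation file */
>     uint32_t blockno;    /* which block within it */
>     LSN      lsn;        /* LSN of the record that references it */
> } BlockRef;
>
> /* Callback: fill *out with the next block to prefetch, or return false. */
> typedef bool (*PrefetchNextFn) (void *arg, BlockRef *out);
>
> #define RING_SIZE 64     /* power of two; caps the I/O depth estimate */
>
> typedef struct PrefetchRing
> {
>     LSN      lsns[RING_SIZE];
>     uint32_t head;       /* next slot to fill */
>     uint32_t tail;       /* oldest hint not yet passed by replay */
> } PrefetchRing;
>
> /* Issue more hints until the ring (our I/O depth budget) is full. */
> void
> prefetch_more(PrefetchRing *ring, PrefetchNextFn next, void *arg)
> {
>     BlockRef ref;
>
>     while (ring->head - ring->tail < RING_SIZE && next(arg, &ref))
>     {
>         /* ... posix_fadvise() the referenced block here ... */
>         ring->lsns[ring->head % RING_SIZE] = ref.lsn;
>         ring->head++;
>     }
> }
>
> /*
>  * Called as replay advances: forget hints we've replayed past and
>  * return the estimated number of I/Os still in flight.
>  */
> uint32_t
> prefetch_advance(PrefetchRing *ring, LSN replayed_upto)
> {
>     while (ring->tail != ring->head &&
>            ring->lsns[ring->tail % RING_SIZE] <= replayed_upto)
>         ring->tail++;
>     return ring->head - ring->tail;
> }
>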
> Some other changes:
>
> 1. I dropped the GUC recovery_prefetch_fpw. I think it was a
> possibly useful idea but it's a niche concern and not worth worrying
> about for now.
>
> 2. I simplified the stats. Coming up with a good running average
> system seemed like a problem for another day (the numbers before were
> hard to interpret). The new stats are super simple counters and
> instantaneous values:
>
> postgres=# select * from pg_stat_prefetch_recovery ;
> -[ RECORD 1 ]--+------------------------------
> stats_reset | 2021-11-10 09:02:08.590217+13
> prefetch | 13605674 <- times we called posix_fadvise()
> hit | 24185289 <- times we found pages already cached
> skip_init | 217215 <- times we did nothing because init, not read
> skip_new | 192347 <- times we skipped because relation too small
> skip_fpw | 27429 <- times we skipped because fpw, not read
> wal_distance | 10648 <- how far ahead in WAL bytes
> block_distance | 134 <- how far ahead in block references
> io_depth | 50 <- fadvise() calls not yet followed by pread()
>
> I also removed the code to save and restore the stats via the stats
> collector, for now. I figured that persistent stats could be a later
> feature, perhaps after the shared memory stats stuff?
>
> 3. I dropped the code that was caching an SMgrRelation pointer to
> avoid smgropen() calls that showed up in some profiles. That probably
> lacked invalidation that could be done with some more WAL analysis,
> but I decided to leave it out completely for now for simplicity.
>
> 4. I dropped the verbose logging. I think it might make sense to
> integrate with the new "recovery progress" system, but I think that
> should be a separate discussion. If you want to see the counters
> after crash recovery finishes, you can look at the stats view.
>
> [1] https://commitfest.postgresql.org/34/2113/
>
