Re: bgwrite process is too lazy

From: Andres Freund <andres(at)anarazel(dot)de>
To: wenhui qiu <qiuwenhuifx(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Tony Wayne <anonymouslydark3(at)gmail(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bgwrite process is too lazy
Date: 2024-10-04 17:49:23
Message-ID: cixso3buqeddrsqh3cf4svus3dakho2jwvohstwz64aqttg647@pqd4kwtdcso7
Lists: pgsql-hackers

Hi,

On 2024-10-04 09:31:45 +0800, wenhui qiu wrote:
> > It's implied, but to make it more explicit: One big efficiency advantage of
> > writes by checkpointer is that they are sorted and can often be combined into
> > larger writes. That's often a lot more efficient: For network attached storage
> > it saves you iops, for local SSDs it's much friendlier to wear leveling.
>
> Thank you for the explanation. I think bgwriter can also merge I/O: it writes
> asynchronously to the file system cache, and scheduling is left to the OS.

Because bgwriter writes are just ordered by their buffer id (further made less
sequential due to only writing out not-recently-used buffers), they are often
effectively random. The OS can't do much about that.
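
As a rough illustration of the difference, here is a toy Python sketch. It is
not PostgreSQL code, and the sizes and function names are made up for the
example. It assumes that two writes merge into one request only when the
second block directly follows the first, and it counts the requests needed
for the same set of dirty blocks in an effectively random order versus in
sorted order.

```python
import random


def count_write_runs(block_numbers):
    """Count write requests, assuming back-to-back adjacent blocks merge into one."""
    runs = 0
    previous = None
    for block in block_numbers:
        # A new request starts whenever this block does not directly
        # follow the block written just before it.
        if previous is None or block != previous + 1:
            runs += 1
        previous = block
    return runs


def simulate(num_dirty=1000, file_blocks=4000, seed=0):
    """Return (requests in random order, requests in sorted order)."""
    rng = random.Random(seed)
    # Hypothetical workload: dirty blocks scattered over one relation file,
    # encountered in an order unrelated to their position on disk.
    dirty = rng.sample(range(file_blocks), num_dirty)
    return count_write_runs(dirty), count_write_runs(sorted(dirty))


if __name__ == "__main__":
    unsorted_runs, sorted_runs = simulate()
    print(f"random order:    {unsorted_runs} write requests")
    print(f"sorted by block: {sorted_runs} write requests")
```

Sorting can never need more requests than any other order in this model,
because every mergeable run is a subset of a maximal group of consecutive
blocks. Real kernels do some merging of their own, so treat the numbers as
directional only.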

> > Another aspect is that checkpointer's writes are much easier to pace over time
> > than e.g. bgwriters, because bgwriter is triggered by a fairly short term
> > signal. Eventually we'll want to combine writes by bgwriter too, but that's
> > always going to be more expensive than doing it in a large batched fashion
> > like checkpointer does.
>
> > I think we could improve checkpointer's pacing further, fwiw, by taking into
> > account that the WAL volume at the start of a spread-out checkpoint typically
> > is bigger than at the end.
>
> I'm also very keen to improve checkpoints. Whenever I run a stress test,
> bgwriter does not write dirty pages when the data set is smaller than
> shared_buffers.

It *SHOULD NOT* do anything in that situation. There's absolutely nothing to
be gained by bgwriter writing in that case.

> Before the checkpoint, TPS was stable and at its highest for the entire
> stress test. Other databases flush dirty pages periodically and at
> dirty-page watermarks. They see a much smaller performance impact when
> checkpoints occur.

I doubt that slowdown is caused by bgwriter not being active enough. I suspect
what you're seeing is one or more of:

a) The overhead of doing full page writes (due to increasing the WAL
volume). You could verify whether that's the case by turning
full_page_writes off (but note that that's not generally safe!) or see if
the overhead shrinks if you set wal_compression=zstd or wal_compression=lz4
(don't use pglz, it's too slow).

b) The overhead of renaming WAL segments during recycling. You could see if
this is related by specifying --wal-segsize 512 or such during initdb.
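
For a sense of scale on b), here is a small Python calculation of how many
WAL segment files a given WAL volume spans at the default 16MB segment size
versus 512MB. The 64GiB volume is an arbitrary figure picked for the example,
not a measurement, and the helper name is made up.

```python
def segments_for_wal_volume(wal_bytes, segment_size_mb):
    """Number of WAL segment files needed to hold wal_bytes of WAL."""
    segment_bytes = segment_size_mb * 1024 * 1024
    # Ceiling division: a partially filled segment still occupies a file.
    return -(-wal_bytes // segment_bytes)


if __name__ == "__main__":
    wal_volume = 64 * 1024**3  # hypothetical: 64GiB of WAL between checkpoints
    for size_mb in (16, 512):
        count = segments_for_wal_volume(wal_volume, size_mb)
        print(f"--wal-segsize {size_mb}: {count} segments")
```

With these assumed numbers that is 4096 segments at 16MB against 128 at
512MB, so far fewer files to rename when segments are recycled.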

Greetings,

Andres
