Re: lru_multiplier and backend page write-outs

From: Peter Schuller <peter(dot)schuller(at)infidyne(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: lru_multiplier and backend page write-outs
Date: 2008-11-06 23:04:01
Message-ID: 20081106230400.GA95142@hyperion.scode.org
Lists: pgsql-performance

> > no table was ever large enough that 256k buffers would ever be filled by
> > the process of vacuuming a single table.
>
> Not 256K buffers--256K, 32 buffers.

Ok, got it: 256 kB, i.e. a ring of 32 buffers at the 8 kB page size.

> > In addition, when I say "constantly" above I mean that the count
> > increases even between successive SELECT:s (of the stat table) with
> > only a second or two in between.
>
> Writes to the database when only doing read operations are usually related
> to hint bits: http://wiki.postgresql.org/wiki/Hint_Bits

Sorry, I didn't mean to imply read-only operations (I did read the
hint bits information a while back, though). What I meant was that
while I was constantly generating the insert/delete/update activity,
I was selecting the bgwriter stats with only a second or two in
between. The intent was to convey that the count of backend-written
pages was increasing systematically and constantly (as in a few
hundred every handful of seconds), in spite of there being no
long-running vacuum and the buffer cache not being close to full.
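
For concreteness, the polling amounted to something like the sketch
below. This is a hypothetical reconstruction: it assumes psycopg2 and
a local database, though pg_stat_bgwriter and its columns are real as
of 8.3. Note the autocommit; stats views are snapshotted per
transaction, so a single long-lived transaction would keep showing
stale numbers:

#!/usr/bin/env python
# Poll pg_stat_bgwriter and print per-interval deltas. A steadily
# growing buffers_backend delta means backends are writing pages out
# themselves rather than the background writer doing it for them.
import time
import psycopg2

conn = psycopg2.connect("dbname=postgres")
conn.set_isolation_level(0)  # autocommit: fresh stats snapshot per poll
cur = conn.cursor()

prev = None
while True:
    cur.execute("SELECT buffers_backend, buffers_clean, buffers_alloc"
                " FROM pg_stat_bgwriter")
    row = cur.fetchone()
    if prev is not None:
        print("backend +%d  clean +%d  alloc +%d"
              % tuple(n - p for n, p in zip(row, prev)))
    prev = row
    time.sleep(2)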

> > On this topic btw, was it considered to allow the administrator to
> > specify a fixed-size margin to use when applying the JIT policy?
>
> Right now, there's no way to know exactly what's in the buffer cache
> without scanning the individual buffers, which requires locking their
> headers so you can see them consistently. No one process can get the big
> picture without doing something intrusive like that, and on a busy system
> the overhead of collecting more data to know exactly how far ahead the
> cleaning is can drag down overall performance. A lot can happen while the
> background writer is sleeping.

Understood.

> One next-generation design which has been sketched out but not even
> prototyped would take cleaned buffers and add them to the internal list of
> buffers that are free, which right now is usually empty on the theory that
> cached data is always more useful than a reserved buffer. If you
> developed a reasonable model for how many buffers you needed and padded
> that appropriately, that's the easiest way (given the rest of the buffer
> manager code) to get close to ensuring there aren't any backend writes.
> Because you've got the OS buffering writes anyway in most cases, it's hard
> to pin down whether that actually improved worst-case latency though. And
> moving in that direction always seems to reduce average throughput even in
> write-heavy benchmarks.

Ok.

> The important thing to remember is that the underlying OS has its own read
> and write caching mechanisms here, and unless the PostgreSQL ones are
> measurably better than those you might as well let the OS manage the
> problem instead.

The problem, though, is that while the OS may be good at the common
cases it is designed for, it can have specific behaviors that are
directly counter-productive when your goals do not line up with the
commonly designed-for use case (in particular, if you care a lot
about latency and not necessarily about absolute maximum throughput).

For example, in Linux up until recently, if not still, there was a
1024-buffer per-inode limit on how many buffers get written out as a
result of expiry. This means that when PostgreSQL does its fsync(),
you may end up with a lot more to write out than would have been the
case if the expiry interval (dirty_expire_centisecs) had actually
been enforced, regardless of whether PostgreSQL was tuned to write
dirty pages out sufficiently aggressively. If the amount built up
exceeds the capacity of the RAID controller cache...
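
For reference, the writeback knobs in question live under
/proc/sys/vm; here is a trivial sketch (assuming a Linux /proc
filesystem) that dumps the ones relevant here:

#!/usr/bin/env python
# Print the Linux VM writeback settings discussed above. The paths
# are standard /proc/sys/vm entries; availability varies by kernel.
for knob in ("dirty_expire_centisecs",     # age before dirty pages expire
             "dirty_writeback_centisecs",  # pdflush wakeup interval
             "dirty_background_ratio",     # % dirty: background writeout
             "dirty_ratio"):               # % dirty: writers throttled
    try:
        f = open("/proc/sys/vm/" + knob)
        print("%s = %s" % (knob, f.read().strip()))
        f.close()
    except IOError:
        print("%s: not present on this kernel" % knob)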

I had a case where I suspect this was exacerbating the situation.
Manually doing a 'sync' on the system every few seconds noticeably
helped (the theory being that it forced page write-outs to happen
earlier and in smaller storms).
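
The workaround amounted to something like the following sketch (a
plain shell loop around 'sync' does the same; the three-second
interval is just an example):

#!/usr/bin/env python
# Crude workaround: force a global sync every few seconds so dirty
# pages are pushed out early and in smaller batches, instead of
# piling up until PostgreSQL's fsync() hits them all at once.
import subprocess
import time

INTERVAL = 3  # seconds between syncs; tune to taste

while True:
    subprocess.call(["sync"])  # flush all dirty OS buffers to disk
    time.sleep(INTERVAL)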

> It's easy to demonstrate that's happening when you give
> a decent amount of memory to shared_buffers, it's much harder to prove
> that's the case for an improved write scheduling algorithm. Stepping back
> a bit, you might even consider that one reason PostgreSQL has grown as
> well as it has in scalability is exactly because it's been riding
> improvements in the underlying OS in many of these cases, rather than trying
> to do all the I/O scheduling itself.

Sure. In this case with the backend writes, I am more interested in
understanding better what is happening, and in having better
indications of when backends block on I/O, than in necessarily having
a proven improvement in throughput or latency. It makes it easier to
reason about what is happening when you *do* have a measured
performance problem.

Thanks for all the insightful information.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter(dot)schuller(at)infidyne(dot)com>'
Key retrieval: Send an E-Mail to getpgpkey(at)scode(dot)org
E-Mail: peter(dot)schuller(at)infidyne(dot)com Web: http://www.scode.org
