Re: shared_buffers 8GB maximum

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Vitaliy Garnashevich <vgarnashevich(at)gmail(dot)com>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: shared_buffers 8GB maximum
Date: 2018-02-17 04:05:41
Message-ID: CAMkU=1x=ja88ORTXgh9GqGr4qanicBEeGu87fF8WFL7HVnKiYQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Fri, Feb 16, 2018 at 2:36 PM, Vitaliy Garnashevich <
vgarnashevich(at)gmail(dot)com> wrote:

> Hi All,
>
> I've seen the shared_buffers 8GB maximum recommendation repeated many
> times. I have several questions in this regard.
>
> - Is this recommendation still true for recent versions of postgres? (e.g.
> wasn't it the case only for really old versions where the locks on shared
> buffers worked much less efficiently)
>

There were improvements in 9.3 around things like cleaning the buffer pool
when tables were dropped or truncated, particular when many were dropped or
truncated in the same transaction. This reduced a major penalty for very
large shared_buffers, but did not reduce it to zero. The clock-sweep
method for buffer eviction was made lockless using atomics in 9.5, but I
think that was more about concurrency than size of shared_buffers.

>
> - I'm not a huge Linux expert, but I've heard someone saying that reading
> from the filesystem cache requires a context switch. I suspect that such
> reads are slightly more expensive now after the Meltdown/Spectre patch in
> the kernel. Could that be a reason for increasing the value of
> shared_buffers?
>

Yes. I don't know the exact reason, but reading a buffer from OS cache is
quite a bit more expensive than just pinning a buffer already in the
buffer_pool, about 5 times more expensive the last time I tested it, which
was before Meltdown. (And just pinning a buffer which is already in the
cache is already pretty expensive--about 15 times as expensive as reading
the next tuple from an already-pinned buffer).

>
> - Could shared_buffers=128GB or more on a 250 GB RAM server be a
> reasonable setting? What downsides could there be?
>
>
The worst side effect I have from large shared_buffers in recent versions
(other than swapping to death when you don't have the RAM to support it) is
a pathological use case in which someone creates a table, and then in the
same transaction keeps starting COPY to insert a small number of rows and
then ending the COPY. If the COPY decides to skip wal logging for that
table (because it was created in the same transaction and so on a crash the
table will not exist anymore) then it needs to scrub the shared_buffers for
every COPY end, which is slow with large shared_buffers.

You could also worry that the OS won't have enough memory left in its own
cache with which to buffer dirty buffers and re-order or combine writes for
more efficient writing to disk. But in my experience, the kernel is
horrible at this anyway and if this is important to you it is better to let
PostgreSQL have the RAM so that it can do it.

PS. Some background. We had shared_buffers=8GB initially. In
> pg_stat_bgwriter we saw that dirty buffers were written to disk more
> frequently by backends than during checkpoints (buffers_clean >
> buffers_checkpoint, and buffers_backend > buffers_checkpoint).

I don't think that there is any reason to think that buffers_clean >
buffers_checkpoint is a problem. In fact, you could argue that it is the
way it was designed to work. Although the background writer does need to
tell the checkpointer about every file it dirties, so it can be fsynced at
the end of the checkpoint. The overhead of this was minimal in my testing.

But buffers_backend > buffers_checkpoint could be a problem, especially if
they are also much larger than buffers_clean. But the wrinkle here is that
if you do bulk inserts or bulk updates (what about vacuums?), the backends
by design write their own dirty buffers. So if you do those kinds of
things, buffers_backend being large doesn't indicate much. There was a
patch someplace a while ago to separate the counters of backend-intentional
writes from backend-no-choice writes, but it never went anywhere.

According to pg_buffercache extension, there was very small percentage of
> dirty pages in shared buffers. The percentage of pages with usagecount >= 3
> was also low. Some of our more frequently used tables and indexes are more
> than 10 GB in size. This all suggested that probably the bigger tables and
> indexes, whenever scanned, are constantly flushing pages from the shared
> buffers area. After increasing shared_buffers to 32GB, the picture started
> looking healthier. There were 1GB+ of dirty pages in shared buffers (vs
> 40-200MB before), 30-60% of pages with usagecount >= 3 (vs 10-40% before),
> buffers_checkpoint started to grow faster than buffers_clean or
> buffers_backend.

It is not clear to me that this is the best way to measure health. Did
your response time go down? Did your throughput go up?

> There is still not all frequently used data fits in shared_buffers, so
> we're considering to increase the parameter more. I wanted to have some
> idea about how big it could reasonably be.
>

I've made it 95% of machine RAM in specialized cases. Checkpoints were
extremely traumatic, but that was simply because increasing shared_buffers
allowed the throughput to go up so much that the IO subsystem couldn't cope.

For many use cases, 50% of RAM is the pessimal size. Because much of the
OS cache simply becomes a copy of the shared_buffers, cutting our effective
RAM size. So I think you want shared_buffers to be either a smallish
fraction of RAM, so that the OS cache is the primary cache and
shared_buffers just holds the hottest pages and holds the dirty pages which
can't be written without a fsync of WAL. Or a largish fraction, so that OS
cache is little but a staging area to get data into and out of
shared_buffers. Of these two extremes, I think the smallish fraction is
the safest and more general approach.

Cheers,

Jeff

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2018-02-17 05:23:25 Re: Database health check/auditing
Previous Message Tim Cross 2018-02-17 03:54:06 List policy/procedures [was Database health check/auditing]