Re: avoiding file system caching of a table

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Gabriel Sánchez Martínez <gabrielesanchez(at)gmail(dot)com>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: avoiding file system caching of a table
Date: 2014-03-12 22:18:24
Message-ID: CAMkU=1y2hkv3oJ4B16fvYRn4sjZB3YjajPFE1UgoP-NyrQSJ3w@mail.gmail.com
Lists: pgsql-general

On Mon, Feb 17, 2014 at 7:30 PM, Gabriel Sánchez Martínez <
gabrielesanchez(at)gmail(dot)com> wrote:

>
> On 02/17/2014 08:45 PM, Jeff Janes wrote:
>
> On Mon, Feb 17, 2014 at 5:09 PM, "Gabriel E. Sánchez Martínez" <
> gabrielesanchez(at)gmail(dot)com> wrote:
>
>>
>>
>> I have read forum postings saying that the sync option affects writes,
>> and will not prevent reads from caching. In one forum posting I came
>> across nocache, a utility for Linux. It is used by typing "nocache
>> <command>" in a shell. But I can't do that with a postgres process when a
>> connection opens, because postgres is the one launching the process.
>
>
> You would have to start the entire service with that utility, then.
> Which again would defeat the purpose.
>
>
> Since a process is launched every time a session opens, e.g. a query
> window in pgAdmin, I thought it would be possible to do it per session
> rather than for the whole service. Either way, I agree this wouldn't solve
> the problem.
>

Perhaps there is a way to do that, but I haven't been able to find much
information on the "nocache" utility myself. In any case, kernel
hackers rather than PostgreSQL hackers would probably be a better source of
information on that.

>
>> Does someone know a work-around, or a different solution to the problem?
>> Shouldn't PostgreSQL be smart about this, and, based on the statistics
>> collected for a table and on the query plan, know the harm that will be done
>> if all of a very large table's pages are read, and flush the cache?
>
>
> PostgreSQL does know this. It has a special ring "buffer access
> strategy" that it uses to prevent a large sequential scan from pushing all
> of the other data out of its shared_buffers. It sounds like it is the
> kernel which is failing to employ similar logic on the file cache which the
> *kernel* manages.
>
>
> I have no idea how the kernel manages its cache, but I think that since
> individual pages that store the data of the table are small, the kernel has
> no way of knowing that a process will read a very large number of
> relatively small files that collectively will cause harm. Maybe if it were
> a single file larger than total physical RAM it would act differently. But
> I am just speculating.
>

I don't think the kernel cares much about file boundaries. The way it
usually works is that one part of the cache is set aside for pages that
were accessed once recently, and another part for pages that have been
accessed multiple times.

>
>
> The kernel does also have some logic to prevent this, but it may or may
> not be very effective in your case (you haven't told us what version of the
> kernel you are using).
>
>
> Thanks for asking. 3.8.0-35-generic. I'm curious. What does it do? Or
> do you know where I can read about this (just out of curiosity).
>

I was hoping someone with better knowledge of the details would respond,
but since they haven't... I think 3.8.0 is in the range that uses half of
the cache memory for pages read once recently, and half for
frequently-read pages, but allows the frequently-read half to steal from the
other half if it would otherwise go unused.

Unfortunately I don't know of a good way to figure out when the various
features were added to which version of the kernel. It is very frustrating.
Also, I think various distributions might backport certain patches out of
sequence, so the kernel version number alone doesn't even determine which
features are present. But I could be spreading misinformation.

>
> In fact, one kernel effort to fix this problem for cases like
> yours ended up making it worse under other conditions, i.e. when the file
> being read sequentially was smaller than available RAM but larger than 1/2
> of available RAM.
>
> You could try increasing shared_buffers (you haven't told us what it is
> set to now) until it takes up a big chunk of RAM, so that PostgreSQL
> manages more of the cache and the kernel manages less of it. Setting it
> like that has been reported to cause problems on write-heavy workloads,
> but I haven't heard of problems on read-mostly workloads.
>
>
> This server is read-mostly. It has 64 GB of RAM, a single 6-core i7
> processor, and four SATA hard drives on software RAID 10. I get about 400
> MB/s of sequential reads on simple benchmarks. Shared buffers is set to 16
> GB,
>

If the problem is that the kernel is mishandling its cache, you could
increase shared_buffers to 52GB and see if that helps. Hopefully you have a
testing environment you can use to evaluate it before changing it in
production.
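
As a rough sketch of that change (the 52GB figure is just an illustration,
not a recommendation), keeping in mind that shared_buffers can only be set
in postgresql.conf and needs a full server restart to take effect:

    -- check the current setting first
    SHOW shared_buffers;    -- "16GB" in your case

    -- then in postgresql.conf (restart the server afterwards):
    -- shared_buffers = 52GB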

> temp buffers to 8 MB, work mem to 100 MB, and maintenance work mem to 100
> MB. I could probably tweak this but I know very little about it. Do you
> think I should set any of these higher? Other things run on the server,
> but most of the usage is PostgreSQL queries on tables of several hundred
> GB... some of which need to process whole tables.
>
> So other than the possibility of tweaking shared_buffers, the only other
> solution is getting a server with TBs of RAM?
>

Just doubling the RAM might do the job. For example, say the
frequently-used data takes up 3/4 of the kernel's available page cache.
Normally this is fine, because it is all held in the cache. But then the
large single-pass query kicks in, and it starts evicting the
frequently-used data down to its "fair share" of 1/2 of the page cache.
Going from 3/4 of the cache down to 1/2 evicts one third of the
frequently-used data, so now 1/3 of the accesses to it need to go to disk
and your performance drops. So you don't need to add enough RAM to hold the
single-pass data, you just need enough to hold ~2 times the
frequently-used data.
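
To put hypothetical numbers on that, assuming your 64 GB box leaves roughly
48 GB to the kernel's page cache after shared_buffers and everything else:

    frequently-used data:  3/4 * 48 GB = 36 GB  (all cached, all hits)
    during the big scan:   1/2 * 48 GB = 24 GB  of it stays cached
    evicted:               36 - 24 GB  = 12 GB  = 1/3 of the hot data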

The first thing I'd try to figure out is whether your frequently-used data,
in the absence of the large queries, resides entirely in shared_buffers, or
whether it is split between shared_buffers and the kernel cache. If you can,
disable the large queries for a while, reset the statistics, let the small
queries run for a while, and then look in pg_stat_database to figure that
out, particularly blks_read and blks_hit; see the sketch below.
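
A minimal sketch of that check (assuming you can run it as a superuser, and
that nothing else resets the statistics in the meantime):

    -- reset the counters for the current database
    SELECT pg_stat_reset();

    -- ...let the small queries run for a while, then:
    -- blks_hit counts reads satisfied from shared_buffers; blks_read
    -- counts reads that had to go to the kernel (page cache or disk).
    SELECT datname, blks_read, blks_hit,
           round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4)
             AS shared_buffers_hit_ratio
    FROM pg_stat_database
    WHERE datname = current_database();

If blks_read stays near zero, the hot data is all in shared_buffers; a
large blks_read combined with fast queries suggests the kernel cache is
doing much of the work.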

Cheers,

Jeff
