Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Mel Gorman <mgorman(at)suse(dot)de>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-17 15:37:55
Message-ID: 20140117153755.GZ4963@suse.de
Lists: pgsql-hackers

On Thu, Jan 16, 2014 at 04:30:59PM -0800, Jeff Janes wrote:
> On Wed, Jan 15, 2014 at 2:08 AM, Mel Gorman <mgorman(at)suse(dot)de> wrote:
>
> > On Tue, Jan 14, 2014 at 09:30:19AM -0800, Jeff Janes wrote:
> > > >
> > > > That could be something we look at. There are cases buried deep in the
> > > > VM where pages get shuffled to the end of the LRU and get tagged for
> > > > reclaim as soon as possible. Maybe you need access to something like
> > > > that via posix_fadvise to say "reclaim this page if you need memory but
> > > > leave it resident if there is no memory pressure" or something similar.
> > > > Not exactly sure what that interface would look like or offhand how it
> > > > could be reliably implemented.
> > > >
> > >
> > > I think the "reclaim this page if you need memory but leave it
> > > resident if there is no memory pressure" hint would be more useful for
> > > temporary working files than for what was being discussed above (shared
> > > buffers). When I do work that needs large temporary files, I often see
> > > physical write IO spike but physical read IO does not. I interpret that
> > > to mean that the temporary data is being written to disk to satisfy
> > > either dirty_expire_centisecs or dirty_*bytes, but the data remains in
> > > the FS cache and so disk reads are not needed to satisfy it. So a hint
> > > that says "this file will never be fsynced so please ignore
> > > dirty_*bytes and dirty_expire_centisecs".
> >
> > It would be good to know if dirty_expire_centisecs or dirty ratio|bytes
> > were the problem here.
>
>
> Is there an easy way to tell? I would guess it has to be at least
> dirty_expire_centisecs, if not both, as a very large sort operation takes a
> lot more than 30 seconds to complete.
>

There is not an easy way to tell. To be 100% certain it would require an
instrumentation patch or a systemtap script to detect when a particular page
is being written back and track the context. There are approximations though.
Monitor nr_dirty pages over time. If at the time of the stall there are fewer
dirty pages than allowed by dirty_ratio, then dirty_expire_centisecs kicked
in. Alternatively, monitor the process for stalls; when it stalls, check
/proc/PID/stack and see whether it is stuck in balance_dirty_pages or
something similar, which would indicate the process hit dirty_ratio.
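
If it helps, here is a rough sketch of that polling approach in Python. It
assumes a reasonably recent kernel that exports nr_dirty_threshold in
/proc/vmstat, and root access for /proc/PID/stack:

#!/usr/bin/env python3
# Rough sketch of the approximation described above: watch nr_dirty against
# the kernel's exported dirty threshold and look for balance_dirty_pages in
# the stalled process's kernel stack. Assumes a reasonably recent kernel that
# exports nr_dirty_threshold in /proc/vmstat, and root for /proc/PID/stack.
import sys
import time

def vmstat():
    fields = {}
    with open("/proc/vmstat") as f:
        for line in f:
            name, value = line.split()
            fields[name] = int(value)
    return fields

def kernel_stack(pid):
    try:
        with open("/proc/%d/stack" % pid) as f:
            return f.read()
    except OSError:
        return ""

def main(pid):
    while True:
        vs = vmstat()
        stalled = "balance_dirty_pages" in kernel_stack(pid)
        print("nr_dirty=%d nr_dirty_threshold=%d %s" %
              (vs["nr_dirty"], vs.get("nr_dirty_threshold", 0),
               "STALLED in balance_dirty_pages (hit dirty_ratio)" if stalled
               else "not throttled"))
        # Writeback happening while nr_dirty stays well below the threshold
        # and the process is never throttled points at dirty_expire_centisecs
        # rather than dirty_ratio/dirty_bytes.
        time.sleep(1)

if __name__ == "__main__":
    main(int(sys.argv[1]))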

> > An interface that forces a dirty page to stay dirty
> > regardless of the global system would be a major hazard. It potentially
> > allows the creator of the temporary file to stall all other processes
> > dirtying pages for an unbounded period of time.
>
> Are the dirty ratio/bytes limits the mechanisms by which adequate clean
> memory is maintained?

Yes, for file-backed pages.
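
Roughly, the knobs translate into byte limits as in the sketch below. Note
the kernel actually sizes the real limits against "dirtyable" memory (free
plus reclaimable file pages) rather than MemTotal, so these numbers overstate
it a little:

#!/usr/bin/env python3
# Ballpark view of the knobs in question. The kernel sizes the real limits
# against "dirtyable" memory, not MemTotal, so this overstates them somewhat
# -- a sketch, not the exact kernel arithmetic.

def vm_sysctl(name):
    with open("/proc/sys/vm/" + name) as f:
        return int(f.read())

def meminfo_bytes(field):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1]) * 1024

total = meminfo_bytes("MemTotal")
hard = vm_sysctl("dirty_bytes") or total * vm_sysctl("dirty_ratio") // 100
background = (vm_sysctl("dirty_background_bytes")
              or total * vm_sysctl("dirty_background_ratio") // 100)

print("throttle dirtiers (balance_dirty_pages) at roughly %d MB" % (hard >> 20))
print("start background writeback at roughly %d MB" % (background >> 20))
print("write back pages dirty for longer than %.1f seconds"
      % (vm_sysctl("dirty_expire_centisecs") / 100.0))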

> I thought those were there just to put a limit on how long it would take
> to execute a sync call should one be issued, and there were other settings
> which said how much clean memory to maintain. It should definitely write
> out the pages if it needs the memory for other things, just not write them
> out due to fear of how long it would take to sync them if a sync was
> called. (And if it needs the memory, it should be able to write it out
> quickly as the writes would be mostly sequential, not random--although how
> the kernel can believe me that that will always be the case could be a
> problem.)
>
>

It has been suggested on more than one occasion that a more sensible
interface would be "do not allow more dirty data than can be written back in
N seconds". The details of how to implement this are tricky and no one has
taken up the challenge yet.
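
Purely to illustrate the idea, not as a proposed implementation: a userspace
approximation could periodically measure writeback bandwidth and resize
vm.dirty_bytes to N seconds' worth of it. The sketch assumes /proc/vmstat
exports nr_written, that vm.dirty_bytes is writable (root), and it glosses
over everything that makes the real problem tricky:

#!/usr/bin/env python3
# Illustration only, not the proposed kernel interface: approximate "no more
# dirty data than N seconds of writeback" from userspace by resizing
# vm.dirty_bytes to track observed writeback bandwidth. Assumes /proc/vmstat
# exports nr_written (pages written back since boot), a 4K page size, and
# root access to write the sysctl. It also ignores per-device bandwidth and
# bursty workloads.
import time

PAGE = 4096
N_SECONDS = 5
INTERVAL = 10
FLOOR = 64 << 20            # never shrink the limit below 64MB

def written_bytes():
    with open("/proc/vmstat") as f:
        for line in f:
            if line.startswith("nr_written "):
                return int(line.split()[1]) * PAGE
    return 0

prev = written_bytes()
while True:
    time.sleep(INTERVAL)
    cur = written_bytes()
    bandwidth = (cur - prev) / INTERVAL     # bytes/second actually written back
    prev = cur
    limit = max(int(bandwidth * N_SECONDS), FLOOR)
    with open("/proc/sys/vm/dirty_bytes", "w") as f:
        f.write(str(limit))
    print("writeback ~%.1f MB/s -> dirty_bytes set to %d MB"
          % (bandwidth / 1e6, limit >> 20))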

> > I proposed in another part
> > of the thread a hint for open inodes to have the background writer thread
> > ignore dirty pages belonging to that inode. Dirty limits and fsync would
> > still be obeyed. It might also be workable for temporary files but the
> > proposal could be full of holes.
> >
>
> If calling fsync would fail with an error, would that lower the risk of DoS?
>

I do not understand the proposal. If there are pages that must remain
dirty and that the kernel cannot touch, then there is a risk that
dirty_ratio's worth of pages are all untouchable and the system livelocks
until userspace takes action.

That still leaves the possibility of flagging temp pages that should
only be written to disk if the kernel really needs to.
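
Purely as an illustration of where such a hint would slot in, and with the
caveat that the fadvise flag below is made up and does not exist in any
kernel:

#!/usr/bin/env python3
# Hypothetical only: no such fadvise flag exists and the constant below is
# made up, so os.posix_fadvise() rejects it with EINVAL. It just shows where
# a "write these dirty pages only under memory pressure" hint would slot
# into temp file handling.
import os
import tempfile

FADV_WRITE_ONLY_UNDER_PRESSURE = 1024   # imaginary advice value

def write_sort_run(data):
    fd, path = tempfile.mkstemp(prefix="pgsql_tmp_sortrun.")
    os.write(fd, data)
    try:
        # The hint in question: keep these pages dirty in cache unless reclaim
        # actually needs the memory; the file is never fsynced, so losing it
        # on a crash does not matter.
        os.posix_fadvise(fd, 0, 0, FADV_WRITE_ONLY_UNDER_PRESSURE)
    except OSError:
        pass  # expected today: the flag does not exist
    os.close(fd)
    return path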

--
Mel Gorman
SUSE Labs
