From: Jeff Layton <jlayton(at)redhat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dave Chinner <david(at)fromorbit(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Joshua Drake <jd(at)commandprompt(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-17 12:34:46
Message-ID: 20140117073446.7e0de941@tlielax.poochiereds.net
Lists: pgsql-hackers
On Thu, 16 Jan 2014 20:48:24 -0500
Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Jan 16, 2014 at 7:31 PM, Dave Chinner <david(at)fromorbit(dot)com> wrote:
> > But there's something here that I'm not getting - you're talking
> > about a data set that you want to keep cache resident that is at
> > least an order of magnitude larger than the cyclic 5-15 minute WAL
> > dataset that ongoing operations need to manage to avoid IO storms.
> > Where do these temporary files fit into this picture, how fast do
> > they grow and why do they need to be so large in comparison to
> > the ongoing modifications being made to the database?
>
> I'm not sure you've got that quite right. WAL is fsync'd very
> frequently - on every commit, at the very least, and multiple times
> per second even when there are no commits going on, just to make sure we get
> it all down to the platter as fast as possible. The thing that causes
> the I/O storm is the data file writes, which are performed either when
> we need to free up space in PostgreSQL's internal buffer pool (aka
> shared_buffers) or once per checkpoint interval (5-60 minutes) in any
> event. The point of this system is that if we crash, we're going to
> need to replay all of the WAL to recover the data files to the proper
> state; but we don't want to keep WAL around forever, so we checkpoint
> periodically. By writing all the data back to the underlying data
> files, checkpoints render older WAL segments irrelevant, at which
> point we can recycle those files before the disk fills up.
>
So this says to me that the WAL is a place where DIO should really be
reconsidered. It's mostly sequential writes that need to hit the disk
ASAP, and you need to know that they have hit the disk before you can
proceed with other operations.
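To make that concrete, here's a minimal sketch of the kind of open/write
path I mean (this is not PostgreSQL code, and the 4 KiB block size is
just an assumption): O_DIRECT keeps the data out of the pagecache, and
O_DSYNC means the write doesn't return until the record is on stable
storage, so the caller knows it can proceed.

/*
 * Sketch only -- not PostgreSQL code.  Write one padded, WAL-style
 * record with direct, synchronous I/O.  O_DIRECT requires the buffer,
 * length and file offset to be aligned to the device block size;
 * 4096 here is an assumption.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 4096

int wal_write_direct(const char *path, const void *rec, size_t len, off_t offset)
{
    int fd = open(path, O_WRONLY | O_DIRECT | O_DSYNC);
    if (fd < 0)
        return -1;

    /* Pad the record out to a full block in a suitably aligned buffer. */
    size_t padded = ((len + BLKSZ - 1) / BLKSZ) * BLKSZ;
    void *buf;
    if (posix_memalign(&buf, BLKSZ, padded) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, 0, padded);
    memcpy(buf, rec, len);

    /* Durable when this returns, thanks to O_DSYNC. */
    ssize_t n = pwrite(fd, buf, padded, offset);

    free(buf);
    close(fd);
    return (n == (ssize_t) padded) ? 0 : -1;
}

The cost, of course, is that the alignment and padding rules fall on the
application instead of the pagecache.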
Also, is the WAL ever actually read under normal (non-recovery)
conditions, or is it effectively write-only? If it's seldom read, then
using DIO for those files also avoids some double buffering, since the
data wouldn't go through the pagecache.
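For comparison, and purely as a hedged sketch of the double-buffering
point rather than anything I'm claiming either project does today,
buffered I/O can approximate the same "write it, sync it, don't keep
caching it" behaviour with posix_fadvise():

/*
 * Sketch only.  Buffered write of a record, forced to disk with
 * fdatasync(), then dropped from the pagecache with POSIX_FADV_DONTNEED
 * since nothing will read it back under normal operation.  The advice
 * is best effort; the kernel is free to ignore it.
 */
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

int wal_write_buffered(int fd, const void *rec, size_t len, off_t offset)
{
    if (pwrite(fd, rec, len, offset) != (ssize_t) len)
        return -1;
    if (fdatasync(fd) != 0)     /* a commit can't proceed until this succeeds */
        return -1;
    (void) posix_fadvise(fd, offset, (off_t) len, POSIX_FADV_DONTNEED);
    return 0;
}

Either way, write-only WAL pages stop competing for pagecache with the
data set you actually want to keep resident.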
Again, I think this discussion would really benefit from an outline of
the different files used by pgsql, and what sort of data access
patterns you expect with them.
> Temp files are something else again. If PostgreSQL needs to sort a
> small amount of data, like a kilobyte, it'll use quicksort. But if it
> needs to sort a large amount of data, like a terabyte, it'll use a
> merge sort.[1] The reason is of course that quicksort requires random
> access to work well; if parts of quicksort's working memory get paged
> out during the sort, your life sucks. Merge sort (or at least our
> implementation of it) is slower overall, but it only accesses the data
> sequentially. When we do a merge sort, we use files to simulate the
> tapes that Knuth had in mind when he wrote down the algorithm. If the
> OS runs short of memory - because the sort is really big or just
> because of other memory pressure - it can page out the parts of the
> file we're not actively using without totally destroying performance.
> It'll be slow, of course, because disks always are, but not like
> quicksort would be if it started swapping.
>
> I haven't actually experienced (or heard mentioned) the problem Jeff
> Janes is mentioning where temp files get written out to disk too
> aggressively; as mentioned before, the problems I've seen are usually
> the other way - stuff not getting written out aggressively enough.
> But it sounds plausible. The OS only lets you set one policy, and if
> you make that policy right for permanent data files that get
> checkpointed, it could well be wrong for temp files that get thrown
> out. Just stuffing the data on RAMFS will work for some
> installations, but might not be good if you actually do want to
> perform sorts whose size exceeds RAM.
>
> BTW, I haven't heard anyone on pgsql-hackers say they'd be interested
> in attending Collab on behalf of the PostgreSQL community. Although
> the prospect of a cross-country flight is a somewhat depressing
> thought, it does sound pretty cool, so I'm potentially interested. I
> have no idea what the procedure is here for moving forward though,
> especially since it sounds like there might be only one seat available
> and I don't know who else may wish to sit in it.
>
--
Jeff Layton <jlayton(at)redhat(dot)com>