From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)heroku(dot)com>
Subject: Re: Tuplesort merge pre-reading
Date: 2017-04-14 05:19:58
Message-ID: CAH2-WznrO1XQ5F3Mb+mWyrE_aY5DJWOFh=ePbw1BVi1=JoG9sQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 13, 2017 at 9:51 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I'm fairly sure that the point was exactly what it said, ie improve
> locality of access within the temp file by sequentially reading as many
> tuples in a row as we could, rather than grabbing one here and one there.
>
> It may be that the work you and Peter G. have been doing has rendered
> that question moot. But I'm a bit worried that the reason you're not
> seeing any effect is that you're only testing situations with zero seek
> penalty (ie your laptop's disk is an SSD). Back then I would certainly
> have been testing with temp files on spinning rust, and I fear that this
> may still be an issue in that sort of environment.

I actually think Heikki's work here would particularly help on
spinning rust, especially when less memory is available. He
specifically justified it on the grounds that it produces a more
sequential read pattern, particularly when multiple merge passes are
required.
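
To illustrate the idea, here is a minimal sketch in C -- hypothetical
names throughout (TapeReader, tape_refill, tape_next), not the actual
tuplesort.c/logtape.c interface. Each merge input keeps one large
pre-read buffer that is refilled with a single big sequential read, so
the temp file is touched in long contiguous chunks even though the
merge consumes tuples from many tapes in interleaved order:

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>
    #include <unistd.h>

    typedef struct TapeReader
    {
        int     fd;         /* temp file fd holding this tape's run(s) */
        char   *buf;        /* large pre-read buffer */
        size_t  bufsize;    /* capacity of buf */
        size_t  len;        /* bytes currently valid in buf */
        size_t  pos;        /* current read position within buf */
    } TapeReader;

    static bool
    tape_refill(TapeReader *t)
    {
        size_t  leftover = t->len - t->pos;
        ssize_t nread;

        /* Preserve any partial tuple, then top up with one big read. */
        memmove(t->buf, t->buf + t->pos, leftover);
        nread = read(t->fd, t->buf + leftover, t->bufsize - leftover);
        if (nread < 0)
            return false;       /* read error */
        t->len = leftover + (size_t) nread;
        t->pos = 0;
        return t->len > 0;
    }

    /* Copy the next fixed-size tuple into dst, refilling as needed. */
    static bool
    tape_next(TapeReader *t, char *dst, size_t tuplen)
    {
        if (t->pos + tuplen > t->len)
        {
            if (!tape_refill(t) || t->len < tuplen)
                return false;   /* tape exhausted */
        }
        memcpy(dst, t->buf + t->pos, tuplen);
        t->pos += tuplen;
        return true;
    }
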
> The larger picture to be drawn from that thread is that we were seeing
> very different performance characteristics on different platforms.
> The specific issue that Tatsuo-san reported seemed like it might be
> down to weird read-ahead behavior in a 90s-vintage Linux kernel ...
> but the point that this stuff can be environment-dependent is still
> something to take to heart.

BTW, I'm skeptical of Heikki's idea of killing polyphase merge itself
at this point. I think that keeping most tapes active in each pass is
useful now that our memory accounting hands an even share to each
maybe-active tape for every merge pass, something established by
Heikki's work on external sorting.
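
Concretely, that accounting amounts to something like the following
(a sketch with a hypothetical name, not the actual tuplesort.c code):

    #include <stddef.h>

    /*
     * Split the memory left over for buffering evenly among the tapes
     * that might be active in the upcoming merge pass, so that every
     * maybe-active input gets the same sized pre-read buffer.  E.g.
     * 64MB across 7 maybe-active tapes leaves a bit over 9MB per tape.
     */
    static size_t
    per_tape_buffer_size(size_t avail_mem, int maybe_active_tapes)
    {
        return avail_mem / (size_t) maybe_active_tapes;
    }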

Interestingly enough, I think that Knuth was pretty much spot on with
his "sweet spot" of 7 tapes, even on modern hardware. Commit df700e6
(where the sweet spot of merge order 7 was no longer always used) was
effective because it masked certain overheads that we experience when
doing multiple passes, overheads that Heikki and I have since mostly
removed. This was confirmed by Robert's testing of my merge order cap
work for commit fc19c18, where he found that using 7 tapes was only
slightly worse than using many hundreds of tapes. If we could somehow
make access to logical tapes perfectly sequential, then 7 tapes would
probably be noticeably *faster*, due to CPU caching effects.

Knuth was completely correct to say that it basically makes no
difference once more than 7 tapes are used for a merge, because he
didn't have logtape.c fragmentation to worry about.
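
A back-of-the-envelope way to see the diminishing returns (this is
balanced-merge arithmetic; polyphase distributes runs differently,
but the curve has the same shape):

    #include <math.h>
    #include <stdio.h>

    /*
     * passes = ceil(log_M(nruns)) for an M-way balanced merge.  Going
     * from 2 to 7 tapes saves many passes; going from 7 to hundreds
     * saves very few more.
     */
    int
    main(void)
    {
        const double nruns = 1000.0;    /* initial sorted runs */
        const int    orders[] = {2, 7, 100, 1000};

        for (int i = 0; i < 4; i++)
        {
            int passes = (int) ceil(log(nruns) / log((double) orders[i]));

            printf("merge order %4d -> %d passes\n", orders[i], passes);
        }
        /* Prints: 2 -> 10, 7 -> 4, 100 -> 2, 1000 -> 1 passes. */
        return 0;
    }
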
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/