From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Trouble with hashagg spill I/O pattern and costing
Date: 2020-05-19 17:53:20
Message-ID: 20200519175320.tl4qm6le5dbg2pcf@development
Lists: pgsql-hackers
On Tue, May 19, 2020 at 09:27:34AM -0700, Jeff Davis wrote:
>On Tue, 2020-05-19 at 17:12 +0200, Tomas Vondra wrote:
>> I think there are two related problems - with costing and with
>> excessive I/O due to using logical tapes.
>
>Thank you for the detailed analysis. I am still digesting this
>information.
>
>> This kinda makes me question whether logical tapes are the right tool
>> for hashagg. I've read the explanation in logtape.c why it's about
>> the
>> same amount of I/O as using separate files, but IMO that only really
>works for I/O patterns similar to merge sort - the more I think about
>> this, the more I'm convinced we should just do what hashjoin is
>> doing.
>
>Fundamentally, sort writes sequentially and reads randomly; while
>HashAgg writes randomly and reads sequentially.
>
Not sure. I think the charts and stats of iosnoop data show that an
awful lot of the reads during sort are actually pretty sequential.
Moreover, sort manages to read the data in much larger blocks - 128kB
instead of just 8kB (which is what hashagg seems to be doing).
I wonder why that is, and whether we could achieve that for hashagg too ...
>If the random writes of HashAgg end up fragmented too much on disk,
>then clearly the sequential reads are not so sequential anyway. The
>only way to avoid fragmentation on disk is to preallocate for the
>tape/file.
>
Is there a way to pre-allocate larger chunks? Presumably we could assign
blocks to a tape in larger chunks (e.g. 128kB, i.e. 16 x 8kB) instead of
a single block at a time. I haven't seen anything like that in
logtape.c, though ...
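For illustration, a minimal sketch of what that chunked assignment might
look like - this is hypothetical code, not actual logtape.c internals,
and the struct/function names are made up:

```c
#include <assert.h>

#define CHUNK_BLOCKS 16         /* 16 x 8kB = 128kB per reservation */

/* Hypothetical per-tape allocator state. */
typedef struct TapeAlloc
{
    long next_block;            /* next block number to hand out */
    int  blocks_left;           /* blocks remaining in current chunk */
} TapeAlloc;

/* Shared counter of blocks handed out from the underlying temp file. */
static long file_next_block = 0;

/*
 * Return the next block number for a tape.  Instead of taking single
 * blocks from the shared file (which interleaves the tapes' data),
 * reserve a run of CHUNK_BLOCKS consecutive blocks whenever the tape's
 * current chunk is exhausted, so each tape's data stays contiguous in
 * 128kB runs.
 */
static long
tape_get_block(TapeAlloc *tape)
{
    if (tape->blocks_left == 0)
    {
        tape->next_block = file_next_block;
        file_next_block += CHUNK_BLOCKS;
        tape->blocks_left = CHUNK_BLOCKS;
    }
    tape->blocks_left--;
    return tape->next_block++;
}
```

Even with two tapes requesting blocks alternately, each tape gets
contiguous 16-block runs rather than interleaving block by block.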
>BufFile (relying more on the OS) would probably do a better job of
>preallocating the disk space in a useful way; whereas logtape.c makes
>it easier to manage buffers and the overall number of files created
>(thereby allowing higher fanout of partitions).
>
>We have a number of possibilities here:
>
>1. Improve costing to reflect that HashAgg is creating more random IOs
>than Sort.
I think we'll need to do something about this, but I think we should try
improving the behavior first and then model the costing based on that.
>2. Reduce the partition fanout in the hopes that the OS does a better
>job with readahead.
I doubt this will make a significant difference. I think the problem is
the partitions end up interleaved way too much in the temp file, and I
don't see how a lower fanout would fix that.
BTW what do you mean when you say "fanout"? Do you mean how fast we
increase the number of partitions, or some parameter in particular?
>3. Switch back to BufFile, in which case we probably need to reduce the
>fanout for other reasons.
Maybe, although that seems pretty invasive post beta1.
>4. Change logtape.c to allow preallocation or to write in larger
>blocks.
I think this is what I suggested above (allocating 16 blocks at a time,
or something). I wonder how wasteful this would be, but I think not very
much. Essentially, with 1024 partitions and pre-allocating space in
128kB chunks, up to 128MB may end up unused, which seems ok-ish, and I
guess we could further restrict that by starting with a smaller chunk
size and gradually increasing it. Or something like that ...
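A rough sketch of the "start small, grow gradually" variant (again
hypothetical, not actual PostgreSQL code): each tape's reservation
starts at a single block and doubles up to the 128kB cap, so tapes that
stay small waste at most a few blocks:

```c
#include <assert.h>

#define MAX_CHUNK_BLOCKS 16     /* cap at 16 x 8kB = 128kB */

/* Hypothetical allocator with a per-tape growing reservation size. */
typedef struct GrowingTapeAlloc
{
    long next_block;            /* next block number to hand out */
    int  blocks_left;           /* blocks remaining in current chunk */
    int  chunk_size;            /* doubles on each new reservation */
} GrowingTapeAlloc;

/* Shared counter of blocks handed out from the underlying temp file. */
static long grow_file_next_block = 0;

static long
growing_tape_get_block(GrowingTapeAlloc *tape)
{
    if (tape->blocks_left == 0)
    {
        /* start with 1 block, double up to the 128kB cap */
        if (tape->chunk_size == 0)
            tape->chunk_size = 1;
        else if (tape->chunk_size * 2 <= MAX_CHUNK_BLOCKS)
            tape->chunk_size *= 2;

        tape->next_block = grow_file_next_block;
        grow_file_next_block += tape->chunk_size;
        tape->blocks_left = tape->chunk_size;
    }
    tape->blocks_left--;
    return tape->next_block++;
}
```

A tape that only ever writes one block wastes nothing, and the worst
case per tape is bounded by the final chunk size.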
>5. Change BufFile to allow more control over buffer usage, and switch
>to that.
>
Maybe. I don't recall what exactly the issue with buffer usage is, but I
think it has the same invasiveness issue as (3). OTOH it's what hashjoin
does, and we've lived with it for ages ...
>#1 or #2 are the least invasive, and I think we can get a satisfactory
>solution by combining those.
>
OK. I think tweaking the costing (and essentially reverting to what 12
does for those queries) is perfectly reasonable. But if we can actually
get some speedup thanks to hashagg, even better.
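Just to put a number on the costing gap, using the default cost GUCs
(seq_page_cost = 1.0, random_page_cost = 4.0) - the helper here is a
made-up illustration, not the actual cost_agg() logic:

```c
#include <assert.h>

/* PostgreSQL's default page-cost settings. */
static const double seq_page_cost = 1.0;
static const double random_page_cost = 4.0;

/*
 * Illustrative I/O cost term for spilling npages to disk and reading
 * them back (one write pass plus one read pass), charged at a given
 * per-page rate.
 */
static double
spill_io_cost(double npages, double page_cost)
{
    return npages * page_cost * 2.0;
}
```

Charging the spill at the random rate makes the I/O term 4x more
expensive than the sequential rate - the kind of difference that can
flip the plan choice back to sort for these queries.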
>I saw good results with the high fanout and low work_mem when there is
>still a lot of system memory. That's a nice benefit, but perhaps it's
>safer to use a lower fanout (which will lead to recursion) until we get
>a better handle on the IO patterns.
>
I don't know how much we can rely on that - once we push some of the
data out of the page cache, we run into the issues I described. The
trouble is people may not have enough memory to keep everything in
cache, otherwise they might just as well bump up work_mem and not spill
at all.
>Perhaps you can try recompiling with a lower max partitions and rerun
>the query? How much would we have to lower it for either the cost to
>approach reality or the OS readahead to become effective?
>
I can try that, of course. Which parameters should I tweak / how?
I can also try running it with BufFile, in case you prepare a WIP patch.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services