Re: Trouble with hashagg spill I/O pattern and costing

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Trouble with hashagg spill I/O pattern and costing
Date: 2020-05-21 19:13:18
Message-ID: 20200521191318.gw3xjivrxracvzmg@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, May 19, 2020 at 09:15:40PM -0700, Jeff Davis wrote:
>On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:
>>
>> And if there a way to pre-allocate larger chunks? Presumably we could
>> assign the blocks to tape in larger chunks (e.g. 128kB, i.e. 16 x
>> 8kB)
>> instead of just single block. I haven't seen anything like that in
>> tape.c, though ...
>
>It turned out to be simple (at least a POC) so I threw together a
>patch. I just added a 32-element array of block numbers to each tape.
>When we need a new block, we retrieve a block number from that array;
>or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times.
>

I think the PoC patch goes in the right direction. I have two ideas how
to improve it a bit:

1) Instead of assigning the pages one by one, we can easily extend the
API to allow getting a range of blocks, so that we don't need to call
ltsGetFreeBlock in a loop. Instead we could call ltsGetFreeBlockRange
with the requested number of blocks. And we could keep just a min/max
of free blocks, not an array with fixed number of elements.

2) We could make it self-tuning, by increasing the number of blocks
we pre-allocate. So every time we exhaust the range, we double the
number of blocks (with a reasonable maximum, like 1024 or so). Or we
might just increment it by 32, or something.

IIUC the danger of pre-allocating blocks is that we might not fill them,
resulting in temp file much larger than necessary. It might be harmless
on some (most?) current filesystems that don't actually allocate space
for blocks that are never written, but it also confuses our accounting
of temporary file sizes. So we should try to limit that, and growing the
number of pre-allocated blocks over time seems reasonable.

Both (1) and (2) seem fairly simple, not much more complex than the
current PoC patch.

I also wonder if we could collect / report useful statistics about I/O
on the temporary file, not just the size. I mean, how many pages we've
written/read, how sequential it was, etc. But some of that is probably
only visible at the OS level (e.g. we have no insignt into how the
kernel combines writes in page cache, etc.). This is clearly matter for
v14, though.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2020-05-21 19:17:39 Re: Trouble with hashagg spill I/O pattern and costing
Previous Message Jeff Davis 2020-05-21 19:04:19 Re: Trouble with hashagg spill I/O pattern and costing