From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Trouble with hashagg spill I/O pattern and costing |
Date: | 2020-05-21 19:13:18 |
Message-ID: | 20200521191318.gw3xjivrxracvzmg@development |
Lists: | pgsql-hackers |
On Tue, May 19, 2020 at 09:15:40PM -0700, Jeff Davis wrote:
>On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:
>>
>> And is there a way to pre-allocate larger chunks? Presumably we could
>> assign the blocks to tape in larger chunks (e.g. 128kB, i.e. 16 x 8kB)
>> instead of just a single block. I haven't seen anything like that in
>> tape.c, though ...
>
>It turned out to be simple (at least a POC) so I threw together a
>patch. I just added a 32-element array of block numbers to each tape.
>When we need a new block, we retrieve a block number from that array;
>or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times.
>
I think the PoC patch goes in the right direction. I have two ideas for
improving it a bit:
1) Instead of assigning the pages one by one, we can easily extend the
API to allow getting a range of blocks, so that we don't need to call
ltsGetFreeBlock in a loop. Instead we could call ltsGetFreeBlockRange
with the requested number of blocks. And we could keep just the min/max
of the free range, not an array with a fixed number of elements.

2) We could make it self-tuning, by increasing the number of blocks
we pre-allocate. So every time we exhaust the range, we double the
number of blocks (with a reasonable maximum, like 1024 or so). Or we
might just increment it by 32, or something.
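
Roughly, (1) and (2) combined might look like the sketch below. This is not code from the PoC patch: the Sketch* types and sketch_* functions are hypothetical stand-ins, the range allocator here only extends the file (the real ltsGetFreeBlock also recycles freed blocks, which a range API would have to deal with), and 32 and 1024 are simply the numbers mentioned above.

```c
#include <assert.h>

#define SKETCH_PREALLOC_MIN 32      /* initial range size, as in the PoC */
#define SKETCH_PREALLOC_MAX 1024    /* "reasonable maximum" (assumption) */

/* Hypothetical stand-in for the tape set: only tracks the end of the file. */
typedef struct SketchTapeSet
{
    long nBlocksAllocated;  /* next never-used block number */
} SketchTapeSet;

/* Hypothetical per-tape state: a [next_free, range_end) range, no array. */
typedef struct SketchTape
{
    long next_free;         /* next preallocated block to hand out */
    long range_end;         /* one past the last preallocated block */
    long prealloc_size;     /* size of the next range to request */
} SketchTape;

/* Reserve nblocks contiguous blocks at the end of the file. */
long
sketch_get_free_block_range(SketchTapeSet *lts, long nblocks)
{
    long first;

    assert(nblocks > 0);
    first = lts->nBlocksAllocated;
    lts->nBlocksAllocated += nblocks;
    return first;
}

void
sketch_tape_init(SketchTape *tape)
{
    tape->next_free = 0;
    tape->range_end = 0;    /* empty range, forces a refill on first use */
    tape->prealloc_size = SKETCH_PREALLOC_MIN;
}

/* Return the next block for this tape, refilling the range when exhausted. */
long
sketch_tape_next_block(SketchTapeSet *lts, SketchTape *tape)
{
    if (tape->next_free == tape->range_end)
    {
        tape->next_free = sketch_get_free_block_range(lts, tape->prealloc_size);
        tape->range_end = tape->next_free + tape->prealloc_size;

        /* self-tuning: double the next request, up to the cap */
        if (tape->prealloc_size * 2 <= SKETCH_PREALLOC_MAX)
            tape->prealloc_size *= 2;
    }
    return tape->next_free++;
}
```

With two tapes written alternately, the first gets blocks 0..31 and the second 32..63, and once the first tape exhausts its range it gets the 64 blocks starting at 64, so each tape's blocks stay contiguous within a range.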
IIUC the danger of pre-allocating blocks is that we might not fill them,
resulting in a temp file much larger than necessary. It might be harmless
on some (most?) current filesystems that don't actually allocate space
for blocks that are never written, but it also confuses our accounting
of temporary file sizes. So we should try to limit that, and growing the
number of pre-allocated blocks over time seems reasonable.
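
To put rough numbers on that, here is a small sketch of how much a single tape could over-allocate. It assumes 8kB blocks, a range that starts at 32 blocks and doubles up to a cap, and that the tape writes its blocks in order; the function names are made up for illustration. Under those assumptions the unused tail never exceeds what the tape has already written by more than 30 blocks, and once the cap of 1024 is reached it is at most 1023 blocks (just under 8MB) per tape.

```c
#include <assert.h>

#define SKETCH_BLCKSZ 8192  /* 8kB blocks, as assumed above */

/*
 * Blocks reserved for one tape by the time it has written `written`
 * blocks, if ranges start at `initial` blocks and double up to `max`.
 * Passing initial == max models a fixed-size range (e.g. always 32).
 */
long
sketch_blocks_allocated(long written, long initial, long max)
{
    long allocated = 0;
    long range = initial;

    assert(written >= 0 && initial > 0 && max >= initial);

    while (allocated < written)
    {
        allocated += range;         /* request one more range */
        if (range * 2 <= max)
            range *= 2;             /* next request is twice as large */
    }
    return allocated;
}

/* Bytes reserved in the temp file but not (yet) written by this tape. */
long
sketch_unused_bytes(long written, long initial, long max)
{
    return (sketch_blocks_allocated(written, initial, max) - written)
        * SKETCH_BLCKSZ;
}
```

For example, a tape that has written a single block holds 31 unused blocks (248kB), and a tape that has written 33 blocks holds 96 reserved blocks, 63 of them unused.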
Both (1) and (2) seem fairly simple, not much more complex than the
current PoC patch.
I also wonder if we could collect / report useful statistics about I/O
on the temporary file, not just the size. I mean, how many pages we've
written/read, how sequential it was, etc. But some of that is probably
only visible at the OS level (e.g. we have no insight into how the
kernel combines writes in page cache, etc.). This is clearly a matter for
v14, though.
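
As a very rough illustration of the "how sequential it was" part, something like the counters below could be kept per temporary file. This is purely a sketch: the struct and function names are hypothetical, nothing like this exists in the tree as far as this message says, and it only sees the order in which we issue writes, not what the kernel does with them.

```c
#include <assert.h>

/* Hypothetical per-file counters for write sequentiality. */
typedef struct SketchIOStats
{
    long writes;             /* total blocks written */
    long sequential_writes;  /* writes to the block right after the last one */
    long last_block;         /* block number of the previous write */
} SketchIOStats;

void
sketch_stats_init(SketchIOStats *s)
{
    s->writes = 0;
    s->sequential_writes = 0;
    s->last_block = -1;
}

/* Record one block write and note whether it directly follows the last. */
void
sketch_stats_record_write(SketchIOStats *s, long blocknum)
{
    assert(blocknum >= 0);

    if (s->writes > 0 && blocknum == s->last_block + 1)
        s->sequential_writes++;
    s->writes++;
    s->last_block = blocknum;
}

/* Fraction of write-to-write transitions that were sequential. */
double
sketch_stats_sequential_fraction(const SketchIOStats *s)
{
    if (s->writes <= 1)
        return 1.0;          /* nothing to compare against yet */
    return (double) s->sequential_writes / (double) (s->writes - 1);
}
```

Writing blocks 0, 1, 2, 10, 11 in that order gives three sequential transitions out of four. The same counters could be kept for reads.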
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services