Re: Trouble with hashagg spill I/O pattern and costing

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Trouble with hashagg spill I/O pattern and costing
Date: 2020-05-26 05:02:41
Message-ID: CA+hUKG+q3Ma+sTKn2316Ofof6u21hh0cGJHcJ-fAnLxw=LXevw@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 26, 2020 at 10:59 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On Mon, May 25, 2020 at 12:49:45PM -0700, Jeff Davis wrote:
> >Do you think the difference in IO patterns is due to a difference in
> >handling reads vs. writes in the kernel? Or do you think that 128
> >blocks is not enough to amortize the cost of a seek for that device?
>
> I don't know. I kinda imagined it was due to the workers interfering
> with each other, but that should affect the sort the same way, right?
> I don't have any data to support this, at the moment - I can repeat
> the iosnoop tests and analyze the data, of course.

About the reads vs writes question: I know that reading and writing
two interleaved sequential "streams" through the same fd confuses the
read-ahead/write-behind heuristics on FreeBSD UFS (I mean: w(1),
r(42), w(2), r(43), w(3), r(44), ...) so the performance is terrible
on spinning media. Andrew Gierth reported that as a problem for
sequential scans that are also writing back hint bits, and vacuum.
However, in a quick test on a Linux 4.19 XFS system, using a program
to generate interleaving read and write streams 1MB apart, I could see
that it was still happily generating larger clustered I/Os. I have no
clue about other operating systems. That said, even on Linux, reads
and writes still have to compete for scant IOPS on slow-seek media
(albeit hopefully in larger clustered I/Os)...

Jumping over large interleaving chunks with no prefetching from other
tapes *must* produce stalls though... and if you crank up the read
ahead size to be a decent percentage of the contiguous chunk size, I
guess you must also waste I/O bandwidth on unwanted data past the end
of each chunk, no?

In an off-list chat with Jeff about whether Hash Join should use
logtape.c for its partitions too, the first thought I had was that to
be competitive with separate files, perhaps you'd need to write out a
list of block ranges for each tape (rather than just next pointers on
each block), so that you have the visibility required to control
prefetching explicitly. I guess that would be a bit like the list of
physical extents that Linux commands like filefrag(8) and xfs_bmap(8)
can show you for regular files. (Other thoughts included worrying
about how to make it allocate and stream blocks in parallel queries,
...!?#$)
