From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Trouble with hashagg spill I/O pattern and costing
Date: 2020-05-26 05:02:41
Message-ID: CA+hUKG+q3Ma+sTKn2316Ofof6u21hh0cGJHcJ-fAnLxw=LXevw@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 26, 2020 at 10:59 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On Mon, May 25, 2020 at 12:49:45PM -0700, Jeff Davis wrote:
> >Do you think the difference in IO patterns is due to a difference in
> >handling reads vs. writes in the kernel? Or do you think that 128
> >blocks is not enough to amortize the cost of a seek for that device?
>
> I don't know. I kinda imagined it was due to the workers interfering
> with each other, but that should affect the sort the same way, right?
> I don't have any data to support this, at the moment - I can repeat
> the iosnoop tests and analyze the data, of course.

About the reads vs writes question: I know that reading and writing
two interleaved sequential "streams" through the same fd confuses the
read-ahead/write-behind heuristics on FreeBSD UFS (I mean: w(1),
r(42), w(2), r(43), w(3), r(44), ...) so the performance is terrible
on spinning media. Andrew Gierth reported that as a problem for
sequential scans that are also writing back hint bits, and vacuum.
However, in a quick test on a Linux 4.19 XFS system, using a program
to generate interleaved read and write streams 1MB apart, I could see
that the kernel was still happily issuing larger clustered I/Os. I
have no idea about other operating systems. That said, even on Linux,
reads and
writes still have to compete for scant IOPS on slow-seek media (albeit
hopefully in larger clustered I/Os)...
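
To make "interleaved streams" concrete, a minimal sketch of that kind
of test program might look like the following (not the exact program I
used; the file name and sizes are placeholders, and the file should be
pre-created and bigger than RAM or the reads will just come from the
page cache):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 8192
#define GAP (1024 * 1024)		/* read stream runs 1MB ahead of writes */
#define NBLOCKS 100000

int
main(void)
{
	char		buf[BLOCK_SIZE];
	int			fd = open("testfile", O_RDWR);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}
	memset(buf, 'x', sizeof(buf));

	for (long i = 0; i < NBLOCKS; i++)
	{
		/* w(n): sequential write stream */
		if (pwrite(fd, buf, BLOCK_SIZE, (off_t) i * BLOCK_SIZE) != BLOCK_SIZE)
		{
			perror("pwrite");
			return 1;
		}
		/* r(n + k): sequential read stream 1MB further along */
		if (pread(fd, buf, BLOCK_SIZE, (off_t) i * BLOCK_SIZE + GAP) != BLOCK_SIZE)
		{
			perror("pread");
			return 1;
		}
	}
	close(fd);
	return 0;
}

Watching the device with iosnoop/blktrace while that runs shows
whether the kernel is still merging each stream into large sequential
requests or has degenerated into small alternating ones.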

Jumping over large interleaved chunks with no prefetching from other
tapes *must* produce stalls though... and if you crank the read-ahead
size up to a decent percentage of the contiguous chunk size, I guess
you must also waste I/O bandwidth on unwanted data past the end of
each chunk, no?
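
To put a purely hypothetical number on that: with 1MB contiguous
chunks per tape and the read-ahead window cranked up to 512kB, the
tail of every chunk could drag in up to 512kB belonging to some other
tape, so in the worst case roughly a third of the read bandwidth goes
to data you didn't want yet.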

In an off-list chat with Jeff about whether Hash Join should use
logtape.c for its partitions too, the first thought I had was that to
be competitive with separate files, perhaps you'd need to write out a
list of block ranges for each tape (rather than just next pointers on
each block), so that you have the visibility required to control
prefetching explicitly. I guess that would be a bit like the list of
physical extents that Linux commands like filefrag(8) and xfs_bmap(8)
can show you for regular files. (Other thoughts included worrying
about how to make it allocate and stream blocks in parallel queries,
...!?#$)
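
To make the block-range idea a little more concrete, here's a rough
sketch of what I have in mind (hypothetical structs and names, nothing
like this exists in logtape.c today): each tape would carry a list of
contiguous extents, and the reader could hint the kernel about the
next extent before it exhausts the current one, for example with
posix_fadvise(POSIX_FADV_WILLNEED):

#include <fcntl.h>

#define BLCKSZ 8192				/* logtape.c block size */

typedef struct TapeExtent
{
	long		start_block;	/* first block of a contiguous run */
	long		nblocks;		/* number of blocks in the run */
} TapeExtent;

typedef struct TapeReadState
{
	int			fd;				/* underlying temp file */
	TapeExtent *extents;		/* block ranges owned by this tape */
	int			nextents;
	int			cur;			/* extent currently being read */
} TapeReadState;

/*
 * While still consuming the current extent, tell the kernel we'll soon
 * want the next one, so the read can be clustered and overlapped with
 * CPU work instead of stalling at each chunk boundary.
 */
static void
prefetch_next_extent(TapeReadState *ts)
{
	if (ts->cur + 1 < ts->nextents)
	{
		TapeExtent *next = &ts->extents[ts->cur + 1];

		(void) posix_fadvise(ts->fd,
							 (off_t) next->start_block * BLCKSZ,
							 (off_t) next->nblocks * BLCKSZ,
							 POSIX_FADV_WILLNEED);
	}
}

That's essentially the same information that filefrag(8) or
xfs_bmap(8) would report for a separate file per partition, just
tracked by logtape.c itself.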