Re: BitmapHeapScan streaming read user and prelim refactoring

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date: 2025-02-10 22:06:08
Message-ID: cg354qt5fuecp2gbdrervaf6psg3ycrwamvqrzfdlnvtgiwldy@dbz2zekqvawk
Lists: pgsql-hackers

Hi,

On 2025-02-10 16:24:25 -0500, Robert Haas wrote:
> On Mon, Feb 10, 2025 at 1:11 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> > Certainly for the "localized" regressions, and cases when bitmapheapscan
> > would not be picked. The eic=1 case makes me a bit more nervous, because
> > it's default and affects NVMe storage. Would be good to know why is
> > that, or perhaps consider bumping up eic default. Not sure.
>
> I'm relatively upset by the fact that effective_io_concurrency is
> measured in units that are, AFAIUI, completely stupid. The original
> idea was that you would set it to the number of spindles you have. But
> from what I have heard, that didn't actually work: you needed to set
> it to a significantly higher number.

Which isn't too surprising. I think there are a few main reasons #spindles
was way too low:

1) Decent interfaces to rotating media benefit very substantially from having
multiple requests in flight for each spindle. The disk can reorder the
actual disk accesses so that each one can be serviced without waiting, on
average, half a rotation per request. Similar reordering can also happen at
the OS level, even if not to quite the same degree of benefit.

2) Even if each spindle could only execute one IO request at a time, you would
still want to start those requests substantially earlier than
effective_io_concurrency=#spindles allows, to give the IOs time to complete.
With e.g. bitmap heap scans, there often are sequences of blocks that can be
read together in one IO.

3) A single spindle commonly had multiple platters and a disk could read from
multiple platters within a single rotation.

However, I'm not so sure that the unit of effective_io_concurrency is really
the main issue. It IMO makes sense to limit the max number of requests one
scan can trigger, to prevent one scan from completely swamping the IO request
queue for the next several seconds (trivially possible without a limit on
smaller cloud storage disks with low iops).

It's plausible that we would want to separately control the "distance" from
the currently "consumed" position and the maximum number of IO requests in
flight. But I think we might be able to come up with a tuning algorithm that
adjusts the distance based on the observed latency and "consumption" speed.

I think what we eventually should get rid of is the "effective_" prefix. My
understanding is that it's basically there to denote that we don't really have
an idea what concurrency is achieved, as posix_fadvise() neither tells us
whether IO was necessary in the first place, nor whether multiple blocks could
be read in one IO, nor when the IO has completed.

The read_stream implementation already has redefined effective_io_concurrency
to only count one IO for a multi-block read.

Of course that doesn't mean we shouldn't change the default (we very clearly
should) or the documentation for the GUC.

Greetings,

Andres Freund
