Re: Parallel Seq Scan vs kernel read ahead

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan vs kernel read ahead
Date: 2020-06-15 21:09:05
Message-ID: CAApHDvqi+z0+zgUuKFJDntPnTgb8X31w0PZJw-4M5b9MFE5woQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, 16 Jun 2020 at 03:29, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Sat, Jun 13, 2020 at 2:13 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > The performance can vary based on qualification where some workers
> > discard more rows as compared to others, with the current system with
> > step-size as one, the probability of unequal work among workers is
> > quite low as compared to larger step-sizes.
>
> It seems like this would require incredibly bad luck, though. If the
> step size is less than 1/1024 of the relation size, and we ramp down
> for, say, the last 5% of the relation, then the worst case is that
> chunk 972 of 1024 is super-slow compared to all the other blocks, so
> that it takes longer to process chunk 972 only than it does to process
> chunks 973-1024 combined. It is not impossible, but that chunk has to
> be like 50x worse than all the others, which doesn't seem like
> something that is going to happen often enough to be worth worrying
> about very much. I'm not saying it will never happen. I'm just
> skeptical about the benefit of adding a GUC or reloption for a corner
> case like this. I think people will fiddle with it when it isn't
> really needed, and won't realize it exists in the scenarios where it
> would have helped.

I'm trying to think of likely scenarios where "lots of work at the
end" is going to be common. I can think of queue processing, but
everything I can think of there requires an UPDATE to the processed
flag, which won't be using parallel query anyway. Then there's
processing rows from some recent period of time, like "the last
hour" or "today". For append-only tables the latest information is
likely to be at the end of the heap, but anyone who's getting a
SeqScan on a large relation for that should likely have added an
index. If a btree is too costly, then BRIN is pretty much perfect for
that case.

FWIW, I'm not really keen on adding a reloption or a GUC. I've also
voiced here that I'm not even keen on the ramp-up.

To summarise what's all been proposed so far:

1. Use a constant (e.g. 64) as the parallel step size.
2. Ramp up the step size over time.
3. Ramp down the step size towards the end of the scan.
4. Auto-determine a good step size based on the size of the relation.
5. Add a GUC to allow users to control or override the step size.
6. Add a reloption to allow users to control or override the step size.

Here are my thoughts on each of those:

#1 is a bad idea, as there are legitimate use cases for parallel
query on small tables, e.g. calling some expensive parallel-safe
function. Small tables are more likely to be cached.
#2 I don't quite understand why this is useful.
#3 I understand this is to try to make it so all workers complete at
around the same time.
#4 We really should be doing it this way.
#5 Having a global knob to control something that is very specific to
the size of a relation does not make much sense to me.
#6 I imagine someone will have some weird use case that works better
when parallel workers get 1 page at a time. I'm not convinced that
they're not doing something else wrong.

So my vote is for 4, with possibly 3 if we can come up with something
smart enough* that works well in parallel. I think there's less of a
need for this if we divide the relation into more chunks, e.g. 8192
or 16384.

* Perhaps when there are fewer than 2 full chunks remaining, workers
can just take half of what is left, or more specifically
Max(pg_next_power2(remaining_blocks) / 2, 1), which ideally would work
out to allocating a number of pages proportional to the amount of beer
each mathematician receives in the "An infinite number of
mathematicians walk into a bar" joke, obviously with the exception
that we stop dividing when we get to 1. However, I'm not quite sure
how well that can be made to work with multiple bartenders working in
parallel.
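
To make that a bit more concrete, here's a rough standalone sketch of
how such a ramp-down could dole out the tail of the scan. The fixed
chunk_size, the block counts and the next_power_of_2() helper below are
just illustrative placeholders for this email, not actual PostgreSQL
code:

#include <stdio.h>

#define Max(a, b) ((a) > (b) ? (a) : (b))

/*
 * Round v up to the nearest power of 2.  A stand-in for the
 * pg_next_power2() helper mentioned above, not the real PostgreSQL
 * function.
 */
static unsigned int
next_power_of_2(unsigned int v)
{
	unsigned int p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

int
main(void)
{
	unsigned int chunk_size = 64;	/* stand-in for whatever #4 would
									 * derive from the relation size */
	unsigned int remaining = 1024;	/* made-up number of blocks left */

	while (remaining > 0)
	{
		unsigned int chunk = chunk_size;

		/*
		 * Once fewer than 2 full chunks remain, hand out roughly half
		 * of what's left each time, never going below 1 block.
		 */
		if (remaining < 2 * chunk_size)
			chunk = Max(next_power_of_2(remaining) / 2, 1);

		if (chunk > remaining)
			chunk = remaining;

		printf("allocate %u block(s), %u remaining\n",
			   chunk, remaining - chunk);
		remaining -= chunk;
	}

	return 0;
}

With those numbers it hands out 64-block chunks until only 64 blocks
remain, then 32, 16, 8, 4, 2, 1 and a final 1 - roughly the beer
sequence above, stopping at a single block rather than dividing
forever.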

David
