Re: Parallel Seq Scan vs kernel read ahead

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan vs kernel read ahead
Date: 2020-06-10 12:33:45
Message-ID: CAApHDvr1DG5oPPEpOf0pf+2DAD1A-vuU1dig5n5zsQYHnkcKsg@mail.gmail.com
Lists: pgsql-hackers

On Wed, 10 Jun 2020 at 17:39, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>
> On Wed, 10 Jun 2020 at 17:21, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> > I also heard from Andres that he likes this patch with his AIO
> > prototype, because of the way request merging works. So it seems like
> > there are several reasons to want it.
> >
> > But ... where should we get the maximum step size from? A GUC?
>
> I guess we'd need to determine if other step sizes were better under
> any conditions. I guess one condition would be if there was a LIMIT
> clause. I could check if setting it to 1024 makes any difference, but
> I'm thinking it won't since I got fairly consistent results on all
> worker settings with the patched version.

I did another round of testing on the same machine, trying some step
sizes larger than 64 blocks. I can confirm that going bigger than 64
does improve the situation further.

I got up as far as 16384, but made a couple of additional changes for
that run only. Instead of increasing the step size by 1 block at a
time, I initialised phsw_step_size to 1 and doubled it until it
reached the chosen step size. With numbers that big, ramping up 1
block at a time would have been so slow that I'd never have reached
the target step size.
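
To make that concrete, the ramp-up logic is roughly the following (a
standalone C sketch rather than the actual patch code; other than
phsw_step_size, the names are made up for illustration):

#include <stdint.h>

/* Hypothetical per-worker scan state, for illustration only. */
typedef struct WorkerScanState
{
    uint32_t    phsw_step_size;     /* blocks claimed per step right now */
    uint32_t    target_step_size;   /* step size we want to ramp up to */
} WorkerScanState;

/*
 * Return how many blocks to claim for the next step, doubling the step
 * size each call until the target is reached.  Growing by 1 block per
 * step would take far too long to reach something like 16384.
 */
static uint32_t
next_step_size(WorkerScanState *state)
{
    uint32_t    step = state->phsw_step_size;

    if (state->phsw_step_size < state->target_step_size)
    {
        state->phsw_step_size *= 2;
        if (state->phsw_step_size > state->target_step_size)
            state->phsw_step_size = state->target_step_size;
    }

    return step;
}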

Here are the results of the testing:

Master:

max_parallel_workers_per_gather = 0: Time: 148481.244 ms (02:28.481) (706.2MB/sec)
max_parallel_workers_per_gather = 1: Time: 327556.121 ms (05:27.556) (320.1MB/sec)
max_parallel_workers_per_gather = 2: Time: 329055.530 ms (05:29.056) (318.6MB/sec)

Patched stepsize = 64:

max_parallel_workers_per_gather = 0: Time: 141363.991 ms (02:21.364) (741.7MB/sec)
max_parallel_workers_per_gather = 1: Time: 144982.202 ms (02:24.982) (723.2MB/sec)
max_parallel_workers_per_gather = 2: Time: 143355.656 ms (02:23.356) (731.4MB/sec)

Patched stepsize = 1024:

max_parallel_workers_per_gather = 0: Time: 152599.159 ms (02:32.599) (687.1MB/sec)
max_parallel_workers_per_gather = 1: Time: 104227.232 ms (01:44.227) (1006.04MB/sec)
max_parallel_workers_per_gather = 2: Time: 97149.343 ms (01:37.149) (1079.3MB/sec)

Patched stepsize = 8192:

max_parallel_workers_per_gather = 0: Time: 143524.038 ms (02:23.524) (730.59MB/sec)
max_parallel_workers_per_gather = 1: Time: 102899.288 ms (01:42.899) (1019.0MB/sec)
max_parallel_workers_per_gather = 2: Time: 91148.340 ms (01:31.148) (1150.4MB/sec)

Patched stepsize = 16384 (power-of-2 ramp-up):

max_parallel_workers_per_gather = 0: Time: 144598.502 ms (02:24.599) (725.16MB/sec)
max_parallel_workers_per_gather = 1: Time: 97344.160 ms (01:37.344) (1077.1MB/sec)
max_parallel_workers_per_gather = 2: Time: 88025.420 ms (01:28.025) (1191.2MB/sec)

I thought about what you mentioned about a GUC, and I think that
would be a bad idea. It would be better to choose the step size based
on the relation size. For smaller relations we want to keep the step
size small: someone may enable parallel query on a small relation
because they're doing something like calling an expensive function on
the results, so we do need to avoid using large steps for small
relations.

I considered something like:

create function nextpower2(a bigint) returns bigint as $$
declare n bigint := 1;
begin
  while n < a loop n := n * 2; end loop;
  return n;
end; $$ language plpgsql;

select pg_size_pretty(power(2,p)::numeric * 8192) as rel_size,
       nextpower2(power(2,p)::bigint / 1024) as stepsize
from generate_series(1,32) p;
rel_size | stepsize
----------+----------
16 kB | 1
32 kB | 1
64 kB | 1
128 kB | 1
256 kB | 1
512 kB | 1
1024 kB | 1
2048 kB | 1
4096 kB | 1
8192 kB | 1
16 MB | 2
32 MB | 4
64 MB | 8
128 MB | 16
256 MB | 32
512 MB | 64
1024 MB | 128
2048 MB | 256
4096 MB | 512
8192 MB | 1024
16 GB | 2048
32 GB | 4096
64 GB | 8192
128 GB | 16384
256 GB | 32768
512 GB | 65536
1024 GB | 131072
2048 GB | 262144
4096 GB | 524288
8192 GB | 1048576
16 TB | 2097152
32 TB | 4194304

So with that algorithm, the 100GB table that I've been using in my
tests would get a step size of 16384, and I think we'd want to avoid
going any higher than that. The above code means each step covers
between just under 0.1% and 0.2% of the relation. If I divided the
number of blocks by, say, 128 instead of 1024, each step would cover
about 0.78% to 1.56% of the relation. It's not unrealistic today that
someone might throw that many workers at a job, and with steps that
large the work could end up being divided quite unevenly between
them, so I'd say dividing by 1024, or even 2048, would likely be
about right.
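
For what it's worth, the heuristic could look something like this
(again just a standalone C sketch under the assumptions above; the
divide-by-1024 and the 16384 cap are the numbers discussed here, not
anything that's been settled on):

#include <stdint.h>

/*
 * Pick a parallel scan step size from the relation size in blocks:
 * divide by 1024, round up to the next power of 2 (like the
 * nextpower2() SQL above), and cap it at 16384 blocks (128MB with
 * 8kB blocks) so huge relations don't get even larger steps.
 */
static uint32_t
choose_step_size(uint32_t nblocks)
{
    uint32_t    target = nblocks / 1024;
    uint32_t    step = 1;

    while (step < target)
        step *= 2;

    return (step > 16384) ? 16384 : step;   /* never below 1 block */
}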

David
