Re: modeling parallel contention (was: Parallel Append implementation)

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: modeling parallel contention (was: Parallel Append implementation)
Date: 2017-05-05 06:09:28
Message-ID: CAJ3gD9eWwQpcuUkV9mkHp9h2O0viYwGi31XAUpea-zLJn9-ZZQ@mail.gmail.com
Lists: pgsql-hackers

On 5 May 2017 at 07:50, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> On 3 May 2017 at 07:13, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> It is of course possible that the Parallel Seq Scan could run into
>> contention problems if the number of workers is large, but in my
>> experience there are bigger problems here. The non-parallel Seq Scan
>> can also contend -- not of course over the shared mutex because there
>> isn't one, but over access to the blocks themselves. Every one of
>> those blocks has a content lock and a buffer header and so on, and
>> having multiple processes accessing those things at the same time
>> scales well, but not perfectly. The Hash node can also contend: if
>> the hash join spills to disk, you've got multiple processes reading
>> and writing to the temp directory at the same time and, of course,
>> that can be worse than just one process doing it -- sometimes much
>> worse. It can also be better, depending on how much I/O gets
>> generated and how much I/O bandwidth you have.
>
> Yeah, I did get some time to look over the contention in Parallel Seq
> Scan a while back and I discovered that on the machine that I was
> testing on, the lock obtained in heap_parallelscan_nextpage() was
> causing workers to have to wait for other workers to fetch their next
> task to work on.
>
> I ended up writing the attached (which I'd not intended to post until
> some time closer to when the doors open for PG11). At the moment it's
> basically just a test patch to see how it affects things when we give
> workers a bit more to do before they come back to look for more work.
> In this case, I've just given them 10 pages to work on, instead of the
> 1 that's allocated in 9.6 and v10.
>
> A quick test on a pretty large table on a large machine shows:
>
> Unpatched:
>
> postgres=# select count(*) from a;
> count
> ------------
> 1874000000
> (1 row)
>
> Time: 5211.485 ms (00:05.211)
>
> Patched:
>
> postgres=# select count(*) from a;
> count
> ------------
> 1874000000
> (1 row)
>
> Time: 2523.983 ms (00:02.524)

The result is quite impressive.

>
> So it seems worth looking into. "a" was just a table with a single int
> column. I'm unsure as yet if there are more gains to be had for tables
> with wider tuples. I do suspect the patch will be a bigger win in
> those cases, since there's less work to do for each page, e.g. fewer
> advance aggregate calls, so likely they'll be looking for their next
> page a bit sooner.
>
> Now I'm not going to pretend that this patch is ready for the
> prime-time. I've not yet worked out how to properly report sync-scan
> locations without risking reporting later pages after reporting the
> end of the scan. What I have at the moment could cause a report to be
> missed if SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch
> size. I'm also not sure how batching like this affects read-aheads, but
> at least the numbers above speak for something. Although none of the
> pages in this case came from disk.
>
> I'd had thoughts that the 10 pages wouldn't be constant, but the
> batching size would depend on the size of the relation to be scanned.
> I'd had rough ideas to just try to make about 1 million batches. Something
> like batch_pages = Max(parallel_scan->phs_nblocks / 1000000, 1); so
> that we only take more than 1 page if there's some decent amount to
> process. We don't want to make the batches too big as we might end up
> having to wait on slow workers at the end of a scan.
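
(As an aside, here's a minimal standalone sketch of that batched
block-claiming, just to make the shape of the idea concrete. The
struct and function names are invented for illustration; this is not
the actual heap_parallelscan_nextpage() code, only a simplified model
of taking the lock once per batch instead of once per block.)

#include <pthread.h>
#include <stdint.h>

/*
 * A hypothetical shared scan state, loosely modelled on a parallel
 * scan descriptor: one mutex protecting a "next unclaimed block"
 * counter.
 */
typedef struct SharedScanState
{
    pthread_mutex_t mutex;       /* protects next_block */
    uint64_t        nblocks;     /* total blocks in the relation */
    uint64_t        next_block;  /* next block nobody has claimed yet */
} SharedScanState;

#define Max(a, b) ((a) > (b) ? (a) : (b))

/*
 * Claim a contiguous batch of blocks for one worker.  Returns the
 * number of blocks claimed (0 once the scan is exhausted) and sets
 * *first_block to the start of the claimed range.  The mutex is taken
 * once per batch rather than once per block, which is the point of
 * the batching.
 */
static uint64_t
claim_block_batch(SharedScanState *scan, uint64_t *first_block)
{
    /* Aim for roughly a million batches per scan, but at least 1 page. */
    uint64_t batch_pages = Max(scan->nblocks / 1000000, 1);
    uint64_t claimed = 0;

    pthread_mutex_lock(&scan->mutex);
    if (scan->next_block < scan->nblocks)
    {
        *first_block = scan->next_block;
        claimed = scan->nblocks - scan->next_block;
        if (claimed > batch_pages)
            claimed = batch_pages;
        scan->next_block += claimed;
    }
    pthread_mutex_unlock(&scan->mutex);

    return claimed;
}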

I was wondering: if we keep increasing the batch size, that might
lead to I/O contention. The larger the batch size, the higher the
chance of causing more random I/O, because all the workers would be
accessing disk blocks far away from each other in parallel. So there
might be a trade-off here. (Separately, I/O contention testing needs
to be done in general, for many scenarios.)

I believe there are certain parallel scans (parallel bitmap heap
scan?) where the logic to find the next block itself consumes time,
so the waits would be correspondingly longer.

What if we supplied each worker with a sequence of blocks to be
scanned, rather than a range of blocks? Each worker would have a list
of its next few blocks, say:
w1 : 1, 5, 9, 13
w2 : 2, 6, 10, 14
w3 : 3, 7, 11, 15
w4 : .....

Maybe the leader could do the accounting and store the instructions
for each of the workers at individual locations in shared memory, so
there wouldn't be any contention while accessing them.

This may be simple and applicable for a sequential scan, but not for
other scan types, for some of which it may not even be possible. But
basically I was thinking of a way to tackle both shared memory
contention and random I/O. A rough standalone sketch of the idea is
below.
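
(For what it's worth, a minimal standalone sketch of what the
leader's accounting could look like for the sequential-scan case. All
of the names here are invented purely for illustration; this is not a
proposal for any actual API.)

#include <stdint.h>
#include <stdlib.h>

/*
 * Per-worker block list.  Each worker only ever touches its own slot,
 * so there is no contention on a single shared counter while scanning.
 */
typedef struct WorkerBlockList
{
    uint64_t *blocks;    /* blocks assigned to this worker, in order */
    uint64_t  nblocks;   /* number of entries in blocks[] */
    uint64_t  pos;       /* next entry this worker will scan */
} WorkerBlockList;

/*
 * Leader-side accounting: deal out blocks 0..nblocks-1 round-robin, so
 * worker w gets blocks w, w + nworkers, w + 2*nworkers, ... analogous
 * to the w1/w2/w3 example above (0-based here).
 */
static WorkerBlockList *
assign_blocks_strided(uint64_t nblocks, int nworkers)
{
    WorkerBlockList *lists = calloc(nworkers, sizeof(WorkerBlockList));

    for (int w = 0; w < nworkers; w++)
    {
        /* ceil((nblocks - w) / nworkers) blocks fall to worker w */
        uint64_t count = (nblocks > (uint64_t) w)
            ? (nblocks - w + nworkers - 1) / nworkers
            : 0;

        lists[w].blocks = malloc(count * sizeof(uint64_t));
        lists[w].nblocks = count;
        lists[w].pos = 0;

        /* Worker w owns every nworkers-th block, starting at block w. */
        for (uint64_t i = 0; i < count; i++)
            lists[w].blocks[i] = (uint64_t) w + i * (uint64_t) nworkers;
    }
    return lists;
}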

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
