Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date: 2018-01-30 00:06:03
Message-ID: CAH2-WzkK4qnRuM-kh+9w4AbyyPz9+WEUv_8wNFbPd9Eb9xn=2A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jan 27, 2018 at 12:20 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> I have posted the patch for the above API and posted it on a new
> thread [1]. Do let me know either here or on that thread if the patch
> suffices your need?

I've responded to you over on that thread. Thanks again for helping me.

I already have a revision of my patch lined up that is coded to target
your new WaitForParallelWorkersToAttach() interface, plus some other
changes. These include:

* Make the leader's worker Tuplesortstate complete before the main
leader Tuplesortstate even begins, making it very clear that nbtsort.c
does not rely on knowing the number of launched workers up-front. That
should make Robert a bit happier about our ability to add additional
workers fairly late in the process, in a future tuplesort client that
finds that to be useful.

* I've added a new "parallel" argument to index_build(), which
controls whether or not we even call the plan_create_index_workers()
cost model. When this is false, we always do a serial build. This was
added because I noticed that TRUNCATE REINDEXes the table at a time
when parallelism couldn't possibly be useful, which still used
parallelism. Best to have the top-level caller opt in or opt out.

* Polished the docs some more.

* Improved commentary on randomAccess/writable leader handling within
logtape.c. We could still support that, if we were willing to make
shared Buffiles that are opened within another backend writable. I'm
not proposing to do that, but it's nice that we could.

I hesitate to post something that won't cleanly apply on the master
branch's tip, but otherwise I am ready to send this new revision of
the patch right away. It seems likely that Robert will commit your
patch within a matter of days, once some minor issues are worked
through, at which point I'll send what I have. if anyone prefers, I
can post the patch immediately, and break out the
WaitForParallelWorkersToAttach() as the second patch in a cumulative
patch set. Right now, I'm out of things to work on here.

Notes on how I've stress-tested parallel CREATE INDEX:

I can recommend using the amcheck heapallverified functionality [1]
from the Github version of amcheck to test this patch. You will need
to modify the call to IndexBuildHeapScan() that the extension makes,
to add a new NULL "scan" argument, since parallel CREATE INDEX changes
the signature of IndexBuildHeapScan(). That's trivial, though.

Note that parallel CREATE INDEX should produce relfiles that are
physically identical to a serial CREATE INDEX, since index tuplesorts
are generally always deterministic. IOW, we use a heap TID tie-breaker
within tuplesort.c from B-Tree index tuples, which assures us that
varying maintenance_work_mem won't affect the final output even in a
tiny, insignificant way -- using parallelism should not change
anything about the exact output, either. At one point I was testing
this patch by verifying not only that indexes were sane, but that they
were physically identical to what a serial sort (in the master branch)
produced (I only needed to mask page LSNs).

Finally, yet another good way to test this patch is to verify that
everything continues to work when MAX_PHYSICAL_FILESIZE is modified to
be BLCKSZ (2^13 rather than 2^30). You will get many many BufFile
segments that way, which could in theory reveal bugs in rare edge
cases that I haven't considered. This is a strategy that led to my
finding a bug in v10 at one point [2], as well as bugs in earlier
versions of Thomas' parallel hash join patch set. It worked for me
twice already, so it seems like a good strategy. It may be worth
*combining* with some other stress-testing strategy.

[1] https://github.com/petergeoghegan/amcheck#optional-heapallindexed-verification
[2] https://www.postgresql.org/message-id/CAM3SWZRWdNtkhiG0GyiX_1mUAypiK3dV6-6542pYe2iEL-foTA@mail.gmail.com
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-01-30 00:15:13 Re: [HACKERS] MERGE SQL Statement for PG11
Previous Message Peter Eisentraut 2018-01-29 23:45:38 Re: Security leak with trigger functions?