Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2016-08-06 19:48:18
Message-ID: CAM3SWZTJAFPcZdsS6_DvgHzOqk0D-opE-Y-8MfnAp6J+FN6v2g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Aug 6, 2016 at 6:46 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> I think here some of the factors like how many workers will be used
> for merge phase might impact the performance. Having too many
> workers can lead to more communication cost and having too few workers
> might not yield best results for merge. One thing, I have noticed
> that in general for sorting, some of the other databases uses range
> partitioning [1], now that might not be what is good for us.

I don't disagree with anything you say here. I acknowledged that
partitioning will probably be important for sorting in my introductory
e-mail, after all.

> I see
> you mentioned above that why it is not good [2], but I don't
> understand why you think it is a risky assumption to assume good
> partition boundaries for parallelizing sort.

Well, apparently there are numerous problems with partitioning in
systems like SQL Server and Oracle in the worst case. For one thing,
in the event of a misestimation (or failure of the dynamic sampling
that I presume can sometimes be used), workers can be completely
starved of work for the entire duration of the sort. And for CREATE
INDEX to get much of any benefit, all workers must write their part of
the index independently, too. This can affect the physical structure
of the final index. SQL Server also has a caveat in its documentation
about this resulting in an unbalanced final index, which I imagine
could be quite bad in the worst case.

I believe that it's going to be hard to get any version of this that
writes the index simultaneously in each worker accepted for these
reasons. This patch I came up with isn't very different from the
serial case at all. Any index built in parallel by the patch ought to
have relfilenode files on the filesystem that are 100% identical to
those produced by the serial case, in fact (since CREATE INDEX does
not set LSNs in the new index pages). I've actually developed a simple
way of "fingerprinting" indexes during testing of this patch, knowing
that hashing the files on disk ought to produce a perfect match
compared to a master branch serial sort case.

At the same time, any information that I've seen about how much
parallel CREATE INDEX speeds things up in these other systems
indicates that the benefits are very similar. It tends to be in the 2x
- 3x range, with the same reduction in throughput seen at about 16
workers, after we peak at about 8 workers. So, I think that the
benefits of partitioning are not really seen with CREATE INDEX (I
think of partitioning as more of a parallel query thing). Obviously,
any benefit that might still exist for CREATE INDEX in particular,
when weighed against the costs, makes partitioning look pretty
unattractive as a next step.

I think that during the merge phase of parallel CREATE INDEX as
implemented, the system generally still isn't that far from being I/O
bound. Whereas, with parallel query, partitioning makes each worker
able to return one tuple from its own separated range very quickly,
not just one worker (presumably, each worker merges non-overlapping
"ranges" from runs initially sorted in each worker. Each worker
subsequently merges after a partition-wise redistribution of the
initial fully sorted runs, allowing for dynamic sampling to optimize
the actual range used for load balancing.). The workers can then do
more CPU-bound processing in whatever node is fed by each worker's
ranged merge; everything is kept busy. That's the approach that I
personally had in mind for partitioning, at least. It's really nice
for parallel query to be able to totally separate workers after the
point of redistribution. CREATE INDEX is not far from being I/O bound
anyway, though, so it benefits far less. (Consider how fast the merge
phase still is at writing out the index in *absolute* terms.)

Look at figure 9 in this paper: http://www.vldb.org/pvldb/vol7/p85-balkesen.pdf

Even in good cases for "independent sorting", there is only a benefit
seen at 8 cores. At the same time, I can only get about 6x scaling
with 8 workers, just for the initial generation of runs.

All of these factors are why I believe I'm able to compete well with
other systems with this relatively straightforward, evolutionary
approach. I have a completely open mind about partitioning, but my
approach makes sense in this context.

--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Corey Huinker 2016-08-06 20:33:19 Re: Oddity with NOT IN
Previous Message Jim Nasby 2016-08-06 18:13:04 Re: Oddity with NOT IN