Re: Gather Merge

From: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gather Merge
Date: 2016-11-11 12:56:33
Message-ID: CAGPqQf3eBDMLYCzNWixbNQ8o5Czfr_Qv5eSomDEBstzMROgioQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:

> On Thu, Oct 27, 2016 at 10:50 PM, Rushabh Lathia
> <rushabh(dot)lathia(at)gmail(dot)com> wrote:
> > Please find attached latest patch which fix the review point as well as
> > additional clean-up.
>
> I've signed up to review this patch and I'm planning to do some
> testing. Here's some initial feedback after a quick read-through:
>
>
Thanks Thomas.

> + if (gather_merge_readnext(gm_state, i, initialize ? false : true))
>
> Clunky ternary operator... how about "!initialize".
>
>
Fixed.
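The call now reads:

    if (gather_merge_readnext(gm_state, i, !initialize))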

> +/*
> + * Function clear out a slot in the tuple table for each gather merge
> + * slots and returns the clear clear slot.
> + */
>
> Maybe better like this? "_Clear_ out a slot in the tuple table for
> each gather merge _slot_ and _return_ the _cleared_ slot."
>
>
Fixed.

> + /* Free tuple array as we no more need it */
>
> "... as we don't need it any more"
>
>
Fixed.

> +/*
> + * Read the next tuple for gather merge.
> + *
> + * Function fetch the sorted tuple out of the heap.
> + */
>
> "_Fetch_ the sorted tuple out of the heap."
>
>
Fixed.

> + * Otherwise, pull the next tuple from whichever participate we
> + * returned from last time, and reinsert the index into the heap,
> + * because it might now compare differently against the existing
>
> s/participate/participant/
>
>
Fixed.

> + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
>
> Shouldn't this say just "(c) 2016, PostgreSQL Global Development
> Group"?

Fixed.

> Are we supposed to be blaming the University of California
> for new files?
>
>
Not quite sure about this, so keeping it as is.

> +#include "executor/tqueue.h"
> +#include "miscadmin.h"
> +#include "utils/memutils.h"
> +#include "utils/rel.h"
> +#include "lib/binaryheap.h"
>
> Not correctly sorted.
>
>
Copied from nodeGather.c, but fixed here.
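For reference, the sorted list now reads:

    #include "executor/tqueue.h"
    #include "lib/binaryheap.h"
    #include "miscadmin.h"
    #include "utils/memutils.h"
    #include "utils/rel.h"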

> + /*
> + * store the tuple descriptor into gather merge state, so we can use it
> + * later while initilizing the gather merge slots.
> + */
>
> s/initilizing/initializing/
>
>
Fixed.

> +/* ----------------------------------------------------------------
> + * ExecEndGatherMerge
> + *
> + * frees any storage allocated through C routines.
> + * ----------------------------------------------------------------
>
> The convention in Postgres code seems to be to use a form like "Free
> any storage ..." in function documentation. Not sure if that's an
> imperative, an infinitive, or if the word "we" is omitted since
> English is so fuzzy like that, but it's inconsistent with other
> documentation to use "frees" here. Oh, I see that exact wording is in
> several other files. I guess I'll just leave this as a complaint
> about all those files then :-)
>
>
Sure.

> + * Pull atleast single tuple from each worker + leader and set up the
> heap.
>
> s/atleast single/at least a single/
>
>
Fixed.

> + * Read the tuple for given reader into nowait mode, and form the tuple
> array.
>
> s/ into / in /
>
>
Fixed.

> + * Function attempt to read tuple for the given reader and store it into
> reader
>
> s/Function attempt /Attempt /
>
>
Fixed.

> + * Function returns true if found tuple for the reader, otherwise returns
>
> s/Function returns /Return /
>
>
Fixed.

> + * First try to read tuple for each worker (including leader) into nowait
> + * mode, so that we initialize read from each worker as well as leader.
>
> I wonder if it would be good to standardise on the terminology we use
> when we mean workers AND the leader. In my Parallel Shared Hash work,
> I've been saying 'participants' if I mean = workers + leader. What do
> you think?
>
>
I am not quite sure about "participants". In my opinion, when we explicitly
say workers + leader it's clearer. I am open to changing it if the committer
thinks otherwise.

> + * After this if all active worker unable to produce the tuple, then
> + * re-read and this time read the tuple into wait mode. For the worker,
> + * which was able to produced single tuple in the earlier loop and still
> + * active, just try fill the tuple array if more tuples available.
> + */
>
> How about this? "After this, if all active workers are unable to
> produce a tuple, then re-read and this time use wait mode. For workers
> that were able to produce a tuple in the earlier loop and are still
> active, just try to fill the tuple array if more tuples are
> available."
>
>
Fixed.
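In outline, the startup now works like this (a simplified sketch, not the
exact patch code; the bool nowait parameter is the one discussed further
down):

    bool        nowait = true;
    int         i;

reread:
    for (i = 0; i < nreaders + 1; i++)      /* workers + leader */
    {
        if (TupIsNull(gm_state->gm_slots[i]))
        {
            /* No tuple yet: try to read one; the first pass doesn't
             * block, so every participant gets started. */
            if (gather_merge_readnext(gm_state, i, nowait))
                binaryheap_add_unordered(gm_state->gm_heap,
                                         Int32GetDatum(i));
        }
        else
            form_tuple_array(gm_state, i);  /* top up the tuple array */
    }
    nowait = false;
    /* ... loop back to reread while any slot is still empty ... */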

> + * The heap is never spilled to disk, since we assume N is not very
> large. So
> + * this is much simple then cost_sort.
>
> s/much simple then/much simpler than/
>
>
Fixed.

> + /*
> + * Avoid log(0)...
> + */
> + N = (path->num_workers < 2) ? 2.0 : (double) path->num_workers;
> + logN = LOG2(N);
> ...
> + /* Per-tuple heap maintenance cost */
> + run_cost += path->path.rows * comparison_cost * 2.0 * logN;
>
> Why multiply by two? The comment above this code says "about log2(N)
> comparisons to delete the top heap entry and another log2(N)
> comparisons to insert its successor". In fact gather_merge_getnext
> calls binaryheap_replace_first, which replaces the top element without
> any comparisons at all and then performs a sift-down in log2(N)
> comparisons to find its new position. There is no per-tuple "delete"
> involved. We "replace" the top element with the value it already had,
> just to trigger the sift-down, because we know that our comparator
> function might have a new opinion of the sort order of this element.
> Very clever! The comment and the 2.0 factor in cost_gather_merge seem
> to be wrong though -- or am I misreading the code?
>
> See cost_merge_append.

> Also, shouldn't we add 1 to N to account for the leader? Suppose
> there are 2 workers. There are 3 elements in the binary heap. The
> element to be sifted down must be compared against either 1 or 2
> others to reorganise the heap. Surely in that case we should estimate
> log2(3) = ~1.58 comparisons, not log2(2) = 1 comparison.
>

Yes, good catch. For Gather Merge the leader always participates, so we
should use num_workers + 1.
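Something along these lines, mirroring cost_merge_append() (a sketch of the
intended fix, not the final patch code):

    /*
     * The heap holds one entry per participant (workers + leader), and
     * each returned tuple costs a single log2(N) sift-down, since
     * binaryheap_replace_first() replaces the top entry in place rather
     * than doing a delete followed by an insert.
     */
    N = (double) (path->num_workers + 1);   /* workers + leader */
    logN = LOG2(N);

    /* Per-tuple heap maintenance cost */
    run_cost += path->path.rows * comparison_cost * logN;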

>
> I suspect that the leader's contribution will be equivalent to a whole
> worker if the plan involves a sort: as soon as the leader pulls a
> tuple in gather_merge_init, the sort node will pull all the tuples it
> can in a tight loop. It's unfortunate that cost_seqscan has to
> estimate what the leader's contribution will be without knowing
> whether it has a "greedy" high-startup-cost consumer like a sort or
> hash node where the leader will contribute a whole backend's full
> attention as soon as it executes the plan, or a lazy consumer where
> the leader will probably not contribute much if there are enough
> workers to keep it distracted. In the case of a Gather Merge -> Sort
> -> Parallel Seq Scan plan, I think we will overestimate the number of
> rows (per participant), because cost_seqscan will guess that the
> leader is spending 30% of its time per worker servicing the workers,
> when in fact it will be sucking tuples into a sort node just as fast
> as anyone else. But I don't see what this patch can do about that...
>
>
Exactly. There is a very thin line when it comes to calculating the cost.
In general, while calculating the cost for GM, I just tried to stay close
to Gather + MergeAppend.

> + * When force is true, function reads the tuple into wait mode. For gather
> + * merge we need to fill the slot from which we returned the earlier
> tuple, so
> + * this require tuple to be read into wait mode. During initialization
> phase,
> + * once we try to read the tuple into no-wait mode as we want to
> initialize all
> + * the readers. Refer gather_merge_init() for more details.
> + *
> + * Function returns true if found tuple for the reader, otherwise returns
> + * false.
> + */
> +static bool
> +gather_merge_readnext(GatherMergeState *gm_state, int reader, bool force)
>
> s/into wait mode/in wait mode/
>
> This appears throughout the comments; not sure if I can explain this
> well but "in wait mode" describes a state of being which is wanted
> here, "into wait mode" describes some kind of change or movement or
> insertion.
>
> Perhaps it would be better to say "reads the tuple _queue_ in wait
> mode", just to make clearer that this is talking about the wait/nowait
> feature of tuple queues, and perhaps also note that the leader always
> waits since it executes the plan.
>
>
Fixed. Just chose to s/into wait mode/in wait mode/.

> Maybe we should use "bool nowait" here anyway, mirroring the TupleQueue
> interface? Why introduce another terminology for the same thing with
> inverted sense?
>
>
Agree with you. Changed the gm_readnext_tuple() and gather_merge_readnext()
APIs accordingly.
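The signatures now look something like this (a sketch; the exact parameter
lists are in the attached patch):

    /* "nowait" matches the TupleQueueReader terminology directly */
    static bool gather_merge_readnext(GatherMergeState *gm_state,
                                      int reader, bool nowait);
    static HeapTuple gm_readnext_tuple(GatherMergeState *gm_state,
                                       int nreader, bool nowait,
                                       bool *done);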

> +/*
> + * Read the tuple for given reader into nowait mode, and form the tuple
> array.
> + */
> +static void
> +form_tuple_array(GatherMergeState *gm_state, int reader)
>
> This function is strangely named. How about try_to_fill_tuple_buffer
> or something?
>
> + GMReaderTuple *gm_tuple = &gm_state->gm_tuple[reader];
>
> I wonder if the purpose of gm_tuple, would be clearer if it were
> called gm_tuple_buffers. Plural because it holds one buffer per
> reader. Then in that variable on the left hand side there could be
> called tuple_buffer (singular), because it's the buffer of tuples for
> one single reader.
>
>
Yes, you are right. I renamed the variable as well as the structure.
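The renamed struct now looks something like this (sketch; see the patch for
the actual definition):

    /* Buffer of pre-read tuples for one participant */
    typedef struct GMReaderTupleBuffer
    {
        HeapTuple  *tuple;          /* array of pre-read tuples */
        int         readCounter;    /* next tuple to return */
        int         nTuples;        /* tuples currently buffered */
        bool        done;           /* reader exhausted? */
    } GMReaderTupleBuffer;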

PFA the latest patch, which addresses the review comments as well as a few
other cleanups.

Apart from this, my colleague Rafia Sabih reported one regression with GM:
if we set work_mem high enough to accommodate the sort operation, the GM
path gets selected even though Sort performs much better.

Example:

create table t (i int);
insert into t values(generate_series(1,10000000));
set work_mem =1024000;
explain analyze select * from t order by i;
set enable_gathermerge =off;
explain analyze select * from t order by i;

postgres=# explain analyze select * from t order by i;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=335916.26..648415.76 rows=2499996 width=4) (actual time=2234.145..7628.555 rows=10000000 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   ->  Sort  (cost=334916.22..341166.21 rows=2499996 width=4) (actual time=2226.609..2611.041 rows=2000000 loops=5)
         Sort Key: i
         Sort Method: quicksort  Memory: 147669kB
         ->  Parallel Seq Scan on t  (cost=0.00..69247.96 rows=2499996 width=4) (actual time=0.034..323.129 rows=2000000 loops=5)
 Planning time: 0.061 ms
 Execution time: 8143.809 ms
(9 rows)

postgres=# set enable_gathermerge = off;
SET
postgres=# explain analyze select * from t order by i;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1306920.83..1331920.79 rows=9999985 width=4) (actual time=3521.143..4854.148 rows=10000000 loops=1)
   Sort Key: i
   Sort Method: quicksort  Memory: 854075kB
   ->  Seq Scan on t  (cost=0.00..144247.85 rows=9999985 width=4) (actual time=0.113..1340.758 rows=10000000 loops=1)
 Planning time: 0.100 ms
 Execution time: 5535.560 ms
(6 rows)

Looking at the plan, I realized that this is happening because of wrong
costing for Gather Merge. In the plan we can see that the row count
estimated by Gather Merge is wrong. This is because the earlier patch was
using rows = subpath->rows, which is not correct since the subpath is a
partial path, so we need to multiply it by the number of workers. The
attached patch also fixes this issue. I also ran the TPC-H benchmark with
the patch and the results are the same as earlier.
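The fix amounts to something like this (sketch; names taken from the Gather
path code, the patch may differ):

    /*
     * subpath is a partial path, so subpath->rows is a per-worker
     * estimate; scale it up to get the total number of rows Gather
     * Merge is expected to emit.
     */
    pathnode->path.rows = subpath->rows * pathnode->num_workers;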

Thanks,
Rushabh Lathia
www.EnterpriseDB.com
