Re: asynchronous and vectorized execution

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: asynchronous and vectorized execution
Date: 2016-06-29 15:00:21
Message-ID: CAJ3gD9cN6dj8+jYES0M7MZ1pZdj4x6L5NVgw1aUwoShpJ1WhTw@mail.gmail.com
Lists: pgsql-hackers

We may also want to consider handling abstract events such as
"tuples-are-available-at-plan-node-X".

One benefit is that we can combine this with batch processing. For
example, in the case of an Append node containing foreign scans, its
parent node may not want to process the Append node's result until the
Append is ready with at least 1000 rows. In that case, the Append node
needs to raise an "n-tuples-are-ready" event; we cannot rely on
fd-ready events alone. An fd-ready event will wake up the foreign scan,
but it will not necessarily cause the foreign scan's parent Append node
to in turn wake up its own parent.
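
A rough sketch of what I mean (AsyncEvent, ExecAsyncRaiseEvent and the
as_nready counter are all names I am making up here, not anything from
the posted patches):

/* Hypothetical event payload: which node is ready, and with how much. */
typedef struct AsyncEvent
{
    PlanState  *node;       /* plan node raising the event */
    int         ntuples;    /* tuples now available at that node */
} AsyncEvent;

static void
append_maybe_raise_event(AppendState *node)
{
    /*
     * Suppose the Append accumulates child tuples somewhere and signals
     * its parent only once at least 1000 of them are ready.
     */
    if (node->as_nready >= 1000)
    {
        AsyncEvent  ev;

        ev.node = (PlanState *) node;
        ev.ntuples = node->as_nready;
        ExecAsyncRaiseEvent(node->ps.state, &ev);
    }
}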

The other benefit (I am not sure how significant it is) is the
"at-plan-node-X" part of the event. For example, for an Append node
with 10 foreign scans, when one foreign scan wakes up and becomes ready
with tuple(s), its parent node (i.e. the Append) gets executed. But
things would be faster if the Append knew exactly which of its foreign
scan nodes are ready. From Robert's prototype, I can see that it can
find that out by checking the result_ready field of each foreign scan's
plan state. But if the event itself tells it that node X is the one
that is ready, it can take tuples directly from there. Another thing
is, we may want to give the Append node a chance to see all the nodes
that are ready, instead of just one node.
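
For instance (using the AsyncEvent sketch above, and the result /
result_ready fields from Robert's prototype):

static Node *
append_fetch_from_ready_child(AppendState *node, AsyncEvent *ev)
{
    int     i;

    /* If the event names the ready node, go straight to it. */
    if (ev != NULL && ev->node->result_ready)
        return ev->node->result;

    /* Otherwise poll all the children, as the prototype does today. */
    for (i = 0; i < node->as_nplans; i++)
    {
        PlanState  *child = node->appendplans[i];

        if (child->result_ready)
            return child->result;
    }
    return NULL;
}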

How we can do this event abstraction is the other question. We could
have one latch for each event, and each node would raise its own event
by setting the corresponding latch. But I am not sure latches are
appropriate for signalling within a single process, as opposed to one
process waking up another. Otherwise, some internal event structures
would need to be present (in the EState?), which ExecProcNode would
then use when it loops over events to wake up (i.e. execute) the
required nodes.
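
One possible shape for those structures (names hypothetical again): a
small queue hanging off the EState that nodes append AsyncEvents to,
and that the loop in ExecProcNode drains:

typedef struct ExecEventQueue
{
    AsyncEvent *events;     /* pending events raised by plan nodes */
    int         nevents;
} ExecEventQueue;

/*
 * EState would grow a member such as:
 *     ExecEventQueue *es_event_queue;
 * Raising an event is then just appending to the queue, with no
 * cross-process signalling needed.
 */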

Also, the WaitEvent.user_data field can carry some more info besides
the plan state. It can store the node's parent PlanState, so that we
don't have to add a parent field to PlanState itself. It can also carry
extra data such as "n-tuples-available".

On 9 May 2016 at 23:03, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Hi,
>
> I realize that we haven't gotten 9.6beta1 out the door yet, but I
> think we can't really wait much longer to start having at least some
> discussion of 9.7 topics, so I'm going to go ahead and put this one
> out there. I believe there are other people thinking about these
> topics as well, including Andres Freund, Kyotaro Horiguchi, and
> probably some folks at 2ndQuadrant (but I don't know exactly who). To
> make a long story short, I think there are several different areas
> where we should consider major upgrades to our executor. It's too
> slow and it doesn't do everything we want it to do. The main things
> on my mind are:
>
> 1. asynchronous execution, by which I mean the ability of a node to
> somehow say that it will generate a tuple eventually, but is not yet
> ready, so that the executor can go run some other part of the plan
> tree while it waits. This case most obviously arises for foreign
> tables, where it makes little sense to block on I/O if some other part
> of the query tree could benefit from the CPU; consider SELECT * FROM
> lt WHERE qual UNION SELECT * FROM ft WHERE qual. It is also a problem
> for parallel query: in a parallel sequential scan, the next worker can
> begin reading the next block even if the current block hasn't yet been
> received from the OS. Whether or not this will be efficient is a
> research question, but it can be done. However, imagine a parallel
> scan of a btree index: we don't know what page to scan next until we
> read the previous page and examine the next-pointer. In the meantime,
> any worker that arrives at that scan node has no choice but to block.
> It would be better if the scan node could instead say "hey, thanks for
> coming but I'm really not ready to be on-CPU just at the moment" and
> potentially allow the worker to go work in some other part of the
> query tree. For that worker to actually find useful work to do
> elsewhere, we'll probably need it to be the case either that the table
> is partitioned or the original query will need to involve UNION ALL,
> but those are not silly cases to worry about, particularly if we get
> native partitioning in 9.7.
>
> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one. Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior, since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later. I think that's
> probably right. For example, consider a 5-table join where all of
> the joins are implemented as hash tables. If this query plan is going
> to be run to completion, it would make much more sense to fetch, say,
> 100 tuples from the driving scan and then probe for all of those in
> the first hash table, and then probe for all of those in the second
> hash table, and so on. What we do instead is fetch one tuple and
> probe for it in all 5 hash tables, and then repeat. If one of those
> hash tables would fit in the CPU cache but all five together will not,
> that seems likely to be a lot worse. But even just ignoring the CPU
> cache aspect of it for a minute, suppose you want to write a loop to
> perform a hash join. The inner loop fetches the next tuple from the
> probe table and does a hash lookup. Right now, fetching the next
> tuple from the probe table means calling a function which in turn
> calls another function which probably calls another function which
> probably calls another function and now about 4 layers down we
> actually get the next tuple. If the scan returned a batch of tuples
> to the hash join, fetching the next tuple from the batch would
> probably be 0 or 1 function calls rather than ... more. Admittedly,
> you've got to consider the cost of marshaling the batches but I'm
> optimistic that there are cycles to be squeezed out here. We might
> also want to consider storing batches of tuples in a column-optimized
> rather than row-optimized format so that iterating through one or two
> attributes across every tuple in the batch touches the minimal number
> of cache lines.
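
This is roughly how I picture the batch interface; TupleBatch and
probe_one() are names made up just for illustration:

typedef struct TupleBatch
{
    int              ntuples;
    TupleTableSlot **slots;     /* or column-major arrays, per above */
} TupleBatch;

static void
hashjoin_probe_batch(HashJoinState *hjstate, TupleBatch *batch)
{
    int     i;

    for (i = 0; i < batch->ntuples; i++)
    {
        /*
         * Fetching the next probe tuple is plain array indexing here,
         * not a chain of function calls 4 layers deep.
         */
        probe_one(hjstate, batch->slots[i]);
    }
}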
>
> Obviously, both of these are big projects that could touch a large
> amount of executor code, and there may be other ideas, in addition to
> these, which some of you may be thinking about that could also touch a
> large amount of executor code. It would be nice to agree on a way
> forward that minimizes code churn and maximizes everyone's attempt to
> contribute without conflicting with each other. Also, it seems
> desirable to enable, as far as possible, incremental development - in
> particular, it seems to me that it would be good to pick a design that
> doesn't require massive changes to every node all at once. A single
> patch that adds some capability to every node in the executor in one
> fell swoop is going to be too large to review effectively.
>
> My proposal for how to do this is to make ExecProcNode function as a
> backward-compatibility wrapper. For asynchronous execution, a node
> might return a not-ready-yet indication, but if that node is called
> via ExecProcNode, it means the caller isn't prepared to receive such
> an indication, so ExecProcNode will just wait for the node to become
> ready and then return the tuple. Similarly, for vectorized execution,
> a node might return a bunch of tuples all at once. ExecProcNode will
> extract the first one and return it to the caller, and subsequent
> calls to ExecProcNode will iterate through the rest of the batch, only
> calling the underlying node-specific function when the batch is
> exhausted. In this way, code that doesn't know about the new stuff
> can continue to work pretty much as it does today. Also, and I think
> this is important, nodes don't need the permission of their parent
> node to use these new capabilities. They can use them whenever they
> wish, without worrying about whether the upper node is prepared to
> deal with it. If not, ExecProcNode will paper over the problem. This
> seems to me to be a good way to keep the code simple.
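
For reference, this is how I understand the wrapper would behave
(control flow only; ExecDispatchNode, ExecEventLoopWait,
ExecExtractSingleTuple and the batch fields are placeholder names of
mine):

TupleTableSlot *
ExecProcNode(PlanState *node)
{
    /* First drain any batch left over from a vectorized child. */
    if (node->batch != NULL && node->batch_pos < node->batch_len)
        return node->batch[node->batch_pos++];

    node->result_ready = false;
    ExecDispatchNode(node);         /* calls ExecSeqScan() etc. */

    /* Async case: node returned with no result yet, so wait. */
    while (!node->result_ready)
        ExecEventLoopWait(node);

    return ExecExtractSingleTuple(node);    /* unwrap node->result */
}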
>
> For asynchronous execution, I have gone so far as to mock up a bit of
> what this might look like. This shouldn't be taken very seriously at
> this point, but I'm attaching a few very-much-WIP patches to show the
> direction of my line of thinking. Basically, I propose to have
> ExecBlah (that is, ExecBitmapHeapScan, ExecAppend, etc.) return tuples
> by putting them into a new PlanState member called "result", which is
> just a Node * so that we can support multiple types of results,
> instead of returning them. There is also a result_ready boolean, so
> that a node can return without setting this Boolean to engage
> asynchronous behavior. This triggers an "event loop", which
> repeatedly waits for FDs chosen by waiting nodes to become readable
> and/or writeable and then gives the node a chance to react.
> Eventually, the waiting node will stop waiting and have a result
> ready, at which point the event loop will give the parent of that node
> a chance to run. If that node consequently becomes ready, then its
> parent gets a chance to run. Eventually (we hope), the node for which
> we're waiting becomes ready, and we can then read a result tuple.
> With some more work, this seems like it can handle the FDW case, but I
> haven't worked out how to make it handle the related parallel query
> case. What we want there is to wait not for the readiness of an FD
> but rather for some other process involved in the parallel query to
> reach a point where it can welcome assistance executing that node. I
> don't know exactly what the signaling for that should look like yet -
> maybe setting the process latch or something.
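
If I read this right, the core of the event loop would look something
like the following (WaitEventSetWait is the real 9.6 API; es_wait_set
and ExecAsyncNotify are placeholders):

static void
ExecEventLoop(EState *estate, PlanState *waiting_for)
{
    while (!waiting_for->result_ready)
    {
        WaitEvent   ev;

        /* Block until some fd that a node registered becomes ready. */
        WaitEventSetWait(estate->es_wait_set, -1, &ev, 1);

        /* Let the owning node react; readiness bubbles up from there. */
        ExecAsyncNotify((PlanState *) ev.user_data);
    }
}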
>
> By the way, one smaller executor project that I think we should also
> look at has to do with this comment in nodeSeqScan.c:
>
> static bool
> SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
> {
>     /*
>      * Note that unlike IndexScan, SeqScan never use keys in heap_beginscan
>      * (and this is very bad) - so, here we do not check are keys ok or not.
>      */
>     return true;
> }
>
> Some quick prototyping by my colleague Dilip Kumar suggests that, in
> fact, there are cases where pushing down keys into heap_beginscan()
> could be significantly faster. Some care is required here because any
> functions we execute as scan keys are run with the buffer locked, so
> we had better not run anything very complicated. But doing this for
> simple things like integer equality operators seems like it could save
> quite a few buffer lock/unlock cycles and some other executor overhead
> as well.
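
For what it's worth, pushing a simple key down would presumably just
mean building a ScanKey and handing it to heap_beginscan(), e.g. for
WHERE a = 42 on an int4 column at attno 1:

ScanKeyData key;
HeapScanDesc scandesc;

ScanKeyInit(&key,
            1,                      /* attno of "a" */
            BTEqualStrategyNumber,
            F_INT4EQ,               /* int4eq */
            Int32GetDatum(42));

scandesc = heap_beginscan(rel, snapshot, 1, &key);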
>
> Thoughts, ideas, suggestions, etc. very welcome.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company