From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: asynchronous and vectorized execution
Date: 2016-05-10 00:34:19
Message-ID: CAKJS1f_N16Chu-rEbFDVq1aw-dC+aOMqpb4XovuudRZ+Pc9nYA@mail.gmail.com
Lists: pgsql-hackers
On 10 May 2016 at 05:33, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one. Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior, since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later. I think that's
> probably right. For example, consider a 5-table join where all of
> the joins are implemented as hash tables. If this query plan is going
> to be run to completion, it would make much more sense to fetch, say,
> 100 tuples from the driving scan and then probe for all of those in
> the first hash table, and then probe for all of those in the second
> hash table, and so on. What we do instead is fetch one tuple and
> probe for it in all 5 hash tables, and then repeat. If one of those
> hash tables would fit in the CPU cache but all five together will not,
> that seems likely to be a lot worse. But even just ignoring the CPU
> cache aspect of it for a minute, suppose you want to write a loop to
> perform a hash join. The inner loop fetches the next tuple from the
> probe table and does a hash lookup. Right now, fetching the next
> tuple from the probe table means calling a function which in turn
> calls another function which probably calls another function which
> probably calls another function and now about 4 layers down we
> actually get the next tuple. If the scan returned a batch of tuples
> to the hash join, fetching the next tuple from the batch would
> probably be 0 or 1 function calls rather than ... more. Admittedly,
> you've got to consider the cost of marshaling the batches but I'm
> optimistic that there are cycles to be squeezed out here. We might
> also want to consider storing batches of tuples in a column-optimized
> rather than row-optimized format so that iterating through one or two
> attributes across every tuple in the batch touches the minimal number
> of cache lines.
It's interesting that you mention this. We identified this as a pain
point during our work on column stores last year. Simply passing
single tuples around the executor is really unfriendly towards the L1
instruction cache, on top of the points you mention about the L3
cache, hash tables, and tuple stores. I really do think we're likely
to see significant gains from processing more than one tuple at a
time, so this topic very much interests me.
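
To make the cache-line point concrete, here's a toy standalone C
illustration; the layouts and sizes are made up for the example and
have nothing to do with the actual executor code:

#include <stdio.h>

#define BATCH_SIZE 100

/* Row-optimised: one struct per tuple; scanning attribute "a" also
 * drags b, c and d through the cache. */
typedef struct { int a, b, c, d; } RowTuple;

/* Column-optimised: one array per attribute; scanning "a" is a dense,
 * contiguous read. */
typedef struct {
    int a[BATCH_SIZE];
    int b[BATCH_SIZE];
    int c[BATCH_SIZE];
    int d[BATCH_SIZE];
} ColumnBatch;

int main(void)
{
    RowTuple    rows[BATCH_SIZE] = {{0}};
    ColumnBatch cols = {{0}};
    long        sum_row = 0, sum_col = 0;
    int         i;

    for (i = 0; i < BATCH_SIZE; i++)
        sum_row += rows[i].a;   /* 16-byte stride: ~4x the cache lines */

    for (i = 0; i < BATCH_SIZE; i++)
        sum_col += cols.a[i];   /* contiguous: minimal cache lines */

    printf("%ld %ld\n", sum_row, sum_col);
    return 0;
}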
While researching this, we found that other people's research does
indicate that there are gains to be had:

http://www.openlinksw.com/weblog/oerling/

In that blog there's a table showing that this row-store database saw
a 4.4x performance improvement from changing from a tuple-at-a-time
executor to a batch-tuple executor:

Batch size 1 tuple = 122 seconds
Batch size 10k tuples = 27.7 seconds
When we start multiplying those gains with the gains from something
like parallel query, then we're starting to see very nice improvements
in performance.
Alvaro, Tomas, and I had been discussing this, and late last year I
looked into what would be required to allow it to happen in Postgres.
Basically there are two sub-projects. I'll describe what I've managed
to learn so far about each, and the rough plan I have to implement
them:
1. Batch Execution:

a. Modify the scan API to allow batch tuple fetching in predefined
batch sizes.
b. Modify TupleTableSlot to allow more than one tuple to be stored,
with a flag to indicate whether the struct contains a single tuple or
multiple tuples. Multiple tuples may need to be deformed in a non-lazy
fashion in order to prevent too many buffers from having to be pinned
at once. Tuples will be deformed into arrays for each column rather
than arrays for each tuple (this part is important for supporting the
next sub-project; a rough sketch of such a slot follows this list).
c. Modify some nodes (perhaps starting with nodeAgg.c) to allow them
to process a batch TupleTableSlot. This will require a tight loop to
aggregate the entire TupleTableSlot at once before returning.
d. Add a function in execAmi.c which returns true or false depending
on whether the node supports batch TupleTableSlots.
e. At executor startup, determine if the entire plan tree supports
batch TupleTableSlots, and if so, enable batch scan mode.
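
For illustration, here's a very rough sketch of what such a
batch-aware slot might carry. Every name here is made up; nothing like
this exists in PostgreSQL today:

/* Hypothetical sketch only -- not an existing or proposed struct. */
typedef struct BatchTupleTableSlot
{
    bool    tts_isbatch;    /* single tuple, or a batch? (see b above) */
    int     tts_ntuples;    /* number of tuples currently stored */
    int     tts_natts;      /* number of attributes per tuple */
    Datum **tts_colvalues;  /* tts_natts arrays of tts_ntuples Datums,
                             * i.e. one array per column, not per tuple */
    bool  **tts_colisnull;  /* matching NULL flags, one array per column */
} BatchTupleTableSlot;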
Those, at least, are my ideas for stage 1; there's still more to work
out. For example, should batch mode occur when the query has a LIMIT?
We might not want to waste time gathering up extra tuples when we're
just going to stop after the first one, so perhaps 'e' above should be
up to the planner instead. Further development work here might add a
node type that de-batches a TupleTableSlot, to allow nodes which don't
support batching to appear in the plan, i.e. a "mixed execution mode".
I'm less excited about this, as it may be difficult to cost the
de-batching operation; the time would probably be better spent just
batch-enabling the other node types, which *may* not be all that
difficult. I'm also assuming that batch mode (in all cases apart from
queries with LIMIT or cursors) will always be faster than
tuple-at-a-time, so it requires no costing from the planner.
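
To make 'd' and 'e' a little more concrete, the startup check might
look something like the sketch below. ExecSupportsBatchSlots() and the
walker itself are imagined; only the PlanState lefttree/righttree
links are real, and a real version would also have to handle nodes
with child lists, such as Append:

/* Hypothetical sketch of the startup check in 'e'. */
static bool
plan_tree_supports_batching(PlanState *node)
{
    if (node == NULL)
        return true;

    /* Imagined per-node capability test that would live in execAmi.c */
    if (!ExecSupportsBatchSlots(node))
        return false;

    return plan_tree_supports_batching(node->lefttree) &&
           plan_tree_supports_batching(node->righttree);
}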
2. Vector processing
(I admit that I've given this part much less thought so far, but
here's what I have in mind)
This depends on batch execution and is intended to allow the executor
to perform function calls on an entire batch at once, rather than
tuple-at-a-time. For example, take the following query:

SELECT a+b FROM t;

Here (as of today) we scan "t" one row at a time and perform a+b after
having deformed enough of the tuple to do so. We then go and get
another tuple from the scan node, and repeat until the scan gives us
no more tuples.
With batch execution we'd fetch multiple tuples from the scan and then
perform the call to, say, int4_pl() multiple times, which still kinda
sucks, as it means calling int4_pl() possibly millions of times (once
per tuple). Vector mode would require that we modify pg_operator to
add a vector function for each operator, so that we can call the
function passing in an array of Datums and a length and have SIMD
operations perform the addition. We'd then call something like
int4_pl_vector() only once per batch of tuples, allowing the CPU to
perform SIMD operations on those Datum arrays. This could be done
incrementally, as the code could simply fall back on the standard
function in cases where a vectorised version is not available. Some
thought is needed about when exactly that decision is made, as the
user may not have permission to execute the vector function, so it
can't simply be a run-time check. These functions would simply return
another vector of the results. Aggregates could be given a vector
transition function, where something like COUNT(*)'s vector_transfn
would simply do current_count += vector_length;
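
As a sketch of what those two functions might look like (both are
hypothetical; only the Datum conversion macros are real, and the real
int4pl() also performs overflow checks that I'm ignoring here):

/* Hypothetical vector variant of int4_pl(): a plain loop over Datum
 * arrays that a compiler can auto-vectorise into SIMD instructions. */
static void
int4_pl_vector(const Datum *a, const Datum *b, Datum *result, int n)
{
    int i;

    for (i = 0; i < n; i++)
        result[i] = Int32GetDatum(DatumGetInt32(a[i]) +
                                  DatumGetInt32(b[i]));
}

/* Hypothetical vector transition function for COUNT(*): consume a
 * whole batch with one addition rather than n transition calls. */
static void
count_star_vector_transfn(int64 *current_count, int vector_length)
{
    *current_count += vector_length;
}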
This project does appear to require that we bloat the code with
hundreds of vector versions of existing functions. I'm not quite sure
if there's a better way to handle this. The problem is that the fmgr
is pretty much a barrier to SIMD operations, and this is the only idea
I've had so far for breaking through that barrier, so further ideas
here are very welcome.
The idea here is that these two projects help pave the way to bringing
columnar storage into PostgreSQL. Without them, we're unlikely to get
much benefit from columnar storage, as we'd still be stuck processing
rows one at a time. Adding columnar storage on top of the above should
further increase performance, as we could skip the tuple-deform step
and pull columnar arrays/vectors directly into a TupleTableSlot,
although some trickery would be involved when the scan has keys.
I just want to add that both of the above require more thought. We
realised this was needed quite late in our column store work (which
we've all now taken a break from to work on other things), so we've
had little time to look much further into it. I should, however, be
starting work on this again in the next few months, in the hope of
having something, even the most simple version of it, in 9.7.
Comments are welcome.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services