execExprInterp() questions / How to improve scalar array op expr eval?

From: James Coleman <jtc331(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: execExprInterp() questions / How to improve scalar array op expr eval?
Date: 2020-04-11 12:58:46
Message-ID: CAAaqYe-UQBba7sScrucDOyHb7cDoNbWf_rcLrOWeD4ikP3_qTQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Short version:

In what I'm currently working on I had a few questions about arrays
and the execExpr/execExprInterp framework that didn't seem obviously
answered in the code or README.

- Does the execExpr/execExprInterp framework allow a scalar array op
to get an already expanded array (unless I'm missing something we
can't easily lookup a given index in a flattened array)?
- If not, is there a way in that framework to know if the array expr
has stayed the same through multiple evaluations of the expression
tree (i.e., so you could expand and sort it just once)?

Long version/background:

I've been looking at how a query like:
select * from t where i in (<long array>);
executes as a seq scan the execution time increases linearly with
respect to the length of the array. That's because in
execExprInterp.c's ExecEvalScalarArrayOp() we do a linear search
through the array.

In contrast, when setting up a btree scan with a similar saop, we
first sort the array, remove duplicates, and remove nulls. Of course
with btree scans this has other values, like allowing us to return
results in array order.

I've been considering approaches to improve the seq scan case. We might:
- At plan time rewrite as a hash join to a deduped array values
relation (gross, but included here since that's an approach we can
take rewriting the SQL itself as a proof of concept).
- At execution time build a hash and lookup.
- At execution time sort the array and binary search through it.

I've been working other last approach to see what results I might get
(it seemed like the easiest to hack together). Putting that together
left me with the questions mentioned above in the "short version".

Obviously if anyone has thoughts on the above approaches I'd be
interested in that too.

Side question: when we do:
arr = DatumGetArrayTypeP(*op->resvalue);
in ExecEvalScalarArrayOp() is that going to be expensive each time
through a seq scan? Or is it (likely) going to resolve to an already
in-memory array and effectively be the cost of retrieving that
pointer?

James

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2020-04-11 13:24:47 Re: WAL usage calculation patch
Previous Message Jürgen Purtz 2020-04-11 12:10:21 Re: Add A Glossary