Re: OLAP/reporting queries fall into nested loops over seq scans or other horrible planner choices

From: Gunther <raj(at)gusw(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: OLAP/reporting queries fall into nested loops over seq scans or other horrible planner choices
Date: 2017-11-02 17:14:29
Message-ID: ad733971-b373-e140-49e6-c077232dda23@gusw.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Thanks you for your thoughtful reply, Laurenz (funny that the people
interested in this topic are named Laurent and Laurenz :)

> PostgreSQL doesn't have a way to tell if a query is an OLAP query
> running against a star schema or a regular OLTP query, it will treat
> both in the same fashion.
right, of course, and I would not want to go down that road. There OLAP
vs. OLTP are not just two cut and dry options, and neither is "star
schema" but one way in which to lay out a simple data model. The real
world is always more complex than such cut and dry choices
> However, it is not true that PostgreSQL "perfers nested loops".
> Sometimes a nested loop join is the only sane and efficient
> way to process a query ...
of course, it's not preferring NLs deliberately, but it happens awfully
often (and not just with PgSQL, same problems I have had with Oracle
over the years).
> Bad choices are almost always caused by bad estimates.
> Granted, there is no way that estimates can ever be perfect.
> ...
> Looking deeper, I would say that wrongly chosen nested loop joins
> often come from an underestimate that is close to zero.
> PostgreSQL already clamps row count estimates to 1, that is, it will
> choose an estimate of 1 whenever it thinks fewer rows will be returned.
>
> Perhaps using a higher clamp like 2 would get rid of many of your
> problems, but it is a difficult gamble as it will also prevent some
> nested loop joins that would have been the best solution.
Wow, that is very interesting! Are you saying that if PgSQL can't know
what the cardinality is, it assumes a default of 1? That would be very
slanted a guess. I would think a couple of hundred would be more
appropriate, or 10% of the average of the base tables for which it does
have statistics. I would wonder if changing 1 to 2 would make much
difference, as Seq Search over 1 to 10 tuples should generally be better
than any other approach, as long as the 1-10 tuples are already readily
available.
> Finally, even though the official line of PostgreSQL is to *not* have
> query hints, and for a number of good reasons, this is far from being
> an unanimous decision. The scales may tip at some point, though I
> personally hope that this point is not too close.

I am glad to hear that hints are not completely ruled out by the
development team. Definitely Oracle hints are painful and should not be
replicated as is.  Butmay be I can nudge your (and others') personal
tastes with the following.

You suggested this:

> One pragmatic solution would be to wrap every query that you know
> to be an OLAP query with
> BEGIN;
> SET LOCAL enable_nestloop=off;
> SELECT ...
> COMMIT;
I would also like to put the set enable_nestloop = false statement into
a combined statement, but when I do it in a transaction like you showed,
it would not work for a normal PreparedStatement just expecting a
ResultSet, or at least I haven't been able to make that work. In my Aqua
Data Studio, if I put the set statement before the select statement, the
combined statement doesn't return any results. May be I am doing
something wrong. If there is a way, then I would ave what I need.

If not, I think it might be an easy thing to add.

We already have different scopes of these optimizer parameters like
enable_nestloop

1. the system wide scope

2. a session wide scope

and I see no reason why one could not just add a non-disruptive syntax
form to change these parameters on a statement-wide scope. By all means
in a comment.

Why not

--! set enable_nestloop = false
--! set work_mem = '20 MB'
SELECT *
  FROM ....
;

something like that. It would not be a big deal, no completely new
obscure hint syntax.

And may be, if that is possible so far, then why not add a CTE scope as
well:

WITH Foo AS (
--! set enable_nestloop = false
  SELECT * FROM ... INNER JOIN ... INNER JOIN ... INNER JOIN ... ...
) , Bar AS (
  SELECT * FROM Foo INNER JOIN IndexedTable USING(a, b, c)
)
SELECT * FROM Bar ...
;

this would keep the nestloop off for the CTE Foo with that complex join
but allow it to be used for the CTE Bar or the ultimate query.

I think these features should be relatively easy to add without causing
SQL compatibility issue and also not opening a can of worms with obscure
hint features that need a lot of work to implement correctly.

But while we are at dreaming up solution, I think materialized indexed
sub-plans would also be a nice ting, especially when dealing with CTEs.
This could be controlled manually to begin with:

WITH Foo AS (
--! set enable_nestloop = false
  SELECT * FROM ... INNER JOIN ... INNER JOIN ... INNER JOIN ... ...
)  MATERIALIZE INDEX ON(a, b, c)
, Bar AS (
  SELECT * FROM Foo INNER JOIN IndexedTable USING(a, b, c)
)
SELECT * FROM Bar ...
;

And of course if we don't want to disturb SQL syntax, the "materialize
index on ..." clause could be in a --! comment.

But then, to dream on, PgSQL could make sub-query plans a temporary
table and add indexes that it needs for it on the fly, because after it
has done one pass over the inner loop sequential scan it already has a
perfect guess of what the cardinality is, and knowing how many more
iterations are coming from the sub-query that's driving the Nested Loop,
it could decide that it's much faster to put an index on the nested
relation, temporarily materialized. Or it could even decide to change
it's plan mid-way and do the Hash Join.

Let's call them dynamic feedback plan optimization.

This is why I had always dreamed that the PgSQL optimizer had some easy
API where one could plug in experimental strategies. I personally am
extremely efficient with XSLT for complex intelligent algorithms, and I
dream of a PgSQL query plan structure exposed as XML which an XSLT
plugin could then process to edit the plan. People could experiment with
awesome intelligent new strategies based on statistics gathered along
the way of the execution.

regards,
-Gunther

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Robert Haas 2017-11-02 17:21:54 Re: proposal: schema variables
Previous Message Gunther 2017-11-02 16:36:13 Re: OLAP/reporting queries fall into nested loops over seq scans or other horrible planner choices