Re: Accounting for between table correlation

From: Alexander Stoddard <alexander(dot)stoddard(at)gmail(dot)com>
To: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: Thomas Kellerer <shammat(at)gmx(dot)net>, "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Accounting for between table correlation
Date: 2021-01-15 17:21:56
Message-ID: CADDNc-B7fHboEt8NZk4PPcXL6KmEpY9qgXywPs656_h_FThBWQ@mail.gmail.com
Lists: pgsql-general

Unfortunately I'm not free to share the specific schema or the query plans.
They derive from an upstream vendor that is 'protective' of their data
model. To provide a proper example I'll need to recreate the behavior with
generic data in an anonymized schema.

I apologize for being frustratingly vague, and I do feel like an idiot for
not at least saying this is PG 11.10.
It has been helpful that the replies so far appear to confirm my
understanding that no version of Postgres has cross-table correlation
statistics available to the planner.
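
For what it's worth, my understanding is that extended statistics (CREATE
STATISTICS, available since PG 10) are the closest built-in facility, but
the column list has to come from a single table, so they can't describe a
relationship across a join. Roughly like this - the table and column names
below are made up, since I can't share the real schema:

  -- Captures correlation between columns of ONE table only;
  -- the ON list cannot reference columns from a second table.
  CREATE STATISTICS visit_cols_dep (dependencies, ndistinct)
      ON clinic_id, provider_id
      FROM patient_visit;

  ANALYZE patient_visit;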

Analyze is run frequently, and I think at a sufficient sample size. The
context is a large data-analysis setting, and the data changes via bulk ETL,
not OLTP. The effect of re-analyzing is simply instability: without the
underlying data changing, the plan can flip in either direction between
returning an answer in minutes and timing out after tens of hours. That
could indicate too small a sample, but I think in this case the statistics
just aren't useful, because the cross-table correlation can't be accounted
for.
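
To rule out sample size, the generic shape of the check (again with made-up
names) is to raise the per-column statistics target well above the default
of 100, re-analyze, and compare estimated vs. actual rows on the join:

  ALTER TABLE patient_visit ALTER COLUMN clinic_id SET STATISTICS 1000;
  ANALYZE patient_visit;

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT v.*
  FROM patient_visit v
  JOIN visit_detail d ON d.visit_id = v.visit_id
  WHERE v.clinic_id = 42;

If the row estimates barely move at the larger target, that points away from
sample size as the limiting factor.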

The 'fast plans' use parallel seq scans. The 'slow plans' use index scans.
It appears a good query plan correctly predicts that it should be bulk
processing the tables, while a bad one is fooled into thrashing (hard disk,
not SSD) by underestimating the number of rows to be joined between the
tables.
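
The only workarounds I can think of so far are blunt session-level ones,
along these lines (the values are illustrative, not tuned recommendations):

  -- Make random I/O look more expensive relative to sequential I/O,
  -- which it genuinely is on spinning disks under concurrent load.
  SET random_page_cost = 6.0;

  -- Or, while investigating, disable index scans for the session so
  -- the bulk-processing plan is chosen:
  SET enable_indexscan = off;
  SET enable_indexonlyscan = off;

I'd rather not have to pin plans that way, which is part of why I'm asking.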

If anyone has similar experiences and is generous enough with their time to
share possible solutions or workarounds, I'm most grateful. If my
description is too vague to be worth answering, I quite understand and
apologize for the time spent reading it.

Thank you.

On Fri, Jan 15, 2021 at 10:26 AM David G. Johnston <
david(dot)g(dot)johnston(at)gmail(dot)com> wrote:

> On Fri, Jan 15, 2021 at 9:10 AM Thomas Kellerer <shammat(at)gmx(dot)net> wrote:
>
>> Atul Kumar schrieb am 15.01.2021 um 16:29:
>> > As per Ron, you are not supposed to ask your questions here.
>> >
>> > As According to him, we should keep on doing research on internet
>> > rather than asking for support directly even you have done enough
>> > research and until unless “Ron” won’t be satisfied you have to do
>> > keep on researching.
>>
>> Ron's question was perfectly valid.
>>
>> Missing and wrong statistics are one reason for the planner to choose a
>> bad execution plan.
>>
>
> Yeah, at first blush I didn't think analyze really mattered (and it mostly
> doesn't because while you can keep the statistics up-to-date the
> multi-table nature of the problem means they are only marginally helpful
> here), but that just points out the under-specified nature of the original
> posting. Taken as a simple question of "is there a way to work around the
> lack of multi-table statistics" the analyze, and even the specific queries,
> don't matter all that much. But it also would be much more useful if the
> OP would choose a single problematic query and show the schema, query, and
> explain results, hopefully both good and bad, and comment on how analyze
> seems to affect the plan choice. But for the general question about
> overcoming our statistics limitations the analyze point is not relevant.
>
> David J.
>
>
