Re: Accounting for between table correlation

From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Alexander Stoddard <alexander(dot)stoddard(at)gmail(dot)com>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: Thomas Kellerer <shammat(at)gmx(dot)net>, "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Accounting for between table correlation
Date: 2021-01-15 18:17:35
Message-ID: 65a6db71-9182-1f0a-02c7-39cd011b34dd@aklaver.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 1/15/21 9:21 AM, Alexander Stoddard wrote:
> Unfortunately I'm not free to share the specific schema or the query
> plans. They derive from an upstream vendor that is 'protective' of their
> data model. To get to a proper example I'll need to recreate the
> behavior with generic data in a generified schema.
>
> I apologize for being frustratingly vague. I do feel like an idiot for
> not at least saying this was with version PG 11.10.
> It has been beneficial to me that the replies so far appear to validate
> my understanding that no version of postgres has cross table correlation
> statistics in the planner.
>
> Analyze is done frequently, and I think at sufficient sample size. The
> context is in a large data analysis setting and the data is changed via
> bulk ETL not OLTP. The effect on analyzing is just instability - the
> plan can flip in either direction (without underlying data changing)
> between giving an answer in minutes and timing out after 10s of hours.
> That could be indicative of too small a sample but I think in this case
> it is more the statistics aren't useful because it is cross table
> correlation that can't be accounted for.

So to be clear, the process imports the data, then you run a query and
it completes in x time, you then ANALYZE the same data and it runs in y
time. Is that correct?

>
> The 'fast plans' use parallel seq scans. The 'slow plans' is using index
> scans. It appears a good query plan correctly predicts it should be bulk
> processing the tables but bad ones get fooled into trashing (hard disk,
> not SSD) by mispredicting too few rows to join between the tables.
>
> If anyone has similar experiences and is generous enough with their time
> to share possible solutions/work arounds then I'm most grateful. If my
> description is too vague to be worthwhile answering then I quite
> understand and apologize for the time wasted in reading.
>
> Thank you.
>
>
>
>
>
> On Fri, Jan 15, 2021 at 10:26 AM David G. Johnston
> <david(dot)g(dot)johnston(at)gmail(dot)com <mailto:david(dot)g(dot)johnston(at)gmail(dot)com>> wrote:
>
> On Fri, Jan 15, 2021 at 9:10 AM Thomas Kellerer <shammat(at)gmx(dot)net
> <mailto:shammat(at)gmx(dot)net>> wrote:
>
> Atul Kumar schrieb am 15.01.2021 um 16:29:
> > As per Ron, you are not supposed to ask your questions here.
> >
> > As According to him, we should keep on doing research on internet
> > rather than asking for support directly even you have done enough
> > research and until unless “Ron” won’t be satisfied you have to do
> > keep on researching.
>
> Ron's question was perfectly valid.
>
> Missing and wrong statistics are one reason for the planner to
> choose a bad execution plan.
>
>
> Yeah, at first blush I didn't think analyze really mattered (and it
> mostly doesn't because while you can keep the statistics up-to-date
> the multi-table nature of the problem means they are only marginally
> helpful here), but that just points out the under-specified nature
> of the original posting.  Taken as a simple question of "is there a
> way to work around the lack of multi-table statistics" the analyze,
> and even the specific queries, don't matter all that much.  But it
> also would be much more useful if the OP would choose a single
> problematic query and show the schema, query, and explain results,
> hopefully both good and bad, and comment on how analyze seems to
> affect the plan choice.  But for the general question about
> overcoming our statistics limitations the analyze point is not relevant.
>
> David J.
>

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2021-01-15 18:18:55 Re: Best tools to monitor and fine tune postgres
Previous Message Alexander Stoddard 2021-01-15 17:21:56 Re: Accounting for between table correlation