From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | "Dirschel, Steve" <steve(dot)dirschel(at)thomsonreuters(dot)com> |
Cc: | "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Hash join and picking which result set to build the hash table with. |
Date: | 2024-05-22 20:53:53 |
Message-ID: | 14387.1716411233@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
"Dirschel, Steve" <steve(dot)dirschel(at)thomsonreuters(dot)com> writes:
> The query and execution plan are shown below. My question is
> related to the result set the optimizer is choosing to build the
> hash table from. My understanding is for a hash join you want to
> build the hash table out of the smaller result set.
That's *a* consideration, but not the only one. We also consider
whether the hash key has a flat distribution; if it is too skewed,
we might find specific hash chains getting too long.
> When running some tests I forgot to create the PK on table
> docloc_test. When the PK was not on the table the optimizer decided
> to create the hash table off the 1000 rows from collection. But as
> soon as I put the PK on that table it then decides to use
> docloc_test to build the hash table.
I think that the presence of a unique index overrides the statistics
(or the lack of any) so that the planner knows that the column is
unique and thus safe to use as a hash key. Now, it should have
known that anyway, unless maybe this is a freshly-built table that
auto-analyze hasn't gotten to yet?
regards, tom lane
From | Date | Subject | |
---|---|---|---|
Next Message | sud | 2024-05-22 21:16:31 | Long running query causing XID limit breach |
Previous Message | Dirschel, Steve | 2024-05-22 20:42:39 | Hash join and picking which result set to build the hash table with. |