Re: queries with lots of UNIONed relations

From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: queries with lots of UNIONed relations
Date: 2011-01-13 23:14:04
Message-ID: AANLkTinubA5_CFAdERCyzDy1H3v409pyVTzdvPjTaPpr@mail.gmail.com
Lists: pgsql-performance

On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Thu, Jan 13, 2011 at 5:26 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> I don't believe there is any case where hashing each individual relation
>>> is a win compared to hashing them all together.  If the optimizer were
>>> smart enough to be considering the situation as a whole, it would always
>>> do the latter.
>
>> You might be right, but I'm not sure.  Suppose that there are 100
>> inheritance children, and each has 10,000 distinct values, but none of
>> them are common between the tables.  In that situation, de-duplicating
>> each individual table requires a hash table that can hold 10,000
>> entries.  But deduplicating everything at once requires a hash table
>> that can hold 1,000,000 entries.
>
>> Or am I all wet?
>
> If you have enough memory to de-dup them individually, you surely have
> enough to de-dup all at once.  It is not possible for a single hashtable
> to have worse memory consumption than N hashtables followed by a union
> hashtable, and in fact if there are no common values then the latter eats
> twice as much space because every value appears twice in two different
> hashtables.

If everything were available up-front, sure.
However (and please correct me if I'm wrong), doesn't PostgreSQL
work in a fairly linear fashion, moving from table to table and
performing a series of operations on each? That seems to indicate
the plan is:

Compare:

for each table LOOP
scan table for result rows, append to results
END LOOP
hash / sort + unique results

versus:

for each table LOOP
scan table for result rows, append to table-results
hash / sort+unique table-results, append to results
END LOOP
hash / sort + unique results

In the former case, all of the result rows from all tables are
appended together before de-duplication can start.

In the latter case, only enough memory for each table's result set is
needed for de-duplication, and it only has to be allocated while that
table is being processed.

Is that not how this works?
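The two loops above can be sketched as a toy simulation (my own sketch,
not PostgreSQL internals; function names are hypothetical), using
Robert's numbers upthread: 100 child tables with 10,000 distinct values
each and no overlap. Note that with fully disjoint values the final
unique pass still sees all 1,000,000 rows in both cases; the per-table
hash tables only bound the transient memory used at each step.

```python
def dedup_all_at_once(tables):
    """Former plan: append every table's rows, then one hash/unique pass."""
    results = []
    for t in tables:
        results.extend(t)          # scan table for result rows, append to results
    return set(results)            # single hash / sort + unique at the end

def dedup_per_table(tables):
    """Latter plan: hash/unique each table's rows, then a final pass."""
    results = []
    for t in tables:
        results.extend(set(t))     # per-table hash table, freed after each table
    return set(results)            # final hash / sort + unique over the union

# 100 child tables, 10,000 distinct values each, disjoint ranges.
tables = [range(i * 10_000, (i + 1) * 10_000) for i in range(100)]
assert len(dedup_all_at_once(tables)) == 1_000_000
assert len(dedup_per_table(tables)) == 1_000_000
```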

--
Jon
