Re: queries with lots of UNIONed relations

From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: queries with lots of UNIONed relations
Date: 2011-01-14 20:11:11
Message-ID: AANLkTinJpNmdothKo67QK0eZN20s4Mfxbq5b7tfSEW0C@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Thu, Jan 13, 2011 at 6:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> writes:
>> On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> If you have enough memory to de-dup them individually, you surely have
>>> enough to de-dup all at once.
>
>> If everything were available up-front, sure.
>> However, and please correct me if I'm wrong, but doesn't postgresql
>> work in a fairly linear fashion, moving from table to table performing
>> a series of operations on each?
>
> Doing a single sort+uniq works like that.  But the alternate plan you
> are proposing we should consider involves building all the lower
> hashtables, and then reading from them to fill the upper hashtable.
> Max memory consumption *is* worst case here.  Remember HashAggregate
> is incapable of swapping to disk (and if it did, you wouldn't be nearly
> as pleased with its performance).

That's not exactly what I'm proposing - but it is probably due to a
lack of understanding some of the underlying details of how postgresql
works. I guess I had assumed that the result of a HashAggregate or any
other de-duplication process was a table-like structure.

Regarding being pleased with hash aggregate - I am! - except when it
goes crazy and eats all of the memory in the machine. I'd trade a bit
of performance loss for not using up all of the memory and crashing.

However, maybe I'm misunderstanding how SELECT DISTINCT works internally.
In the case where a hashtable is used, does postgresql utilize
table-like structure or does it remain a hashtable in memory?

If it's a hashtable, couldn't the hashtable be built on-the-go rather
than only after all of the underlying tuples are available?

I'd love a bit more explanation as to how this works.

Another example of where this might be useful: I'm currently running
a SELECT DISTINCT query over some 500 million rows (120 contributory
tables). I expect a de-duplicated row count of well under 10% of that
500 million, probably below 1%. The plan as it stands is to execute a
series of sequential scans, appending each of the rows from each
contributory table and then aggregating them. If the expected
distinctness of each contributory subquery is, say, 5% then instead of
aggregating over 500 million tuples the aggregation would take place
over 25 million. In this case, that is a savings of 10 gigabytes,
approximately.

Yes, it's true, the same amount of data has to be scanned. However,
the amount of data that needs to be stored (in memory or on disk) in
order to provide a final de-duplication is much smaller.

--
Jon

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Maciek Sakrejda 2011-01-14 20:27:02 Re: "COPY TO stdout" statements occurrence in log files
Previous Message Fernando Mertins 2011-01-14 19:42:23 "COPY TO stdout" statements occurrence in log files