Re: queries with lots of UNIONed relations

From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: queries with lots of UNIONed relations
Date: 2011-01-15 18:15:06
Message-ID: AANLkTik3580kzdmKLy66ZX-RFufSXJ4iMQUWsabTHqVb@mail.gmail.com
Lists: pgsql-performance

On Fri, Jan 14, 2011 at 2:11 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
> On Thu, Jan 13, 2011 at 6:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> writes:
>>> On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>> If you have enough memory to de-dup them individually, you surely have
>>>> enough to de-dup all at once.
>>
>>> If everything were available up-front, sure.
>>> However, and please correct me if I'm wrong, but doesn't postgresql
>>> work in a fairly linear fashion, moving from table to table performing
>>> a series of operations on each?
>>
>> Doing a single sort+uniq works like that.  But the alternate plan you
>> are proposing we should consider involves building all the lower
>> hashtables, and then reading from them to fill the upper hashtable.
>> Max memory consumption *is* worst case here.  Remember HashAggregate
>> is incapable of swapping to disk (and if it did, you wouldn't be nearly
>> as pleased with its performance).
>
> That's not exactly what I'm proposing - but it is probably due to a
> lack of understanding some of the underlying details of how postgresql
> works. I guess I had assumed that the result of a HashAggregate or any
> other de-duplication process was a table-like structure.

And I assumed wrong, I think. I dug into the code (nodeHash.c and
others) and I think I understand now why HashAggregate works only in
certain circumstances, and I think I understand your comments a bit
better now. Basically, HashAggregate doesn't stream unique Nodes the
way nodeUnique.c does. nodeUnique basically emits Nodes and elides
subsequent, identical Nodes, which is why it relies on the input being
sorted. HashAggregate works only on entire input sets at once, and
nodeHash.c doesn't emit Nodes at all, really.

This makes me wonder whether nodeHash.c and nodeHashjoin.c couldn't be
modified to output tuples in a streaming fashion. The memory
requirements would be no worse than they are now.

Does PostgreSQL support any sort of merge sort? If it did, then
whenever the hashtable started consuming too much memory it could be
flushed: the distinct values accumulated so far would be written out
to a temporary file, a fresh hashtable started, and so on. A final
merge sort across all of the temporary files, with Unique on top to
elide duplicates, would then effect the UNION operation.

--
Jon
