Re: [PERFORM] Hash Anti Join performance degradation

From: panam <panam(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 22:49:31
Message-ID: 1306968571518-4446629.post@n5.nabble.com
Lists: pgsql-hackers pgsql-performance

I'd like to thank you all for getting this analyzed, especially Tom!
Your rigor is pretty impressive; it seems that without it, it would be
impossible to maintain a DBMS. In the end, I now know a lot more about
Postgres internals, and that this idiosyncrasy (from a user perspective)
could happen again. I guess this is the first time I have actually
encountered an unexpected worst-case scenario like this...
Seems it is up to me now to be a bit more creative with query optimization.
And in the end, it may well turn out to require an architectural change...
As the only thing to achieve is in fact to obtain the last id (currently
still with the constraint that it has to happen in an isolated subquery), I
wonder whether this requirement (obtaining the last id) is worth a special
technique/instrumentation/strategy (lacking a good word here), given that
this data has a full logical ordering (in this case even a total one) and
that the use case is quite common, I guess.

Some ideas from an earlier post:

panam wrote:
>
> ...
> This also made me wonder how the internal plan is carried out. Is the
> engine able to leverage the fact that a part/range of the rows ["/index
> entries"] is totally or partially ordered on disk, e.g. using some kind of
> binary search or even "nearest neighbor"-search in that section (i.e. a
> special "micro-plan" or algorithm)? Or is the speed-up "just" because
> related data is usually "nearby" and most of the standard algorithms work
> best with clustered data?
> If the former is not the case, would that be a potential point for
> improvement? Maybe it would even be more efficient if there were some
> sort of constraint that guarantees "ordered row" sections on disk,
> i.e. preventing the addition of a row that has an index value in between
> two row values of an already ordered/clustered section. In the simplest
> case, it would start with the "first" row and end with the "last" row (at
> the time of doing the equivalent of CLUSTER). So there would be a small
> list saying that rows with id x through id y are guaranteed to be ordered
> on disk (by id, for example) now and for all time.
>
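
As far as I can tell, the closest existing tool at the SQL level is CLUSTER,
which physically rewrites the table in index order but gives no guarantee
for rows inserted afterwards, so it has to be repeated; the "ordered
section" bookkeeping sketched above would go beyond that. For concreteness,
reusing the hypothetical index from the sketch above:

    -- one-time physical rewrite of the table in index order; takes an
    -- ACCESS EXCLUSIVE lock and does not keep later inserts in order
    CLUSTER message USING message_box_id_id_idx;
    -- refresh planner statistics (including the physical-order
    -- correlation used to cost index scans)
    ANALYZE message;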

Maybe I am completely off the mark, but what's your conclusion? Too much
effort for small scenarios? Nothing that should be handled at the DB level?
An attempt to battle the laws of thermodynamics with a small technical
dodge?

Thanks again
panam

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Re-PERFORM-Hash-Anti-Join-performance-degradation-tp4443803p4446629.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
