Re: [SQL] Confusion about DISTINCT.

From: Bruce Momjian <maillist(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Chris Bitmead <chris(dot)bitmead(at)bigfoot(dot)com>, "pgsql-sql(at)postgresql(dot)org" <pgsql-sql(at)postgreSQL(dot)org>
Subject: Re: [SQL] Confusion about DISTINCT.
Date: 1999-05-10 16:25:06
Message-ID: 199905101625.MAA06754@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Nice summary of the issue.

> Chris Bitmead <chris(dot)bitmead(at)bigfoot(dot)com> writes:
> > I'm a bit confused about DISTINCT. Firstly, what does DISTINCT mean when
> > you don't use "ON".
>
> Means the equality test is taken across all columns of the output.
> "ON" restricts the test for "duplicate row" to look at just some
> of the columns (thus creating the issue of which tuple out of a
> group of "duplicates" gets through the filter to determine the values
> of the other columns).
>
> > Like this...
> > SELECT DISTINCT category.image FROM story, category* WHERE
> > story.category = category.oid ORDER BY datetime DESC;
> > The DISTINCT here has no effect on the output other than changing the
> > ordering. There are duplicates!
>
> This is a known bug, or at least IMHO it's a bug; it's triggered by your
> use of ORDER BY with DISTINCT. You need to say
> ORDER BY category.image, datetime DESC
> to make the above example work.
>
> The reason is that DISTINCT is implemented by a sort followed by an
> adjacent-duplicate-elimination pass (just like "sort | uniq" in Unix
> shell programming). When you put on an explicit ORDER BY clause,
> you override the sort order that the DISTINCT wants, and so the
> duplicate filter doesn't necessarily spot the duplicates. You have
> to make sure to sort in an order that will keep the unwanted duplicates
> together, ie, list all the DISTINCT columns first in the ORDER BY.
>
> A straightforward solution would be to apply the user-specified ORDER BY
> sort *after* the DISTINCT-generated sort and dup-filter steps. I used
> to think this was the right fix. However, this would break that nifty
> technique of letting a user-specified ORDER BY resolve the ambiguity of
> DISTINCT ON --- the user ordering has to happen before the dup-filter
> for that to work.
>
> Alternatively, the system could alter the user's ORDER BY to ensure the
> DISTINCT columns are the primary sort keys --- ie, silently change your
> ORDER BY in the above example. I don't think I like that too much
> either. For one thing, people would complain that the resulting order
> wasn't what they asked for. For another, it seems possible that there
> are applications where applying the dup-filter to columns that aren't
> the primary keys might be useful. Let's see, if you did
> SELECT DISTINCT category.image FROM story, category* WHERE
> story.category = category.oid ORDER BY datetime DESC, category.image;
> then you'd get each image listed only once *per date* (I think).
> Doesn't that sound like it could be a useful behavior?
>
> So right at the moment, I think the system ought not tinker with the
> user's ORDER BY. But it probably should emit a warning message if the
> ORDER BY fails to mention all the columns being DISTINCTed on. If you
> aren't sorting by a DISTINCT column at all, you won't even get
> reproducible results, let alone desirable ones.
>
> > 2nd question: Why does the following query result in duplicates even
> > though I use DISTINCT ON?
>
> Same deal.
>
> regards, tom lane
>
>

--
Bruce Momjian | http://www.op.net/~candle
maillist(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Tom Lane 1999-05-10 16:29:26 Re: [SQL] Simple Optimization Problem
Previous Message Matthias Seiferth 1999-05-10 15:23:02 Retrieving column names and table names of a database