Re: Joins on TID

From: Darafei "Komяpa" Praliaskouski <me(at)komzpa(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Joins on TID
Date: 2018-12-23 16:23:33
Message-ID: CAC8Q8tL_xL-ReBmdYXyd2-EQP2w3=XON=u2nvczKc1D-E0rsow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Writing as someone who used TID joins and group by's in the past.

One use case is having a chance to peek into what will DELETE do.
A lot of GIS tables don't have any notion of ID, and dirty datasets tend to
have many duplicates you need to cross-reference with something else. So,
you write your query in form of

CREATE TABLE ttt as (SELECT distinct on (ctid) ctid as ct, field1, field2,
b.field3, ... from table b join othertable b on ST_Whatever(a.geom,
b.geom));

<connect to table with QGIS, poke around, maybe delete some rows you doubt
you want to remove>

DELETE FROM table a USING ttt b where a.ctid = b.ct;
DROP TABLE ttt;

Here:
- distinct on ctid is used (hash?)
- a.ctid = b.ct (hash join candidate?)

I know it's all better with proper IDs, but sometimes it works like that,
usually just once per dataset.

сб, 22 дек. 2018 г. в 19:31, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:

> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > On Sat, 22 Dec 2018 at 04:31, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> BTW, if we're to start taking joins on TID seriously, we should also
> >> add the missing hash opclass for TID, so that you can do hash joins
> >> when dealing with a lot of rows.
>
> > I don't think we are trying to do TID joins more seriously, just fix a
> > special case.
> > The case cited requires the batches of work to be small, so nested loops
> > works fine.
> > Looks to me that Edmund is trying to solve the same problem. If so, this
> is
> > the best solution.
>
> No, I think what Edmund is on about is unrelated, except that it touches
> some of the same code. He's interested in problems like "find the last
> few tuples in this table". You can solve that today, with e.g.
> "SELECT ... WHERE ctid >= '(n,1)'", but you get a stupidly inefficient
> plan. If we think that's a use-case worth supporting then it'd be
> reasonable to provide less inefficient implementation(s).
>
> What I'm thinking about in this thread is joins on TID, which we have only
> very weak support for today --- you'll basically always wind up with a
> mergejoin, which requires full-table scan and sort of its inputs. Still,
> that's better than a naive nestloop, and for years we've been figuring
> that that was good enough. Several people in the other thread that
> I cited felt that that isn't good enough. But if we think it's worth
> taking seriously, then IMO we need to add both parameterized scans (for
> nestloop-with-inner-fetch-by-tid) and hash join, because each of those
> can dominate depending on how many tuples you're joining.
>
> regards, tom lane
>
> --
Darafei Praliaskouski
Support me: http://patreon.com/komzpa

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2018-12-23 17:51:57 Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Previous Message Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?= 2018-12-23 15:33:34 Re: [PATCH] Improve tab completion for CREATE TABLE