Re: big joins not converging

From: fork <forkandwait(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: big joins not converging
Date: 2011-03-10 23:48:53
Message-ID: loom.20110311T003919-939@post.gmane.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Steve Atkins <steve <at> blighty.com> writes:

>
>
> On Mar 10, 2011, at 1:25 PM, Dan Ancona wrote:
>
> > Hi postgressers -
> >
> > As part of my work with voter file data, I pretty regularly have to join one
large-ish (over 500k rows) table
> to another. Sometimes this is via a text field (countyname) + integer (voter
id). I've noticed sometimes
> this converges and sometimes it doesn't, seemingly regardless of how I index
things.

By "converge" you mean "finish running" -- "converge" has a lot of other
overtones for us amateur math types.

Note that I think you are doing "record linkage" which is a stepchild academic
of its own these days. It might bear some research. THere is also a CDC
matching program for text files freely downloadalbe to windows (ack), if you
hunt for it.

For now, my first thought is that you should try a few different matches, maybe
via PL/PGSQL functions, cascading the non-hits to the next step in the process
while shrinking your tables. upcase and delete all spaces, etc. First use
equality on all columns, which should be able to use indices, and separate those
records. Then try equality on a few columns. Then try some super fuzzy regexes
on a few columns. Etc.

You will also have to give some thought to scoring a match, with perfection a
one, but, say, name and birthday the same with all else different a .75, etc.

Also, soundex(), levenshtein, and other fuzzy string tools are your friend. I
want to write a version of SAS's COMPGED for Postgres, but I haven't got round
to it yet.

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Dan Ancona 2011-03-11 00:43:09 Re: big joins not converging
Previous Message Marti Raudsepp 2011-03-10 22:42:33 Re: Tuning massive UPDATES and GROUP BY's?