Re: big joins not converging

From: Steve Atkins <steve(at)blighty(dot)com>
To: PostgreSQL Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: big joins not converging
Date: 2011-03-10 22:13:02
Message-ID: 6C998CEA-5A42-4DA9-B5E0-33C17C1BF69E@blighty.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


On Mar 10, 2011, at 1:25 PM, Dan Ancona wrote:

> Hi postgressers -
>
> As part of my work with voter file data, I pretty regularly have to join one large-ish (over 500k rows) table to another. Sometimes this is via a text field (countyname) + integer (voter id). I've noticed sometimes this converges and sometimes it doesn't, seemingly regardless of how I index things. So I'm looking for general thoughts on the joining of large tables, but also running into a specific issue with the following slightly different query:
>
> This one is between two tables that are a 754k row list of voters and a 445k row list of property owners. (I'm trying to find records where the owner name matches the voter name at the same address.) I have btree single column indices built on all the relevant fields, and multicolumn indices built across all the columns I'm matching. The full schemas of both tables are below. The machine is an older-ish (3 years ago) dual-core pentium w/ 4GB RAM running FreeBSD, more details below.
>
> This is the query I've come up with so far:
>
> explain analyze
> update vanalameda set ownerflag = 'exact'
> from aralameda where
> vanalameda.streetno ~~ aralameda.streetnum and
> vanalameda.streetname ~~ aralameda.streetname and
> vanalameda.lastname ~~ aralameda.ownername and
> vanalameda.firstname ~~ aralameda.ownername;
>
> If I include the analyze, this didn't complete after running overnight. If I drop the analyze and just explain, I get this:
>
> "Nested Loop (cost=46690.74..15384448712.74 rows=204 width=204)"
> " Join Filter: (((vanalameda.streetno)::text ~~ (aralameda.streetnum)::text) AND ((vanalameda.streetname)::text ~~ (aralameda.streetname)::text) AND ((vanalameda.lastname)::text ~~ (aralameda.ownername)::text) AND ((vanalameda.firstname)::text ~~ (aralameda.ownername)::text))"
> " -> Seq Scan on vanalameda (cost=0.00..26597.80 rows=734780 width=204)"
> " -> Materialize (cost=46690.74..58735.87 rows=444613 width=113)"
> " -> Seq Scan on aralameda (cost=0.00..38647.13 rows=444613 width=113)"
>
> One general question: does the width of the tables (i.e. the numbers of columns not being joined and the size of those fields) matter? The tables do have a lot of extra columns that I could slice out.
>

Is there any reason you're using '~~' to compare values, rather than '='?

If you're intentionally using LIKE-style comparisons then there are some other things you can do, but I don't think you mean to do that, for streeno and streetname anyway.

Switching to an equality comparison should let your query use an index, most usefully one on (streetname, streetnum) probably.

I'm not sure what you're intending by comparing ownername to both firstname and lastname. I don't think that'll do anything useful, and doubt it'll ever match. Are you expecting firstname and lastname to be substrings of ownername? If so, you might need to use wildcards with the like.

(Also, performance and smart use of indexes tends to get better in newer versions of postgresql. You might want to upgrade to 9.0.3 too.)

Cheers,
Steve

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Marti Raudsepp 2011-03-10 22:42:33 Re: Tuning massive UPDATES and GROUP BY's?
Previous Message Dan Ancona 2011-03-10 21:25:24 big joins not converging