| From: | Bruno Wolff III <bruno(at)wolff(dot)to> | 
|---|---|
| To: | Kim Bisgaard <kib+pg(at)dmi(dot)dk> | 
| Cc: | pgsql-performance(at)postgresql(dot)org | 
| Subject: | Re: full outer performance problem | 
| Date: | 2005-06-08 12:17:55 | 
| Message-ID: | 20050608121755.GB11464@wolff.to | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-performance | 
On Wed, Jun 08, 2005 at 11:37:40 +0200,
  Kim Bisgaard <kib+pg(at)dmi(dot)dk> wrote:
> Hi,
> 
> I'm having problems with the query optimizer and FULL OUTER JOIN on 
> PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins. 
> I might be naive, but I think that it should be possible?
> 
> I have two BIG tables (virtually identical) with 3 NOT NULL columns 
> Station_id, TimeObs, Temp_XXXX, with unique indexes on (Station_id, 
> TimeObs) and valid ANALYSE (set statistics=100). I want to join the two 
> tables with a FULL OUTER JOIN.
> 
> When I specify the query as:
> 
> SELECT station_id, timeobs,temp_grass, temp_dry_at_2m
>        FROM temp_dry_at_2m a
>        FULL OUTER JOIN temp_grass b 
>        USING (station_id, timeobs)
>        WHERE station_id = 52981
>          AND timeobs = '2004-1-1 0:0:0'
> 
> I get the correct results
> 
> station_id |       timeobs       | temp_grass | temp_dry_at_2m
> ------------+---------------------+------------+----------------
>      52944 | 2004-01-01 00:10:00 |            |           -1.1
> (1 row)
> 
> BUT LOUSY performance, and the following EXPLAIN:
> 
>                                                                                  QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Merge Full Join  (cost=1542369.83..1618958.58 rows=6956994 width=32) 
> (actual time=187176.408..201436.264 rows=1 loops=1)
>   Merge Cond: (("outer".station_id = "inner".station_id) AND 
>   ("outer".timeobs = "inner".timeobs))
>   Filter: ((COALESCE("outer".station_id, "inner".station_id) = 52981) AND 
>   (COALESCE("outer".timeobs, "inner".timeobs) = '2004-01-01 
>   00:00:00'::timestamp without time zone))
>   ->  Sort  (cost=1207913.44..1225305.93 rows=6956994 width=16) (actual 
>   time=145748.253..153851.607 rows=6956994 loops=1)
>         Sort Key: a.station_id, a.timeobs
>         ->  Seq Scan on temp_dry_at_2m a  (cost=0.00..117549.94 
>         rows=6956994 width=16) (actual time=0.049..54226.770 rows=6956994 
>         loops=1)
>   ->  Sort  (cost=334456.38..340472.11 rows=2406292 width=16) (actual 
>   time=31668.876..34491.123 rows=2406292 loops=1)
>         Sort Key: b.station_id, b.timeobs
>         ->  Seq Scan on temp_grass b  (cost=0.00..40658.92 rows=2406292 
>         width=16) (actual time=0.052..5484.489 rows=2406292 loops=1)
> Total runtime: 201795.989 ms
> (10 rows)
Someone else will need to comment on why Postgres can't use a more
efficient plan. What I think will work for you is to restrict
the station_id and timeobs on each side and then do a full join.
You can try something like the sample query below (which hasn't been tested):
SELECT station_id, timeobs, temp_grass, temp_dry_at_2m
  FROM
    (SELECT station_id, timeobs, temp_dry_at_2m
      FROM temp_dry_at_2m
      WHERE
        station_id = 52981
        AND
        timeobs = '2004-1-1 0:0:0') a
    FULL OUTER JOIN
    (SELECT station_id, timeobs, temp_grass
      FROM temp_grass
      WHERE
        station_id = 52981
        AND
        timeobs = '2004-1-1 0:0:0') b
    USING (station_id, timeobs)
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Martin Fandel | 2005-06-08 12:25:52 | Re: Filesystem | 
| Previous Message | Michael Stone | 2005-06-08 12:10:10 | Re: Filesystem |