From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: | "Craig James" <craig_james(at)emolecules(dot)com> |
Cc: | <pgsql-performance(at)postgresql(dot)org> |
Subject: | Re: Join with lower/upper limits doesn't scale well |
Date: | 2007-07-03 00:04:46 |
Message-ID: | 87hcom5p41.fsf@oxford.xeocode.com |
Lists: pgsql-performance
"Craig James" <craig_james(at)emolecules(dot)com> writes:
> Below is the explain/analyze output of the query from each database. Since
> both tables are indexed on the joined columns, I don't understand why the
> big table should be so much slower -- I hoped this would scale well, or at
> least O(log(N)), not O(N).
...
> Sort (cost=431000.85..431248.23 rows=98951 width=363) (actual time=46306.748..46417.448 rows=100000 loops=1)
> Sort Key: r.row_num
> -> Hash Join (cost=2583.59..422790.68 rows=98951 width=363) (actual time=469.010..45752.131 rows=100000 loops=1)
> Hash Cond: ("outer".version_id = "inner".version_id)
> -> Seq Scan on my_molkeys m (cost=0.00..323448.30 rows=5472530 width=363) (actual time=11.243..33299.933 rows=5472532 loops=1)
> -> Hash (cost=2336.21..2336.21 rows=98951 width=8) (actual time=442.260..442.260 rows=100000 loops=1)
> -> Index Scan using i_chm_rownum_row_num on my_rownum r (cost=0.00..2336.21 rows=98951 width=8) (actual time=47.551..278.736 rows=100000 loops=1)
> Index Cond: ((row_num >= 100000) AND (row_num < 200000))
> Total runtime: 46543.163 ms
It looks like most of the time is being spent in the sequential scan of
my_molkeys: at least 33 seconds out of the 46. The sort itself takes under a
second (the hash join finishes after 45.7s and the sort finishes at 46.4s).
The rest of the time (about 13s) is actually being spent in the hash join
itself, which makes me think it's overflowing work_mem and having to process
the join in two batches. You might be able to speed it up by raising work_mem
for this query (you can set work_mem locally using SET).
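For instance (a sketch only; the 64MB figure is a guess to illustrate, not a
recommendation):

    SET work_mem = '64MB';  -- hypothetical value; applies to this session only
    -- ... run the query ...
    RESET work_mem;         -- restore the server default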
The row_num where clause only narrows down the set of rows coming from the
my_rownum table. If you want sublinear performance you would have to provide
some way for Postgres to narrow down the rows from my_molkeys without actually
having to read them all in.
With the query as it stands, the only way I can see that happening would be if
you had an index on "my_molkeys(version_id)" and a partial index on
"my_rownum(version_id) WHERE row_num BETWEEN 100000 AND 200000". Then it could
do a merge join between the two index scans.
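For example (the index names here are made up, and note the partial index only
helps queries over this exact row_num slice):

    CREATE INDEX i_molkeys_version ON my_molkeys (version_id);
    -- partial index matching the range in the query's WHERE clause
    CREATE INDEX i_rownum_version_part ON my_rownum (version_id)
        WHERE row_num >= 100000 AND row_num < 200000;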
Note that even then I'm surprised the optimizer is bothering with the index
for these queries, at least for the 400k case. Do you have enable_seqscan=off
or random_page_cost dialled way down?
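Easy enough to check from psql:

    SHOW enable_seqscan;    -- normally 'on'
    SHOW random_page_cost;  -- the stock default is 4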
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com