Re: Re[2]: planner/optimizer hash index method and so on

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Alexey V(dot) Meledin" <avm(at)webclub(dot)ru>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Re[2]: planner/optimizer hash index method and so on
Date: 2000-04-05 16:04:11
Message-ID: 18041.954950651@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

"Alexey V. Meledin" <avm(at)webclub(dot)ru> writes:
>>> create index itd_hash_data_id on ifs_tree_default USING HASH (ifs_data_id);
>>> create index iatd_hash_data_id on ifs_account_tree_data USING HASH (ifs_data_id);

TL> Why would you do that? The hash index method doesn't have any advantage
TL> over btree that I can see, and it's got a lot of disadvantages.

> But as I understand from documentation, Hash Join is the preferable
> method of JOIN, because of Seq Scan performed only once (I suppose, that
> it's not full table scan!?!).

Where in the documentation did you see that? If it implies that then
it's wrong. We could certainly simplify the optimizer a lot if it could
just always pick a hash join ;-).

But more to the point, whether a hash join is used has nothing to do
with what kind of index you have. You do not need a hash index to
support a hash join. Hash join means that we build a in-memory hash
table containing the relevant rows from the inner relation (discarding
any that can be rejected due to single-relation WHERE conditions), and
then probe into that table for each row of the outer relation.

> Nested Joins performs too many scans for results in JOIN, containing
> many rows and are so slow on large tables (planner/optimizer problem?).

Could be. The optimizer is a work-in-progress; I wouldn't claim that it
always makes the right choices. But without more details it's hard to
say if it's doing the wrong thing or not.

> 1. What is the difference among "Seq Scan" and "Index Scan" in
> a planer/optimizer query execution plan?

sequential scan (read whole table sequentially) or index scan (use
index to scan just the rows meeting a WHERE condition that matches
the index).

> So, the difference is in ROWS and WIDTH:
> PS: Each table has 100000 rows.
> 1. Nested Loop (cost=4.05 rows=1 width=28)
-> Index Scan using aaa_index_b on aaa a (cost=2.05 rows=1 width=12)
-> Index Scan using bbb_pk on bbb b (cost=2.00 rows=61984 width=16)
> 2. Nested Loop (cost=4.05 rows=1 width=12)
-> Index Scan using aaa_index_b on aaa a (cost=2.05 rows=1 width=4)
-> Index Scan using bbb_pk on bbb b (cost=2.00 rows=1 width=8)

> So, questions:
> 1. can I say, that the second query is more optimal then 1-st?

I'm not sure why 6.5 fails to display a reasonable rows estimate in
the first case. Current sources produce a more reasonable result.
That rows estimate is clearly wrong, or at least inconsistent with
the estimated cost of the indexscan and the estimated number of
rows produced by the join.

> 2. Variants I try:
> 2.1. When I use no indexes, then:
> 2.1.1. Nested Loop (cost=2442.50 rows=1 width=28)
-> Seq Scan on aaa a (cost=2440.50 rows=1 width=12)
-> Index Scan using bbb_pk on bbb b (cost=2.00 rows=61984 width=16)
> 2.1.2. Nested Loop (cost=2442.50 rows=1 width=12)
-> Seq Scan on aaa a (cost=2440.50 rows=1 width=4)
-> Index Scan using bbb_pk on bbb b (cost=2.00 rows=1 width=8)

> 2.2. I try to set indexes on a.a and b.a, supposing that it helps a
> bit on relation "a.a=b.a"
> 2.2.1. Nested Loop (cost=3355.28 rows=1 width=28)
-> Seq Scan on aaa a (cost=3353.28 rows=1 width=12)
-> Index Scan using bbb_pk on bbb b (cost=2.00 rows=85159 width=16)
> 2.2.2. Nested Loop (cost=3355.28 rows=1 width=12)
-> Seq Scan on aaa a (cost=3353.28 rows=1 width=4)
-> Index Scan using bbb_pk on bbb b (cost=2.00 rows=1 width=8)

> Perfomace become lower!! Why?

Did performance *actually* change? I don't see how it could,
considering that the plan is the same. I hope you're not confusing
the planner's cost estimates with reality ;-).

If you're wondering why the estimate changed, it's probably because
CREATE INDEX updates the planner's information about number of rows
and number of disk pages the relation has. I guess you loaded
more data since your last VACUUM.

> 2.4. I add indexes on a.a and a.b
> Nothing happens!

It's already using an index that covers column b, and the index on
column a doesn't help because it can only use one index in a scan.

regards, tom lane

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Alexey V. Meledin 2000-04-05 16:42:53 Re[2]: Re[2]: planner/optimizer hash index method and so on
Previous Message Oleg Broytmann 2000-04-05 14:35:21 Re: Is Pgsql an OORDBMS?