Re: [HACKERS] Why is that so slow?

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Why is that so slow?
Date: 1999-03-06 14:08:09
Message-ID: 199903061408.XAA00494@ext16.sra.co.jp
Lists: pgsql-hackers

> Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp> writes:
> > postal=> explain select * from postal,prefecture where city ~ '^aaa' and postal.pid = prefecture.pid;
> > NOTICE: QUERY PLAN:
> >
> > Nested Loop (cost=98.90 size=1 width=100)
> > -> Seq Scan on prefecture (cost=2.55 size=47 width=26)
> > -> Index Scan using pidindex on postal (cost=2.05 size=1 width=74)
> >
> > This is so slooow. Can anybody explain this? Am I missing something?
>
> and later:
> > I had defined a btree index on pid and it has 2000
> > duplicate entries in average! After I removed the index, the query
> > runs unbelievably fast! Now explain shows:
>
> > Nested Loop (cost=933.82 size=1 width=100)
> > -> Index Scan using cityindex on postal (cost=931.77 size=1 width=74)
> > -> Index Scan using prefpidindex on prefecture (cost=2.05 size=47 width=26)
>
> Hmm. Removal of the index is just a hack --- the system should have
> been smart enough not to use it. It looks like the system chose the
> first plan shown above because it thought that selecting postal entries
> matching a particular pid value would on average match only one postal
> tuple (note the "size" fields, which are estimates of the numbers of
> resulting tuples). But in reality, each scan produced 2000 matching
> entries on average, according to your second message --- and each of
> those entries had to be tested to see if it had the right city name.
> So, very slow.
>
> The question I have is why didn't the system realize that there would be
> lots of matches on pid? The "dispersion" statistics it keeps ought to
> have given it a clue that this approach wouldn't be very selective.
>
> The second example is fast because the scan over postal looking for city
> name matches finds only one match, so prefecture is scanned only once.

Actually not only one, since I use the ~ operator. Anyway, the number of
matching rows should be reasonably small.
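(For context: the pattern here is left-anchored, which is what allows the
planner to consider cityindex at all. A minimal sketch using the table from
this thread -- an unanchored pattern cannot use the btree index:

```sql
-- Left-anchored pattern: the planner can convert this into an
-- index range scan on cityindex (city >= 'aaa' AND city < 'aab').
SELECT * FROM postal WHERE city ~ '^aaa';

-- Unanchored pattern: no usable index bound, so this falls back
-- to a sequential scan over postal.
SELECT * FROM postal WHERE city ~ 'aaa';
```

The anchored form can still match several rows, just far fewer than a full
scan would examine.)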

> However the cost estimates there also look bogus --- the system is again
> mis-guessing how many entries will be selected. It seems to think that
> all 47 prefecture entries will be matched by a scan for a specific pid.
> So, bogus dispersion values again (or bad use of them).
>
> Something is fishy here. Have you done a "vacuum analyze" since loading
> the data in these tables?

Oh, I never thought of that. After re-creating the index I had removed
(mentioned in my previous mail) and running vacuum analyze, I got:

Hash Join (cost=951.50 size=19 width=100)
  -> Index Scan using cityindex on postal (cost=944.77 size=19 width=74)
  -> Hash (cost=0.00 size=0 width=0)
       -> Seq Scan on prefecture (cost=2.55 size=47 width=26)

This plan looks good (and is actually as fast as the previous one).
However, the row-count ("size") estimate for prefecture is still 47?
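(For reference, the statistics step Tom suggested is just the following,
run once after loading or reindexing the data -- table names taken from
this thread:

```sql
-- Rebuild the planner's dispersion statistics for both tables,
-- so EXPLAIN's size estimates reflect the actual data distribution.
VACUUM ANALYZE postal;
VACUUM ANALYZE prefecture;
```

Plain VACUUM only reclaims space; the ANALYZE part is what refreshes the
per-column statistics the planner uses for its selectivity estimates.)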
--
Tatsuo Ishii
