| From: | "Michael Richards" <michael(at)fastmail(dot)ca> | 
|---|---|
| To: | pgsql-sql(at)postgresql(dot)org | 
| Subject: | Re: Indexes on functions | 
| Date: | 2001-10-18 01:52:22 | 
| Message-ID: | 3BCE35D6.00000B.93192@frodo.searchcanada.ca | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-sql | 
I'm going to write a function in C that parses XML. Is there any way 
to index the output of this function? I've got 10,000,000 rows that 
contain XML data and I need to efficiently find the ones that contain 
the proper keys. I tried pulling the values I want from the data and 
putting it in its own table but with an average of 20 items that just 
grows to an unmanageable size.
-Michael
_________________________________________________________________
     http://fastmail.ca/ - Fast Free Web Email for Canadians
>From pgsql-sql-owner(at)postgresql(dot)org  Thu Oct 18 09:40:33 2001
Received: from fw-ext.econet.hu (fw.econet.hu [213.16.96.5])
	by postgresql.org (8.11.3/8.11.4) with ESMTP id f9HENGP37202
	for <pgsql-sql(at)postgresql(dot)org>; Wed, 17 Oct 2001 10:23:16 -0400 (EDT)
	(envelope-from col(at)mportal(dot)hu)
Received: from mail.econet.hu ([172.17.1.17])
	by fw-ext.econet.hu with esmtp (Exim 3.12 #1 (Debian))
	id 15trgk-000242-00; Wed, 17 Oct 2001 16:28:54 +0200
Received: from host152.localnet3 ([192.168.12.152] helo=mportal.hu)
	by mail.econet.hu with esmtp (Exim 3.12 #1 (Debian))
	id 15trb4-0003hG-00; Wed, 17 Oct 2001 16:23:02 +0200
Message-ID: <3BCD9582(dot)1070804(at)mportal(dot)hu>
Date: Wed, 17 Oct 2001 16:28:18 +0200
From: CoL <col(at)mportal(dot)hu>
User-Agent: Mozilla/5.0 (Windows; U; Win95; en-US; rv:0.9.2) Gecko/20010726 Netscape6/6.1
X-Accept-Language: en-us
MIME-Version: 1.0
To: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
CC: pgsql-sql(at)postgresql(dot)org
Subject: Re: index problem
References: <Pine(dot)BSF(dot)4(dot)21(dot)0110161548400(dot)18471-100000(at)megazone23(dot)bigpanda(dot)com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Scanner: exiscan *15trb4-0003hG-00*YiTaXybQ75Y* http://duncanthrax.net/exiscan/
X-Archive-Number: 200110/249
X-Sequence-Number: 5204
Hi,
Stephan Szabo wrote:
> On Tue, 16 Oct 2001, CoL wrote:
> 
> 
>>---------------------------
>>The 2 table query, where prog_data has ~8800 rowsn and index on prog_id:
>>bash-2.04$ time echo "explain select distinct 
>>prog_id,prog_ftype,prog_fcasthour,prog_date from prog_dgy_xy,prog_data 
>>where pxygy_pid=prog_id " | psql -Uuser db
>>NOTICE:  QUERY PLAN:
>>
>>Unique  (cost=7432549.69..7680455.07 rows=2479054 width=32)
>>   ->  Sort  (cost=7432549.69..7432549.69 rows=24790538 width=32)
>>         ->  Merge Join  (cost=148864.65..161189.33 rows=24790538 width=32)
>>               ->  Index Scan using prog_data_pkey on prog_data 
>>(cost=0.00..701.12 rows=8872 width=28)
>>               ->  Sort  (cost=148864.65..148864.65 rows=921013 width=4)
>>                     ->  Seq Scan on prog_dgy_xy  (cost=0.00..30145.13 
>>rows=921013 width=4)
>>
> 
> I'm guessing that the approximately 25 million row estimate on the join
> has to be wrong as well given that prog_data.prog_id should be unique.
> 
> Hmm, does the explain change if you vacuum analyze the other table
> (prog_data)?  If not, what does explain show if you do a
> set enable_seqscan='off';
> before it?
The result:
db=>set enable_seqscan='off';
db=>explain select distinct prog_id,prog_ftype,prog_fcasthour,prog_date 
from prog_dgy_xy,prog_data  where pxygy_pid=prog_id;
NOTICE:  QUERY PLAN:
Unique  (cost=7606982.10..7854887.48 rows=2479054 width=32)
   ->  Sort  (cost=7606982.10..7606982.10 rows=24790538 width=32)
         ->  Merge Join  (cost=0.00..335621.73 rows=24790538 width=32)
               ->  Index Scan using progdgyxy_idx2 on prog_dgy_xy 
(cost=0.00..323297.05 rows=921013 width=4)
               ->  Index Scan using prog_data_pkey on prog_data 
(cost=0.00..701.12 rows=8872 width=28)
It "seems" index is used, but the same result :(((, and bigger execution 
time: real   3m41.830s
What is in tables?
prog_data contains unique id and other info.
prog_dgy_xy contains that id with x,y coordinates (so many ids from 
prog_data with unique x,y)
#prog_data:
#prog_id, prog_ftype, prog_fcasthour, prog_date
#1 
'type'        6	              2001-10-14 12:00:00
#2 
'type'        12              2001-10-14 12:00:00
#prog_dgy_xy:
#pxygy_pid, pxygy_x, pxygy_y
#1 
     0.1       0.1
#1          0.1       0.15
How can this query takes real    0m1.755s for mysql, [17 sec for 
oracle], and 2-3 minutes!! for postgres?
And why:
POSTGRES:
set enable_seqscan ='off'; select count(*) from prog_dgy_xy where 
pxygy_pid<13161;
  count
--------
  900029
real    2m34.340s
explain:
Aggregate  (cost=327896.89..327896.89 rows=1 width=0)
   ->  Index Scan using progdgyxy_idx2 on prog_dgy_xy 
(cost=0.00..325594.54 rows=920940 width=0)
MYSQL:
select count(pxygy_pid) from PROG_DGY_XY where pxygy_pid<13161
count(pxygy_pid)
900029
real    0m27.878s
explain:
table   type    possible_keys   key     key_len ref     rows    Extra
PROG_DGY_XY     range   progdgyxy_idx1,progdgyxy_idx2   progdgyxy_idx2 
4       NULL    906856  where used; Using index
The same time difference in case of: = or >, however explain says, cause 
seq scan is off, the index is used.
I did vacuum, and vacuum analyze too before.
PS: I think i have to make a site for that, cause there are many 
questions :), and weird things.
I love postgres but this makes me "hm?". Today i'll make these test 
under 7.1.2.
thx
CoL
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Henshall, Stuart - WCP | 2001-10-18 13:28:31 | Re: Deleting obsolete values | 
| Previous Message | Oleg Lebedev | 2001-10-17 22:32:57 | SQL parser and/or optimizer |