Re: slow queries on large syslog table

From: Antonio Fiol Bonnín <fiol(at)w3ping(dot)com>
To: colm ennis <colm(dot)ennis(at)eircom(dot)net>
Cc: PostgreSQL General Mailing list <pgsql-general(at)postgresql(dot)org>
Subject: Re: slow queries on large syslog table
Date: 2001-12-14 08:14:05
Message-ID: 3C19B4CD.5080809@w3ping.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi again!

For a table that contains an URL id (~2000 different ones), a zone id
(max. 30 different ones), a timestamp (one every 3 hours for every pair
(url, zone) and 5 results columns for the combination (url, zone, time).

So, I get about 2000*(30/3)*1 INSERTs on that table every 3 hours. That
makes a 5 Million tuples table of it by now. 1/3 is an estimative
factor, as not every couple (url, zone) is valid.

For queries, I always ask the same type of information, more or less:

SELECT * FROM T WHERE stamp BETWEEN xxxxx AND yyyyy AND (zone=zz OR
zone=ww [...]) AND url=uuuu;
where xxxxx, yyyyy, zz, ww and uuuu are arbitrary constants of the
needed types.

Requests from that table take about 1 second. An index using all three
columns (url, zone, stamp), in that order is created.

If that index is not present, or is the wrong order, nothing works
(requests take for ages).

WHICH LEADS ME TO THINK THAT I GAVE YOU THE WRONG ANSWER!!!!!

Put the timestamp as the last column at the index.

It is MUCH better if the first columns match as EXACTLY as possible. If
an index matches a range, results may be good, BUT if it matches a
CONSTANT, results are wonderful!!

Well, maybe I am going a bit too far with my last sentence, but you get
it, don't you?

With the idea about the phone book again. You were true that you'd
better go directly to the page containing the exact data you need than
searching for any data range at all.

BUT FORGET about the index size, unless space is a major limitation for
you. These days, disk space is cheap ;-)

Sincerely, It is really worth trying.

Antonio

colm ennis wrote:

>hi all,
>
>thanks for your help, its comforting but also kinda scary to know
>im not the only one whos having trouble!
>
>in response to questions....
>
>as i mentioned before, the syslog_table is currently holds about
>1.7 million rows and is constantly slowly growing, the hostid_table
>and ciscomdgid_table each hold about 80 rows.
>
>the number of rows returned when i ran the query below :
> SELECT stimestamp,shostid,smsg FROM syslog_table WHERE (shostid IN (23,3))
>AND (sciscomsgid IN (41,32,70)) ORDER BY stimestamp DESC LIMIT 1000
>was 19 in all cases, so i guess? the row estimations are woefully
>inaccurate.
>
>i ran a vacuum analyse a few minutes prior to trying these queries.
>
>my postgresql is version 7.1.3, is running on freebsd 4.3 and is the
>standard ports install.
>
>with regard to resources, heres the output of top mid select :
> last pid: 77402; load averages: 0.17, 0.08, 0.03
>up 87+05:05:42 23:36:25
> 48 processes: 2 running, 46 sleeping
> CPU states: 11.6% user, 0.0% nice, 5.4% system, 0.0% interrupt, 82.9%
>idle
> Mem: 47M Active, 139M Inact, 52M Wired, 8784K Cache, 35M Buf, 656K Free
> Swap:
>
> PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
> 77400 pgsql 2 0 5956K 3568K RUN 0:04 21.08% 12.94% postgres
> 77399 www 2 0 7896K 6776K select 0:01 2.32% 1.46% perl
> 77257 pgsql 2 0 7292K 4832K sbwait 0:04 0.05% 0.05% postgres
> 65374 root 10 0 3440K 2696K nanslp 41:00 0.00% 0.00% perl
> 74942 pgsql 2 0 8048K 5876K sbwait 8:46 0.00% 0.00% postgres
> 75116 root 2 0 2148K 1124K poll 1:30 0.00% 0.00% syslogd
>(hmm ...seems like no swap device is configured, but theres loads of
>inactive pages anyway i guess)
>
>memory :
> hw.physmem: 264351744
>cpu :
> CPU: Pentium III/Pentium III Xeon/Celeron (547.18-MHz 686-class CPU)
>disk :
> 1 x 9gig scsi
>
>im not using the -B option so i guess im using the3 default number/size
>buffers.
>
>hope this helps!
>
>thanks again for all your help a i am completely clueless!
>
>colm ennis
>
>-----Original Message-----
>From: Antonio Fiol Bonnin [mailto:fiol(at)w3ping(dot)com]
>Sent: 13 December 2001 18:14
>To: colm ennis
>Cc: PostgreSQL General Mailing list
>Subject: Re: [GENERAL] slow queries on large syslog table
>
>
>Are the rows estimations "real"?
>
>
>colm ennis wrote:
>
>>hi antonio,
>>
>>thanks for your advice.
>>
>>ive tried a lot of different index combinations, with extremely variable
>>results,
>>for instance :
>> query - SELECT stimestamp,shostid,smsg FROM syslog_table WHERE (shostid IN
>>(23,3)) AND (sciscomsgid IN (41,32,70)) ORDER BY stimestamp DESC LIMIT 1000
>> matching messages - 19
>>
>>with original indexes :
>> query time(s) - 225
>> explain - Limit (cost=0.00..34559.46 rows=1000 width=24)
>> -> Index Scan Backward using syslog_table_stimestamp_index on
>>syslog_table (cost=0.00..577149.86 rows=16700 width=24)
>>
>>with antonios index :
>> create index syslog_table_stimestamp_shostid_sciscomsgid_index on
>>syslog_table (stimestamp, shostid, sciscomsgid);
>> query time(s) - 174
>> explain - Limit (cost=0.00..34329.14 rows=1000 width=24)
>> -> Index Scan Backward using syslog_table_st_sh_sc_index on syslog_table
>>(cost=0.00..580639.57 rows=16914 width=24)
>>
>>with NO! index :
>> query time(s) - 77
>> explain - Limit (cost=73979.79..73979.79 rows=1000 width=24)
>> -> Sort (cost=73979.79..73979.79 rows=16905 width=24)
>> -> Seq Scan on syslog_table (cost=0.00..72591.62 rows=16905
>>width=24)
>>
>>i got similarily confusing results from other queries.
>>
>>it occured to me that that the index antonio suggests is going to be huge
>>because
>>of the per second timestamp.
>>
>>i thought about what you said about deciding how I would search for data.
>>
>>for the query above :
>> SELECT stimestamp,shostid,smsg FROM syslog_table WHERE (shostid IN (23,3))
>>AND (sciscomsgid IN (41,32,70)) ORDER BY stimestamp DESC LIMIT 1000
>>i would lookup shostid and sciscomsgid in a combined index, and then load
>>the
>>indexed rows in syslog_table. there are about 80 different shostids and
>>sciscomsgids, so i guess the max size of this index would be 1600 rows.
>>
>>for other queries i would use an shostid index or sciscomsgid if just one
>>
>of
>
>>these
>>fields appeared in the select, avoiding the stimestamp at all cost because
>>its index
>>will be huge.
>>
>>so to test if this was any good i created the combined index :
>> create index syslog_table_sh_sc_index on syslog_table (shostid,
>>sciscomsgid);
>>but using explain found it isnt being used? :
>> explain - Limit (cost=74018.18..74018.18 rows=1000 width=24)
>> -> Sort (cost=74018.18..74018.18 rows=16914 width=24)
>> -> Seq Scan on syslog_table (cost=0.00..72629.33 rows=16914
>>width=24)
>>
>>i still dont understand how to use indexes to increase the speed of
>>
>queries.
>
>>thanks for your help so far but i still feel lost,
>>
>>colm ennis
>>
>>-----Original Message-----
>>From: Antonio Fiol Bonnin [mailto:fiol(at)w3ping(dot)com]
>>Sent: 13 December 2001 15:29
>>To: colm ennis; PostgreSQL General Mailing list
>>Subject: Re: [GENERAL] slow queries on large syslog table
>>
>>
>>>
>>>ive also created a web interface for selecting syslogs based on optional
>>>combinations of timestamp, hostname and ciscomsg.
>>>
>>Combinations is the *magic* word.
>>
>>>to speed queries i created some indexes on syslog_table :
>>> create index syslog_table_stimestamp_index on syslog_table (stimestamp);
>>> create index syslog_table_shostid_index on syslog_table (shostid);
>>> create index syslog_table_sciscomsgid_index on syslog_table
>>>
>(sciscomsgid);
>
>>> create index syslog_table_shostid_sciscomsgid_index on syslog_table
>>>(shostid,sciscomsgid);
>>>
>>Most of them are of no use. That's what you observed...
>>
>>I bet you will get much better perfs with:
>>
>>create index syslog_table_stimestamp_shostid_sciscomsg_index on
>>
>syslog_table
>
>>(stimestamt, shostid, sciscomsg);
>>
>>You can try other combinations, but the one I suggested should be of use
>>in case you use all three on the query, (or even if you use only the
>>first, or the first two, though not sure about this last part, in
>>parentheses).
>>
>>
>>Believe me. Creating an index on a large table is of no use, unless it
>>is the right one. As a rule of thumb, include in the index as many of
>>the SELECTIVE columns present in the WHERE clause as you can.
>>
>>I am not sure of the selectivity of your columns (never used that
>>particular structure). There should be some information about that on
>>some of the system tables. However, I do not know in which, or how to
>>get that info.
>>
>>As a second rule of thumb, think how YOU would search for the data you
>>need if it was written on a paper book, and especially, how you would
>>like to find the book ordered. For example, if you were to look the
>>address corresponding to a phone number, you would like to find the data
>>ordered by phone number, and not by name. Name is not of any use to you.
>>
>>For selectivity, think of finding the phone numbers of all people that
>>live at number 5, but of any street, and whose first name is Peter.
>>
>>Neither "Peter" nor "5" are REALLY useful informations to perform your
>>search. And even, Peter is more useful than 5.
>>
>>HTH,
>>
>>Antonio
>>
>>
>>
>>---------------------------(end of broadcast)---------------------------
>>TIP 6: Have you searched our list archives?
>>
>>http://archives.postgresql.org
>>
>>.
>>
>
>
>
>
>.
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Anil Kumar Narule 2001-12-14 08:42:52 Display of Japanese character
Previous Message Huang Ming 2001-12-14 07:52:58 Re: [CORE] report a bug