strategies for optimizing read on rather large tables

From: hubert lubaczewski <hubert(dot)lubaczewski(at)eo(dot)pl>
To: pgsql-performance(at)postgresql(dot)org
Subject: strategies for optimizing read on rather large tables
Date: 2005-06-04 08:17:42
Message-ID: 20050604081742.GA29123@eo.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

hi
first let me draw the outline.
we have a database which stores "adverts".
each advert is in one category, and one or more "region".
regions and categories form (each) tree structure.
assume category tree:

a
/ \
b c
/ \
d e

if any given advert is in category "e". it means it is also in "b" and
"a".
same goes for regions.

as for now we have approx. 400 categories, 1300 regions, and 1000000
adverts.

since checking always over the tress of categories and regions we
created acr_cache table (advert/category/region)
which stores information on all adverts and all categories and regions
this particular region is in.
plus some more information for sorting purposes.

this table is ~ 11 milion records.

now.
we query this in more or less this manner:

select advert_id from acr_cache where category_id = ? and region_id = ?
order by XXX {asc|desc} limit 20;

where XXX is one of 5 possible fields,
timestamp,
timestamp,
text,
text,
numeric

we created index on acr_cache (category_id, region_id)
and it works rather well.
usually.
if a given "crossing" (category + region) has small amount of ads (less
then 10000) - the query is good enough (up to 300 miliseconds).
but when we enter the crossings which result in 50000 ads - the query
takes up to 10 seconds.
which is almost "forever".

we thought about creating indices like this:
index on acr_cache (effective_date);
where effective_dateis on of the timestamp fields.
it worked well for the crossings with lots of ads, but when we asked for
small crossing (like 1000 ads) it took > 120 seconds!
it appears that postgresql was favorizing this new advert instead of
using much better index on category_id and region_id.

actually - i'm not sure what to do next.
i am even thinkinh about createing special indices (partial) for big
crossings, but that's just weird. plus the fact that already the
acr_cache vacuum time exceeds 3 hours!.

any suggestions?
hardware is dual xeon 3 ghz, 4G ram, hardware scsi raid put into raid 1.
settings in postgresql.conf:
listen_addresses = '*'
port = 5800
max_connections = 300
superuser_reserved_connections = 50
shared_buffers = 131072
work_mem = 4096
maintenance_work_mem = 65536
fsync = false
commit_delay = 100
commit_siblings = 5
checkpoint_segments = 10
effective_cache_size = 10000
random_page_cost = 1.1
log_destination = 'stderr'
redirect_stderr = true
log_directory = '/home/pgdba/logs'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_truncate_on_rotation = false
log_rotation_age = 1440
log_rotation_size = 502400
log_min_duration_statement = -1
log_connections = true
log_duration = true
log_line_prefix = '[%t] [%p] <%u(at)%d> '
log_statement = 'all'
stats_start_collector = true
stats_command_string = true
stats_block_level = true
stats_row_level = true
stats_reset_on_server_start = true
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'

actual max numer of connection is 120 plus some administrative connections (psql sessions).
postgresql version 8.0.2 on linux debian sarge.

best regards,

depesz

--
hubert lubaczewski
Network Operations Center
eo Networks Sp. z o.o.

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message philb 2005-06-04 10:04:10 Re: Query plan for very large number of joins
Previous Message Junaili Lie 2005-06-04 01:52:27 Re: Forcing use of specific index