From: Joshua Marsh <icub3d(at)gmail(dot)com>
To: pg(at)fastcrypt(dot)com
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Large Database Performance suggestions
Date: 2004-10-26 15:24:08
Message-ID: 38242de904102608246d060f7d@mail.gmail.com
Lists: pgsql-performance
Thanks for all of your help so far. Here is some of the information
you guys were asking for:
Test System:
2x AMD Opteron 244 (1.8Ghz)
8GB RAM
7x 72GB SCSI HDD (Raid 5)
postgresql.conf information:
#---------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#---------------------------------------------------------------------------
# - Memory -
shared_buffers = 1000 # min 16, at least max_connections*2, 8KB each
#sort_mem = 1024 # min 64, size in KB
#vacuum_mem = 8192 # min 1024, size in KB
sort_mem = 4096000
vacuum_mem = 1024000
# - Free Space Map -
#max_fsm_pages = 20000 # min max_fsm_relations*16, 6 bytes each
#max_fsm_relations = 1000 # min 100, ~50 bytes each
# - Kernel Resource Usage -
#max_files_per_process = 1000 # min 25
#preload_libraries = ''
#---------------------------------------------------------------------------
# WRITE AHEAD LOG
#---------------------------------------------------------------------------
# - Settings -
#fsync = true # turns forced synchronization on or off
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, open_sync, or open_datasync
#wal_buffers = 8 # min 4, 8KB each
# - Checkpoints -
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 300 # range 30-3600, in seconds
#checkpoint_warning = 30 # 0 is off, in seconds
#commit_delay = 0 # range 0-100000, in microseconds
#commit_siblings = 5 # range 1-1000
Everything else is at its default. I actually think the WAL
options are set to their defaults as well, but I don't recall exactly :)
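For reference, using the units noted in the comments above (shared_buffers
in 8KB pages, sort_mem and vacuum_mem in KB), those settings work out to
roughly 8MB of shared buffer cache, about 4GB per sort operation per
backend, and about 1GB for vacuum.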
As for the queries and tables: the data we store is confidential, but
it is essentially an account number with a bunch of boolean fields
that specify whether a person meets various criteria. So a query may
look something like:
SELECT acctno FROM view_of_data WHERE has_name AND is_active_member
AND state = 'OH';
which is explained as something like this:
QUERY PLAN
-----------------------------------------------------------------
Seq Scan on view_of_data (cost=0.00..25304.26 rows=22054 width=11)
Filter: (has_name AND is_active_member AND ((state)::text = 'OH'::text))
(2 rows)
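For context, the table behind that query is shaped roughly like the
sketch below. This is illustrative only; the real table has many more
boolean criteria columns, and the column types here are guesses:
CREATE TABLE view_of_data (
    acctno           varchar,  -- account number
    state            varchar,  -- e.g. 'OH'
    has_name         boolean,  -- one flag per criterion
    is_active_member boolean
    -- ... many more boolean criteria columns
);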
Because we store data from several sources, we occasionally get
requests restricted to particular sources. We simply intersect the
view_of_data table with a sources table that lists which acctnos belong
to which source. Such a query would look something like this:
SELECT acctno FROM view_of_data WHERE has_name AND is_active_member
AND state = 'OH' INTERSECT SELECT acctno FROM sources_data WHERE
source = 175;
which is explained as follows:
QUERY PLAN
-------------------------------------------------------------------------------------------
SetOp Intersect (cost=882226.14..885698.20 rows=69441 width=11)
-> Sort (cost=882226.14..883962.17 rows=694411 width=11)
Sort Key: acctno
-> Append (cost=0.00..814849.42 rows=694411 width=11)
-> Subquery Scan "*SELECT* 1" (cost=0.00..25524.80
rows=22054 width=11)
-> Seq Scan on view_of_data
(cost=0.00..25304.26 rows=22054 width=11)
Filter: (has_name AND is_active_member AND
((state)::text = 'OH'::text))
-> Subquery Scan "*SELECT* 2" (cost=0.00..789324.62
rows=672357 width=11)
-> Seq Scan on sources_data
(cost=0.00..782601.05 rows=672357 width=11)
Filter: (source = 23)
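The sources_data table referenced above is essentially just the
account-to-source mapping, roughly something like this (again only a
sketch, with guessed column types):
CREATE TABLE sources_data (
    acctno varchar,  -- same account number as in view_of_data
    source integer   -- which data source the account came from
);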
Again, we see our biggest bottlenecks when we get over about 50
million records. The time to execute grows exponentially from that
point.
Thanks again for all of your help!
-Josh
On Fri, 22 Oct 2004 07:38:49 -0400, Dave Cramer <pg(at)fastcrypt(dot)com> wrote:
> Josh,
>
> Your hardware setup would be useful too. It's surprising how slow some
> big name servers really are.
> If you are seriously considering memory sizes over 4GB, you may want to
> look at an Opteron.
>
> Dave
>
>
>
> Joshua Marsh wrote:
>
> >Hello everyone,
> >
> >I am currently working on a data project that uses PostgreSQL
> >extensively to store, manage and maintain the data. We haven't had
> >any problems regarding database size until recently. The three major
> >tables we use never get bigger than 10 million records. With this
> >size, we can do things like storing the indexes or even the tables in
> >memory to allow faster access.
> >
> >Recently, we have found customers who are wanting to use our service
> >with data files between 100 million and 300 million records. At that
> >size, each of the three major tables will hold between 150 million and
> >700 million records. At this size, I can't expect it to run queries
> >in 10-15 seconds (what we can do with 10 million records), but would
> >prefer to keep them all under a minute.
> >
> >We did some initial testing with a server with 8GB of RAM and
> >found we can do operations on data files up to 50 million records fairly
> >well, but performance drops dramatically after that. Does anyone have any
> >suggestions on a good way to improve performance for these extra large
> >tables? Things that have come to mind are Replication and Beowulf
> >clusters, but from what I have recently studied, these don't do so well
> >with singular processes. We will have parallel processes running, but
> >it's more important that each individual process be fast than that
> >several processes run in parallel at once.
> >
> >Any help would be greatly appreciated!
> >
> >Thanks,
> >
> >Joshua Marsh
> >
> >P.S. Off-topic, I have a few invitations to gmail. If anyone would
> >like one, let me know.
> >
>
> --
> Dave Cramer
> http://www.postgresintl.com
> 519 939 0336
> ICQ#14675561
>
>