Re: Thousands of tables versus on table?

From: david(at)lang(dot)hm
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Scott Marlowe <smarlowe(at)g2switchworks(dot)com>, Thomas Andrews <tandrews(at)soliantconsulting(dot)com>, Mark Lewis <mark(dot)lewis(at)mir3(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Thousands of tables versus on table?
Date: 2007-06-05 22:31:55
Message-ID: Pine.LNX.4.64.0706051507120.24361@asgard.lang.hm
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Tue, 5 Jun 2007, Tom Lane wrote:

> david(at)lang(dot)hm writes:
>> however I really don't understand why it is more efficiant to have a 5B
>> line table that you do a report/query against 0.1% of then it is to have
>> 1000 different tables of 5M lines each and do a report/query against 100%
>> of.
>
> Essentially what you are doing when you do that is taking the top few
> levels of the index out of the database and putting it into the
> filesystem; plus creating duplicative indexing information in the
> database's system catalogs.
>
> The degree to which this is a win is *highly* debatable, and certainly
> depends on a whole lot of assumptions about filesystem performance.
> You also need to assume that constraint-exclusion in the planner is
> pretty doggone cheap relative to the table searches, which means it
> almost certainly will lose badly if you carry the subdivision out to
> the extent that the individual tables become small. (This last could
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
what is considered 'small'? a few thousand records, a few million records?

what multiplication factor would there need to be on the partitioning to
make it worth while? 100 tables, 1000 tables, 10000 tables?

the company that I'm at started out with a seperate database per customer
(not useing postgres), there are basicly zero cross-customer queries, with
a large volume of updates and lookups.

overall things have now grown to millions of updates/day (some multiple of
this in lookups), and ~2000 customers, with tens of millions of rows
between them.

having each one as a seperate database has really helped us over the years
as it's made it easy to scale (run 500 databases on each server instead of
1000, performance just doubled)

various people (not database experts) are pushing to install Oracle
cluster so that they can move all of these to one table with a customerID
column.

the database folks won't comment much on this either way, but they don't
seem enthusiastic to combine all the data togeather.

I've been on the side of things that said that seperate databases is
better becouse it improves data locality to only have to look at the data
for one customer at a time rather then having to pick out that customer's
data out from the mass of other, unrelated data.

> be improved in some cases if we had a more explicit representation of
> partitioning, but it'll never be as cheap as one more level of index
> search.)

say you have a billing table of
customerID, date, description, amount, tax, extended, paid

and you need to do things like
report on invoices that haven't been paied
summarize the amount billed each month
summarize the tax for each month

but you need to do this seperately for each customerID (not as a batch job
that reports on all customerID's at once, think a website where the
customer can request such reports at any time with a large variation in
criteria)

would you be able to just have one index on customerID and then another on
date? or would the second one need to be on customerID||date?

and would this process of going throught he index and seeking to the data
it points to really be faster then a sequential scan of just the data
related to that customerID?

> I think the main argument for partitioning is when you are interested in
> being able to drop whole partitions cheaply.

I fully understand this if you are doing queries across all the
partitions, but if your query is confined to a single partition,
especially in the case where you know ahead of time in the application
which 'partition' you care about it would seem that searching through
significantly less data should be a win.

David Lang

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message david 2007-06-05 22:58:01 Re: Thousands of tables versus on table?
Previous Message Tom Lane 2007-06-05 22:08:05 Re: performance drop on 8.2.4, reverting to 8.1.4