Re: Maximum number of tables

From: Manos Karpathiotakis <mk(at)di(dot)uoa(dot)gr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Maximum number of tables
Date: 2011-03-16 14:35:57
Message-ID: AANLkTikuwoiC34W54VDVX4SZgN2PQHHQrJ-aKSvYLN0x@mail.gmail.com
Lists: pgsql-general

Let me explain a few things about our dataset. We are using a system named
Sesame [1] that stores and queries RDF data. In our case, it uses Postgres
as a relational backend. In RDF, data are triples. Here is an example of an
RDF triple:

ex:Postgres rdf:type ex:RDBMS

Triples are stored in Postgres, and we can use one of two storage schemes.

The first is a monolithic scheme, where every triple is stored in one huge table, Triple(int
subject, int predicate, int object), and some additional information is stored in other
tables of the form dictionary(int id, string original_value). This results in a schema with
approximately 10 tables.
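
As a rough sketch (the table and column names here are illustrative, not the exact ones
Sesame generates), the monolithic scheme looks something like this:

    -- one huge table holding every triple as encoded integer IDs
    CREATE TABLE triple (
        subject   integer NOT NULL,
        predicate integer NOT NULL,
        object    integer NOT NULL
    );

    -- dictionary tables map the integer IDs back to the original RDF terms
    CREATE TABLE dictionary (
        id             integer PRIMARY KEY,
        original_value text NOT NULL
    );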

The second is a per-predicate scheme, which creates a table for every distinct predicate.
For example, to store the triple above, we would create a table type(int subject, int
object) and insert a tuple with the encoded values for ex:Postgres and ex:RDBMS.
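
Again only as an illustration (the actual table name and the encoded IDs come from
Sesame's dictionaries), this would be something like:

    -- one table per distinct predicate; this one holds all rdf:type triples
    CREATE TABLE type (
        subject integer NOT NULL,
        object  integer NOT NULL
    );

    -- the IDs below are invented; the real encoded values for ex:Postgres
    -- and ex:RDBMS come from the dictionary tables
    INSERT INTO type (subject, object) VALUES (1001, 2002);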

Queries over RDF data are expressed in the SPARQL query language. Sesame translates SPARQL
queries to SQL queries depending on the storage scheme in use. With the monolithic scheme, a
SPARQL query is translated to an SQL query with many self-joins on the huge triple table;
with the per-predicate scheme, it is translated to many joins between smaller tables.
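
To make that concrete, a SPARQL query with two triple patterns would translate roughly as
follows (ex:developedBy and all the integer constants are invented for the example):

    -- SPARQL pattern: ?x rdf:type ex:RDBMS . ?x ex:developedBy ?y

    -- monolithic scheme: one self-join on the triple table per extra pattern
    SELECT t1.subject, t2.object
    FROM   triple t1
    JOIN   triple t2 ON t2.subject = t1.subject
    WHERE  t1.predicate = 42     -- encoded rdf:type
      AND  t1.object    = 7      -- encoded ex:RDBMS
      AND  t2.predicate = 99;    -- encoded ex:developedBy

    -- per-predicate scheme: the same pattern becomes a join of two small tables
    SELECT t.subject, d.object
    FROM   type t
    JOIN   developedby d ON d.subject = t.subject
    WHERE  t.object = 7;         -- encoded ex:RDBMS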

In our case, we want to store 111M triples (and we would like to experiment with even larger
datasets) containing approximately 10.000 distinct predicates. This means we would strongly
prefer the per-predicate storage scheme, for faster query execution. However, we are not
experienced in tuning Postgres for this kind of data.

Until now, we have disabled WAL, disabled autocommit, and increased shared buffers to 512MB
and temp buffers to 64MB, but we have not tried disabling indices and foreign key
constraints, because that would require some code restructuring (I do understand the
performance increase we would get by disabling them).

We are currently loading the dataset in chunks of 24.000.000 insertions each.

Could you suggest starting values for shared_buffers, temp_buffers, maintenance_work_mem,
checkpoint_segments, or other relevant parameters?
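
For reference, our current non-default settings and the parameters we are unsure about, in
postgresql.conf form:

    # already changed for this load
    shared_buffers = 512MB
    temp_buffers   = 64MB

    # looking for good starting values here
    #maintenance_work_mem = ...
    #checkpoint_segments  = ...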

Best Regards,
Manos Karpathiotakis

[1] http://www.openrdf.org/

On Tue, Mar 15, 2011 at 7:15 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Manos Karpathiotakis <mk(at)di(dot)uoa(dot)gr> writes:
> > Hi all,
> > I am using postgres 9.0 under CentOS 5.5 (Dual Xeon Quad Core @ 2,44 GHz,
> > 64GB RAM, 2TB RAID 5). In my case, postgres is used as a backend for the
> RDF
> > store Sesame.
> > I am trying to store a dataset that consists of approximately 900.000.000
> > insertions (organized in either 10 tables, or in an arbitrary number of
> > tables exceeding 2000). Loading the data in 10 tables takes about two
> days,
> > while in the 2nd case, an org.postgresql.util.PSQLException: ERROR: out
> of
> > shared memory error is thrown.
>
> I'm guessing that you're seeing this through some client-side code that
> helpfully suppresses the HINT about raising max_locks_per_transaction
> :-(
>
> If your application tries to touch all 2000 tables in one transaction,
> you will need to raise that parameter to avoid running out of locktable
> space for the AccessShareLocks it needs on all those tables.
>
> However ... generally speaking, I'd guess that whatever you did to
> refactor 10 tables into 2000 was a bad idea --- one table with an extra
> key column is almost always a better design than N basically-identical
> tables. The latter will be a *lot* harder to use, and probably won't
> fix your performance problem anyway. You should attack the performance
> problem in a different way. Have you read
> http://www.postgresql.org/docs/9.0/static/populate.html ?
>
> regards, tom lane
>

--
===================================================
Manos Karpathiotakis
National & Kapodistrian University of Athens
Department of Informatics & Telecommunications, Office B25
Management of Data & Information Knowledge Group
Panepistimioupolis, Ilissia
GR-15784 Athens, Greece
Tel: +30.210.727.5159
Fax: +30.210.727.5214
e-mail: mk(at)di(dot)uoa(dot)gr
===================================================
