Re: Maximum number of tables

From: Dmitriy Igrishin <dmitigr(at)gmail(dot)com>
To: Manos Karpathiotakis <mk(at)di(dot)uoa(dot)gr>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org
Subject: Re: Maximum number of tables
Date: 2011-03-16 20:17:50
Message-ID: AANLkTikhPm0+M0dOZ_25JraAdjBc6+M2BeC94wspBoEg@mail.gmail.com
Lists: pgsql-general

Hey Manos,

2011/3/16 Manos Karpathiotakis <mk(at)di(dot)uoa(dot)gr>

> Let me explain a few things about our dataset. We are using a system named
> Sesame [1] that stores and queries RDF data. In our case, it uses Postgres
> as a relational backend. In RDF, data are triples. Here is an example of an
> RDF triple:
>
> ex:Postgres rdf:type ex:RDBMS
>
> Triples are stored in Postgres. We can use two storing schemes.
>
> A monolithic scheme, where every triple is stored in one huge table,
> Triple(int subject, int predicate, int object), and some additional
> information is stored in other tables of the form dictionary(int id, string
> original_value). This results in a schema with approximately 10 tables.
>
Have you considered using hstore in this case?
http://www.postgresql.org/docs/9.0/static/hstore.html
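
With hstore you could keep one row per subject and map predicates to
objects, roughly like this (untested sketch; note that hstore stores a
single value per key, so multi-valued predicates would need another layout):

-- assumes contrib/hstore is installed in the database
CREATE TABLE subject_properties (
    subject    int PRIMARY KEY,
    properties hstore          -- predicate => object, as text
);

-- ex:Postgres rdf:type ex:RDBMS
INSERT INTO subject_properties (subject, properties)
VALUES (1, hstore('rdf:type', 'ex:RDBMS'));

-- all subjects whose rdf:type is ex:RDBMS
SELECT subject
FROM subject_properties
WHERE properties -> 'rdf:type' = 'ex:RDBMS';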

>
> A per-predicate scheme can also be used. This storing scheme creates a
> table for every distinct predicate. For example, to store the aforementioned
> triple, we would create a table type(int subject, int object) and we would
> insert a tuple with the encoded values for ex:Postgres and ex:RDBMS.
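
If I understand correctly, the two layouts look roughly like this (sketch;
table and column names invented for illustration):

-- monolithic scheme: one triple table plus dictionary tables
CREATE TABLE triples (
    subject   int NOT NULL,
    predicate int NOT NULL,
    object    int NOT NULL
);
CREATE TABLE dictionary (
    id             int PRIMARY KEY,
    original_value text NOT NULL
);

-- per-predicate scheme: one two-column table per distinct predicate,
-- e.g. for rdf:type
CREATE TABLE p_rdf_type (
    subject int NOT NULL,
    object  int NOT NULL
);
-- ex:Postgres rdf:type ex:RDBMS, with both URIs dictionary-encoded
INSERT INTO p_rdf_type VALUES (1, 2);
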
>
> Queries for RDF data can be expressed in the SPARQL query language. Sesame
> translates SPARQL queries to SQL queries depending on the storing scheme
> being used. So, you can imagine that when we use the monolithic storing
> scheme, queries would be translated to an SQL query with many self-joins on
> a huge triple table. On the other hand, if we use the per-predicate scheme,
> SPARQL queries are translated to many joins between smaller tables.
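
For example, a two-pattern SPARQL query such as
{ ?x rdf:type ex:RDBMS . ?x ex:vendor ?v } would come out roughly as
(sketch; the ids, table and predicate names are invented, reusing the
layout above):

-- monolithic scheme: self-join on the big triple table
SELECT t1.subject, t2.object
FROM triples t1
JOIN triples t2 ON t2.subject = t1.subject
WHERE t1.predicate = 10 AND t1.object = 2   -- rdf:type = ex:RDBMS
  AND t2.predicate = 11;                    -- ex:vendor

-- per-predicate scheme: join between two small predicate tables
SELECT t.subject, v.object
FROM p_rdf_type t
JOIN p_vendor v ON v.subject = t.subject
WHERE t.object = 2;                          -- ex:RDBMS
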
>
> In our case, we want to store 111M triples (and we would like to experiment
> with even larger datasets) that involve approximately 10.000 distinct
> predicates. This means that we would strongly prefer to use the
> per-predicate storing scheme to get faster query execution. However, we are
> not experienced in tuning Postgres for this kind of data.
>
> Until now, we have disabled WAL, disabled autocommit, and increased shared
> buffers to 512MB and temp buffers to 64MB, but we haven't tried disabling
> indexes and foreign key constraints because that would require some code
> restructuring (I do understand, however, the performance increase we would
> get by disabling them).
>
> We are currently loading the dataset in chunks that perform 24.000.000
> insertions into the database.
>
> Could you suggest some values for shared buffers, temp
> buffers, maintenance_work_mem, checkpoint_segments or other relevant
> parameters that we could use as a starting point?
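
I can't give authoritative numbers, but on a 64GB machine a common starting
point for a bulk load would be something like the following in
postgresql.conf (measure and adjust from there):

# rough starting point for bulk loading on a 64GB box -- not gospel
shared_buffers = 4GB                  # much more useful than 512MB here
maintenance_work_mem = 1GB            # helps index builds and ADD FOREIGN KEY
checkpoint_segments = 64              # fewer, larger checkpoints during the load
checkpoint_completion_target = 0.9
wal_buffers = 16MB
effective_cache_size = 48GB           # planner hint, not an allocation
synchronous_commit = off              # a crash loses the last few commits,
                                      # but the database stays consistent
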
>
> Best Regards,
> Manos Karpathiotakis
>
> [1] http://www.openrdf.org/
>
>
> On Tue, Mar 15, 2011 at 7:15 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Manos Karpathiotakis <mk(at)di(dot)uoa(dot)gr> writes:
>> > Hi all,
>> > I am using postgres 9.0 under CentOS 5.5 (Dual Xeon Quad Core @ 2,44
>> GHz,
>> > 64GB RAM, 2TB RAID 5). In my case, postgres is used as a backend for the
>> RDF
>> > store Sesame.
>> > I am trying to store a dataset that consists of approximately
>> 900.000.000
>> > insertions (organized in either 10 tables, or in an arbitrary number of
>> > tables exceeding 2000). Loading the data in 10 tables takes about two
>> days,
>> > while in the 2nd case, an org.postgresql.util.PSQLException: ERROR: out
>> of
>> > shared memory error is thrown.
>>
>> I'm guessing that you're seeing this through some client-side code that
>> helpfully suppresses the HINT about raising max_locks_per_transaction
>> :-(
>>
>> If your application tries to touch all 2000 tables in one transaction,
>> you will need to raise that parameter to avoid running out of locktable
>> space for the AccessShareLocks it needs on all those tables.
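
In other words, something like this in postgresql.conf (needs a restart; the
shared lock table has room for roughly max_locks_per_transaction *
(max_connections + max_prepared_transactions) locks):

# default is 64; with ~10.000 tables touched in one transaction, go much higher
max_locks_per_transaction = 512
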
>>
>> However ... generally speaking, I'd guess that whatever you did to
>> refactor 10 tables into 2000 was a bad idea --- one table with an extra
>> key column is almost always a better design than N basically-identical
>> tables. The latter will be a *lot* harder to use, and probably won't
>> fix your performance problem anyway. You should attack the performance
>> problem in a different way. Have you read
>> http://www.postgresql.org/docs/9.0/static/populate.html ?
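
The short version of that page for this load (sketch; the file path below is
just a placeholder): use COPY instead of individual INSERTs, create indexes
and foreign keys only after the data is in, and ANALYZE at the end.

-- load each chunk with COPY (server-side file, superuser), or \copy from psql
COPY triples (subject, predicate, object) FROM '/path/to/chunk.tsv';

-- build indexes and constraints only after all chunks are loaded
CREATE INDEX triples_spo ON triples (subject, predicate, object);
ANALYZE triples;
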
>>
>> regards, tom lane
>>
>
>
>
> --
> ===================================================
> Manos Karpathiotakis
> National & Kapodistrian University of Athens
> Department of Informatics & Telecommunications, Office B25
> Management of Data & Information Knowledge Group
> Panepistimioupolis, Ilissia
> GR-15784 Athens, Greece
> Tel: +30.210.727.5159
> Fax: +30.210.727.5214
> e-mail: mk(at)di(dot)uoa(dot)gr
> ===================================================
>
>

--
// Dmitriy.
