Re: questions on (parallel) COPY and when to REINDEX

From: Alban Hertroys <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl>
To: Janet Jacobsen <jsjacobsen(at)lbl(dot)gov>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: questions on (parallel) COPY and when to REINDEX
Date: 2009-08-02 12:09:11
Message-ID: 1B26FE05-7452-45EB-B547-32061E025CE6@solfertje.student.utwente.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 1 Aug 2009, at 23:24, Janet Jacobsen wrote:

> Both of the big tables (now up to > 15 M rows each) have
> indexes on several of the columns. The indexes were
> created using CREATE INDEX CONCURRENTLY...
> Both tables have one or two foreign key constraints.

Others have answered some of your questions already, so I snipped a few.

> My questions are:
> (2) Should I REINDEX these two tables daily after the pipeline
> completes? Is this what other people do in practice?

No need, but as people are querying as soon as data is arriving, an
analyse of the table you just copied to should help performance - the
query plans will be derived from more accurate table statistics that
way.

> (3) Currently the pipeline executes in serial fashion. We'd
> like to cut the wall clock time down as much as possible.
> The data processing and data analysis can be done in parallel,
> but can the loading of the database be done in parallel, i.e.,
> can I execute four parallel COPY commands from four copies
> of a script? Our initial attempt at doing this failed. I found one
> posting in the archives about parallel COPY, but it doesn't seem
> to be quite on point.

As long as you're not using the same database-session in parallel you
should be fine. You can't do parallel database operations in the same
session. Last time I did something similar I used separate database
connections.

> (5) If I drop the indexes and foreign key constraints, then is it
> possible to COPY to a table from more than one script, i.e., do
> parallel COPY? It seems like a really bad idea to drop those
> foreign key constraints.

You can COPY in parallel, but having no FK's does help insert-
performance. In that case whether you should or shouldn't remove your
FK's depends on what's more important to you; insert performance or
data correctness.

As some of your users query the data while it's still coming in I
guess that data correctness is in fact more important to you and you
should keep the FK's.

You wrote that your pipeline runs for a period of 4 hours and the
table is about 15M rows now. What is taking up all that time? I
understand why you'd want to parallelise that process, but where do
you expect the gains?

From the above I'm guessing that part of the problem you want to
solve by parallelising is insert-performance. In cases like these I've
seen people with problems with the look-up speed of foreign keys
because the database chose a bad query plan. Regular analyses during
inserting data should prevent that.

> Should re-think about where our database loading fits into the
> overall pipeline, i.e., do the data processing and analysis in
> parallel, but keep the data loading sequential? The reason for
> not doing all of the data loading at the end is that some of the
> users *really* want to start querying the data and analysis
> results as soon as they enter the database.

You seem to have a design where the database grows indefinitely? Your
database doesn't look to be very big yet, but I expect that at some
point in the future your data will grow so much that the ordering of
it on disk starts to matter for how fast records can be fetched.
That's a typical scenario where people here start to advise using
table partitioning.

The thing is though that partitioning only works well if the queries
your users perform contain clauses of which the database knows they
divide the data (the same clauses used to partition the table).

The people you work with apparently are knowledgeable enough that they
create their own queries. They will have to take partition constraints
into account too if you choose to use table partitioning.

> Looking forward to your replies.
>
> Janet

Regards,
Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

!DSPAM:737,4a7581ec10134875916639!

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Pavel Stehule 2009-08-02 12:20:18 Re: Division by zero
Previous Message Sam Mason 2009-08-02 11:43:57 Re: Division by zero