Re: questions on (parallel) COPY and when to REINDEX

From: Janet Jacobsen <jsjacobsen(at)lbl(dot)gov>
To: Andy Colson <andy(at)squeakycode(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: questions on (parallel) COPY and when to REINDEX
Date: 2009-08-03 02:50:53
Message-ID: 4A76508D.6010509@lbl.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi, Andy. Thanks for the responses and information.

Just to let you know... what we are storing in the db are the image
attributes - about 40 of them - not the images. So the COPY
is reading an ascii file of the image attributes. It turns out to be
useful to have the image attributes handy - much better than reading
the image headers. The images are available on spinning disk, and
the image locations are in the db.

Thanks,
Janet

On 02/08/2009 05:59 p.m., Andy Colson wrote:
>>> On 1 Aug 2009, at 23:24, Janet Jacobsen wrote:
>>>> My questions are:
>>>> (2) Should I REINDEX these two tables daily after the pipeline
>>>> completes? Is this what other people do in practice?
>>>
>>> No need, but as people are querying as soon as data is arriving, an
>>> analyse of the table you just copied to should help performance -
>>> the query plans will be derived from more accurate table statistics
>>> that way.
>>
>> The files that are being copied into the table(s) are between 200 to
>> 1,000
>> each, but there are hundreds of these small files every day. Would you
>> recommend running analyze after every COPY?
>
> Just once, at the end. Assuming you are only deleting 100k records
> and re-adding/updating another 100k in a batch. That's not so many
> records it'll through the stats out of whack. (If you were
> dropping/updating 15M in a batch you might wanna analyze in the middle)
>
>>>
>>> You wrote that your pipeline runs for a period of 4 hours and the
>>> table is about 15M rows now. What is taking up all that time? I
>>> understand why you'd want to parallelise that process, but where do
>>> you expect the gains?
>>
>>
>> We're processing images, and the data processing and analysis takes
>> up most of the time, but the images can be processed/analyzed in
>> parallel.
>> We've been doing all of the data loading at the end - one COPY at a
>> time. Originally that made sense because the researchers wanted to
>> check the images before loading the data/analysis results into the db.
>
> Ah! Images! When you are doing the COPY are you escaping the data?
> You cant "just" copy a binary file.
>
> -Andy

In response to

Browse pgsql-general by date

  From Date Subject
Next Message sweta 2009-08-03 05:41:25 Re: Drop Cluster]
Previous Message Ilya Urikh 2009-08-03 02:45:53 Problem with Numerics multiplication in C-function