From: | Janet Jacobsen <jsjacobsen(at)lbl(dot)gov> |
---|---|
To: | Andy Colson <andy(at)squeakycode(dot)net> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: questions on (parallel) COPY and when to REINDEX |
Date: | 2009-08-03 02:50:53 |
Message-ID: | 4A76508D.6010509@lbl.gov |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Hi, Andy. Thanks for the responses and information.
Just to let you know... what we are storing in the db are the image
attributes - about 40 of them - not the images. So the COPY
is reading an ascii file of the image attributes. It turns out to be
useful to have the image attributes handy - much better than reading
the image headers. The images are available on spinning disk, and
the image locations are in the db.
Thanks,
Janet
On 02/08/2009 05:59 p.m., Andy Colson wrote:
>>> On 1 Aug 2009, at 23:24, Janet Jacobsen wrote:
>>>> My questions are:
>>>> (2) Should I REINDEX these two tables daily after the pipeline
>>>> completes? Is this what other people do in practice?
>>>
>>> No need, but as people are querying as soon as data is arriving, an
>>> analyse of the table you just copied to should help performance -
>>> the query plans will be derived from more accurate table statistics
>>> that way.
>>
>> The files that are being copied into the table(s) are between 200 to
>> 1,000
>> each, but there are hundreds of these small files every day. Would you
>> recommend running analyze after every COPY?
>
> Just once, at the end. Assuming you are only deleting 100k records
> and re-adding/updating another 100k in a batch. That's not so many
> records it'll through the stats out of whack. (If you were
> dropping/updating 15M in a batch you might wanna analyze in the middle)
>
>>>
>>> You wrote that your pipeline runs for a period of 4 hours and the
>>> table is about 15M rows now. What is taking up all that time? I
>>> understand why you'd want to parallelise that process, but where do
>>> you expect the gains?
>>
>>
>> We're processing images, and the data processing and analysis takes
>> up most of the time, but the images can be processed/analyzed in
>> parallel.
>> We've been doing all of the data loading at the end - one COPY at a
>> time. Originally that made sense because the researchers wanted to
>> check the images before loading the data/analysis results into the db.
>
> Ah! Images! When you are doing the COPY are you escaping the data?
> You cant "just" copy a binary file.
>
> -Andy
From | Date | Subject | |
---|---|---|---|
Next Message | sweta | 2009-08-03 05:41:25 | Re: Drop Cluster] |
Previous Message | Ilya Urikh | 2009-08-03 02:45:53 | Problem with Numerics multiplication in C-function |