Re: copy from questions

From: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
To: Kirk Wythers <kirk(dot)wythers(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: copy from questions
Date: 2012-12-19 21:46:48
Message-ID: 50D235C8.3010408@pinpointresearch.com
Lists: pgsql-general

On 12/19/2012 01:06 PM, Kirk Wythers wrote:
> Thanks for the reply Steve. These suggestions are new to me, so I'd
> like to rephrase them back to you in order to make sure I understand
> the bits and details.
>
>
>
> On Dec 19, 2012, at 10:48 AM, Steve Crawford
> <scrawford(at)pinpointresearch(dot)com
> <mailto:scrawford(at)pinpointresearch(dot)com>> wrote:
>>>
>> I suppose you could use a trigger to check each record before
>> inserting but that is likely to be inefficient for bulk loads. A
>> quick bash loop is probably your best bet. Something along the lines of:
>>
>> for inputfile in /infiledirectory/*.csv
>> do
>> cat "$inputfile" | psql [connection-params] -c '\copy rawinput from
>> stdin csv header...'
>> done
>
> I get this… except for the "header…" bit. Here is my
> interpretation of your code, including my specific connection parameters.
>
> #!/bin/sh
> for inputfile in '/Volumes/disk7/b4warmed3/export/60min2/*.txt'
> do
>     cat inputfile | psql -p 54321 -h localhost -c '\copy rawinput FROM
> stdin WITH CSV HEADER DELIMTER AS ',' NULL AS 'NA '
> done
You have a mixture of bash and PostgreSQL/psql issues.

Bash first: don't wrap the inputfile glob in quotes. Left unquoted, your
filespec is expanded to the list of files matching the pattern; quoted,
the loop runs exactly once with the literal pattern string. Also, inside
the loop you want the variable's value, i.e. cat "$inputfile", not
cat inputfile (which looks for a file literally named "inputfile").
I'm assuming that you have not populated your directory with filenames
that contain spaces. If you have, either correct it or start Googling for
how to deal with lists of space-containing file names.

Also, you will have problems with using single-quotes within a string
delimited by single-quotes. Combine that with the psql problem where it
appears you have to specify delimiter before CSV (though you probably
don't need to specify delimiter at all if your delimiter is a comma) and
things will blow up. Try:

.... -c "\copy rawinput FROM stdin with delimiter as ',' NULL AS 'NA'
CSV HEADER"
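Putting the bash and psql fixes together, a minimal sketch of the
corrected loop (the sample directory here is a hypothetical stand-in for
your /Volumes/disk7/b4warmed3/export/60min2 path, and the psql line is
shown as a comment so the sketch runs without a database):

```shell
#!/bin/sh
# Hypothetical sample files standing in for the real *.txt exports.
dir=$(mktemp -d)
printf 'rowid,val\n1,2\n'  > "$dir/one.txt"
printf 'rowid,val\n3,NA\n' > "$dir/two.txt"

loaded=0
for inputfile in "$dir"/*.txt      # unquoted glob -> expands to the file list
do
    # The real command would be (double quotes outside, single quotes inside):
    # psql -p 54321 -h localhost \
    #   -c "\copy rawinput FROM stdin with delimiter as ',' NULL AS 'NA' CSV HEADER" \
    #   < "$inputfile"
    echo "would load: $inputfile"
    loaded=$((loaded + 1))
done
rm -rf "$dir"
```

Note the quoting: double quotes around the -c argument so the single
quotes around ',' and 'NA' reach psql intact, and double quotes around
"$inputfile" so space-containing names survive.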

>
> I added single quotes around the path to the input files. Correct right?
>
>>
>> This imports everything into a "staging" table (I called it
>> rawinput). From there you can create your final table with SELECT
>> DISTINCT…
>
> This bit needs to be as a separate step right? (rowid is the primary key)
>
> SELECT DISTINCT ON (rowid) *
> FROM rawinput;
>
> From here do I need another COPY FROM or some kind of INSERT statement?

I had assumed you had fully duplicated rows, in which case you could simply do:
INSERT INTO realtable SELECT DISTINCT * FROM rawinput;
(assuming realtable and rawinput have exactly the same structure and
column ordering, of course)

If that is not the case, you will need to update your select statement
to return one from among your duplicated IDs based on your desired criteria.
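For example (a sketch only: tstamp is a hypothetical column used here to
decide which duplicate wins; substitute whatever column expresses "the
row to keep"):

```sql
-- Keep exactly one row per rowid; ORDER BY decides which duplicate survives.
INSERT INTO realtable
SELECT DISTINCT ON (rowid) *
FROM rawinput
ORDER BY rowid, tstamp DESC;
```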

>
>>
>> For speed make sure that you create your staging table as "unlogged".
>
> I understand that I need to create the rawinput table first, but I am
> unfamiliar with the "UNLOGGED" option. I assume it makes things
> faster… Does it go something like:
>
> CREATE TABLE UNLOGGED rawinput;
>
Something like but not exactly. The syntax is CREATE UNLOGGED TABLE...
Except for UNLOGGED before TABLE it's exactly like a regular
create-table statement.

BEWARE: Unlogged is *NOT* a magic go-fast directive that comes at no
cost. In normal tables data is written to the transaction log allowing
the database to recover in the event of an unplanned shutdown. Unlogged
tables, as the name implies, skip the writing of the write-ahead log
gaining speed at the expense of data safety. For bulk-load operations
that can be redone if they fail this is a reasonable approach. For
normal operations where you expect your data to be well protected it is
not. Depending on a variety of factors, the speed improvement you will
see ranges from "minimal" to "substantial".

>
> Do I need to create all the variables (including types) in rawinput as
> well? If so, then I assume that I do not want rowid to have a primary
> key… or else I would be back where I started.
I would make rawinput exactly the same as your real table except without
indexes or constraints. The easiest way is probably:
CREATE UNLOGGED TABLE rawinput AS SELECT * FROM realtable LIMIT 0;
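So the whole staging sequence, sketched end to end (assuming realtable
already exists and the duplicates are exact duplicates):

```sql
CREATE UNLOGGED TABLE rawinput AS SELECT * FROM realtable LIMIT 0;
-- ...bulk-load all the files into rawinput with \copy here...
INSERT INTO realtable SELECT DISTINCT * FROM rawinput;
DROP TABLE rawinput;  -- staging table is disposable once the load succeeds
```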

>
>
>>
>> Cheers,
>> Steve
>>
>
> Thanks again,
>
> Kirk
>
>
Cheers,
Steve
