Re: autocommit (true/false) for more than 1 million records

From: David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: autocommit (true/false) for more than 1 million records
Date: 2014-08-27 03:40:19
Message-ID: CAKFQuwa2Pz6EWnG6_+aKG8wpmXf9U2RthH8B02FzRpfaOvQ1cQ@mail.gmail.com
Lists: pgsql-performance

On Tue, Aug 26, 2014 at 9:21 PM, Alex Goncharov wrote:

> Thank you, Kevin -- this is helpful.
>
> But it still leaves questions for me.
>
>
> Kevin Grittner wrote:
>
> > Alex Goncharov wrote:
>
> > > The whole thing is aborted then, and the good 99 records are not
> > > making it into the target table.
> >
> > Right. This is one reason people often batch such copies or check
> > the data very closely before copying in.
>
> How do I decide, before starting a COPY data load, whether such a load
> protection ("complexity") makes sense ("is necessary")?
>
>
You should probably consider something like:

http://pgloader.io/

(I know there are others, this one apparently has the best marketing
team...)

In the normal case, with a plain COPY, you load a bad file into an empty
table, it fails, you truncate, and you get better data for the next attempt.

How long that will take is system (IOPS/CPU) and data dependent.

The probability of failure is source dependent - and prior experience plays
a large role here as well.

If you plan to load directly into a live table, the wasted space from a bad
load could kill you, so smaller partial loads are better - if you can afford
the implicit system inconsistency such a partial load would cause.

If you understand how the system works, you should be able to evaluate the
different pieces and come to a conclusion as to how best to proceed in a
specific situation. No one else on this list has the relevant information
to make that judgement call. If this is just asking about rules of thumb,
I'd say figure out how many records 100MB consumes and COMMIT after that
many records. 10,000 records is also a nice round number to pick -
regardless of how many MB that consumes. Start there and tweak based upon
experience.
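
For illustration only, here is a minimal sketch of that rule of thumb from a
client script - the psycopg2 driver, the table name (staging_tbl), the file
name, and the connection string are all my own assumptions, not part of any
actual load job discussed here:

import io
import psycopg2

BATCH_ROWS = 10_000  # rule-of-thumb batch size; tune based upon experience

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

def flush(buf):
    """COPY one accumulated batch into the (hypothetical) staging table and commit it."""
    buf.seek(0)
    cur.copy_expert("COPY staging_tbl FROM STDIN WITH (FORMAT csv)", buf)
    conn.commit()

with open("big_load.csv") as src:          # placeholder file name
    buf, n = io.StringIO(), 0
    for line in src:
        buf.write(line)
        n += 1
        if n % BATCH_ROWS == 0:            # commit after each batch so a bad row
            flush(buf)                     # only rolls back the current batch
            buf = io.StringIO()
    if buf.tell():                         # flush the final partial batch
        flush(buf)

cur.close()
conn.close()

A failure then costs you at most one batch of work plus the cleanup of
whatever that batch wrote, rather than the whole file.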

> > If you are not taking advantage of the "unlogged load" optimization,
> > you will have written Write Ahead Log (WAL) records, too -- which
> > (depending on your configuration) you may be archiving. In that
> > case, you may need to be concerned about the archive space required.
>
> "... may need to be concerned ..." if what? Loading 1 MB? 1 GB? 1 TB?
>
> If I am always concerned, and check something before a COPY, what
> should I be checking? What are the "OK-to-proceed" criteria?
>
>
If you only have 500k free in your archive directory, that 1MB file will
pose a problem... though if you have 4TB of archive available, the 1TB would
fit easily. Do you compress your WAL files before shipping them off to the
archive? How compressible is your data?
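
As a back-of-the-envelope sketch of that check - the paths are invented, and
the assumption of roughly one byte of WAL per byte loaded is mine, not a
measured figure:

import os
import shutil

load_file = "/data/big_load.csv"            # placeholder path to the COPY source
archive_dir = "/var/lib/pgsql/wal_archive"  # placeholder WAL archive location

needed = os.path.getsize(load_file)         # crude guess: ~1 byte of WAL per byte loaded
free = shutil.disk_usage(archive_dir).free

print(f"need roughly {needed / 1e9:.1f} GB of archive space; {free / 1e9:.1f} GB free")
if free < needed:
    print("archive looks tight: compress WAL, add space, or load in smaller batches")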

I'm sure people have decent rules of thumb here, but in the end your
specific environment and data, especially at the TB scale, are what matter -
and that is something you will only discover through testing.

>
> > If you have foreign keys defined for the table, you may get into
> > trouble on the RAM used to track pending checks for those
> > constraints. I would recommend adding any FKs after you are done
> > with the big bulk load.
>
> I am curious about the simplest case where only the data storage is to
> be worried about. (As an aside: the CHECK and NOT NULL constraints are
> not a storage factor, right?)
>
>
Correct

>
> > PostgreSQL does *not* have a "rollback log" which will impose a
> > limit.
>
> Something will though, right? What would that be? The available disk
> space on a file system? (I would be surprised.)
>
>
> > > Say, I am COPYing 100 TB of data and the bad records are close
> > > to the end of the feed -- how will this all error out?
> >
> > The rows will all be in the table, but not visible to any other
> > transaction.
>
> I see. How much data can I fit there while doing COPY? Not 1 TB?
>
> -- Alex
>

You need the same amount of space that you would need if the whole file
had imported to completion.

PostgreSQL is optimistic in this regard - it assumes you will commit, so up
until the failure there is no difference between a good and a bad import.
The magic is described on slide 24 of the MVCC presentation
(http://momjian.us/main/writings/pgsql/mvcc.pdf) - if the transaction is
aborted then, as far as the system is concerned, the written data has been
deleted and can be cleaned up just as if the following sequence of commands
had occurred:

BEGIN;
COPY tbl FROM ....;
COMMIT; -- success
DELETE FROM tbl ....;

Hence the comment to TRUNCATE after a failed load if at all possible - to
avoid the unnecessary VACUUM on tbl...
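
A sketch of that pattern - the table and file names are invented for the
example, and it assumes the target table holds nothing you want to keep:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

try:
    with open("big_load.csv") as src:   # placeholder file name
        cur.copy_expert("COPY tbl FROM STDIN WITH (FORMAT csv)", src)
    conn.commit()
except psycopg2.Error:
    conn.rollback()                 # the aborted rows are invisible but still on disk
    cur.execute("TRUNCATE tbl")     # drop them immediately instead of waiting for VACUUM
    conn.commit()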

QUESTION: would the VACUUM reclaim the disk space in this situation (I
presume yes)? Because if not, and another import were to be attempted,
ideally the allocated space could be reused.

I'm not sure what a reasonable formula would be, especially at the TB
scale, but roughly 2x the size of the imported (uncompressed) file would be
a good starting point (table + WAL). You likely would want many multiples
of this unless you are dealing with a one-off event. Indexes and dead
tuples in particular are likely to be involved. You get some leeway
depending on compression but that is data specific and thus something you
have to test yourself if you are operating at the margin of your system's
resources.
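
As a worked example of that starting point: a 500 GB uncompressed file would
suggest planning for at least ~1 TB free (table heap + WAL) before indexes
and dead tuples are considered. A trivial helper, where the path is made up
and the 2x multiplier is just the rule of thumb above, not a measured value:

import os

def rough_space_needed(path, multiplier=2.0):
    """Rule-of-thumb free-space estimate for a COPY load: table heap + WAL."""
    return os.path.getsize(path) * multiplier

print(f"plan for at least {rough_space_needed('/data/big_load.csv') / 1e9:.1f} GB free")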

David J.

