Re: autocommit (true/false) for more than 1 million records

From: Alex Goncharov <alex(dot)goncharov(dot)usa(at)gmail(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: autocommit (true/false) for more than 1 million records
Date: 2014-08-27 01:20:45
Message-ID: CAOnt2=MG4L3Prb=AQ3Etqs3kOZbPDw7Y-ap0MHn75GR0LfV2Fg@mail.gmail.com
Lists: pgsql-performance

Thank you, Kevin -- this is helpful.

But it still leaves questions for me.

Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:

> Alex Goncharov <alex(dot)goncharov(dot)usa(at)gmail(dot)com> wrote:

> > The whole thing is aborted then, and the good 99 records are not
> > making it into the target table.
>
> Right. This is one reason people often batch such copies or check
> the data very closely before copying in.

How do I decide, before starting a COPY data load, whether such load
protection ("complexity") makes sense ("is necessary")?

Clearly it is not needed for 1 MB of data in a realistic environment.

Clearly it is needed for loading 1 TB in a realistic environment.

To put it differently: If I COPY 1 TB of data, what criteria should I
use for choosing the size of the chunks to split the data into?

For INSERT loading, with database client interfaces that offer an
array mode, the performance difference between loading 100 and 1000
rows at a time is usually negligible, if there is any difference at
all. Therefore 100-row and 1000-row array sizes are both reasonable
choices.
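
For what it's worth, here is a minimal sketch of what I mean by
array-mode INSERT loading, using psycopg2 (the DSN, table, and column
names are made up; page_size plays the role of the array size):

# Sketch: batch INSERT loading via psycopg2's execute_values.
# The DSN, table, and columns are hypothetical placeholders.
import psycopg2
from psycopg2.extras import execute_values

rows = [(i, "value %d" % i) for i in range(10000)]  # sample data

with psycopg2.connect("dbname=test") as conn:
    with conn.cursor() as cur:
        # page_size is the "array size"; 100 vs. 1000 rarely matters.
        execute_values(
            cur,
            "INSERT INTO target_table (id, payload) VALUES %s",
            rows,
            page_size=1000,
        )
    # the connection context manager commits on success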

But what is a reasonable size for a COPY chunk? It can't even be
measured in rows.

Note that if you have a 1 TB record-formatted file to load, you can't
just split it into 1 MB chunks and feed them to COPY -- the file has
to be split on record boundaries.

So, splitting the data for COPY is not a trivial operation, and if
such splitting can be avoided, a reasonable operator will avoid it.

But then again: when can it be avoided?
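
And if it can't be avoided, the splitting would have to look roughly
like the sketch below (assuming one record per line, a made-up
target_table, and psycopg2's copy_expert for COPY ... FROM STDIN);
each chunk gets its own transaction, so a bad record aborts only that
chunk:

# Sketch: feed a large record-per-line file to COPY in separate
# transactions, splitting only on line (record) boundaries.
# The file name, table name, and chunk size are hypothetical.
import io
import itertools
import psycopg2

CHUNK_ROWS = 100_000  # rows per COPY; a guess, not a recommendation

def copy_in_chunks(path, dsn="dbname=test"):
    conn = psycopg2.connect(dsn)
    try:
        with open(path, "r") as f:
            while True:
                lines = list(itertools.islice(f, CHUNK_ROWS))
                if not lines:
                    break
                buf = io.StringIO("".join(lines))
                with conn.cursor() as cur:
                    # stream this chunk through COPY ... FROM STDIN
                    cur.copy_expert(
                        "COPY target_table FROM STDIN WITH (FORMAT text)", buf
                    )
                conn.commit()  # a bad record now aborts only this chunk
    finally:
        conn.close()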

> > My question is: Where have these 99 records been living, on
> > the database server, while the 100th one hasn't arrived yet, and
> > the need to throw the previous data accumulation away has not
> > arisen yet?
>
> They will have been written into the table. They do not become
> visible to any other transaction until and unless the inserting
> transaction successfully commits. These slides may help:
>
> http://momjian.us/main/writings/pgsql/mvcc.pdf

Yeah, I know about the MVCC model... The question is about the huge
data storage that has to be reserved, without a commit, while the load
is not yet complete -- and about the size constraints in effect here.

> > There have to be some limits to the space and/or counts taken by
> > the new, uncommitted, data, while the COPY operation is still in
> > progress. What are they?
>
> Primarily disk space for the table.

How can that be found? Is "df /mount/point" the deciding factor? Or
some 2^32 or 2^64 number?

> If you are not taking advantage of the "unlogged load" optimization,
> you will have written Write Ahead Log (WAL) records, too -- which
> (depending on your configuration) you may be archiving. In that
> case, you may need to be concerned about the archive space required.

"... may need to be concerned ..." if what? Loading 1 MB? 1 GB? 1 TB?

If I am always concerned, and check something before a COPY, what
should I be checking? What are the "OK-to-proceed" criteria?
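
The only pre-check I can think of is a back-of-the-envelope one like
the sketch below -- the data directory path and the 2x safety factor
for WAL and row overhead are pure guesses on my part, not documented
figures:

# Sketch: compare free space on the filesystem holding the data
# directory with the size of the input file before a big COPY.
# The path and the 2x safety factor are assumptions, not limits
# documented by PostgreSQL.
import os
import shutil

def enough_room(input_file, data_dir="/var/lib/postgresql/data",
                safety_factor=2.0):
    need = os.path.getsize(input_file) * safety_factor  # heap + WAL guess
    free = shutil.disk_usage(data_dir).free             # what "df" reports
    return free > need

if not enough_room("/tmp/big_load.csv"):
    raise SystemExit("not enough free disk space for this load")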

> If you have foreign keys defined for the table, you may get into
> trouble on the RAM used to track pending checks for those
> constraints. I would recommend adding any FKs after you are done
> with the big bulk load.

I am curious about the simplest case, where only the data storage
needs to be worried about. (As an aside: CHECK and NOT NULL
constraints are not a storage factor, right?)
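
Just to make sure I understand the FK suggestion: it would look
roughly like this sketch, with the COPY done first and the constraint
added afterwards (all object names made up)?

# Sketch: bulk-load first, add the foreign key afterwards, so no
# pending FK checks accumulate in RAM during the COPY.
# All object names here are hypothetical.
import psycopg2

with psycopg2.connect("dbname=test") as conn:
    with conn.cursor() as cur:
        with open("/tmp/big_load.csv") as f:
            cur.copy_expert(
                "COPY target_table FROM STDIN WITH (FORMAT csv)", f
            )
        cur.execute("""
            ALTER TABLE target_table
                ADD CONSTRAINT target_parent_fk
                FOREIGN KEY (parent_id) REFERENCES parent_table (id)
        """)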

> PostgreSQL does *not* have a "rollback log" which will impose a
> limit.

Something will though, right? What would that be? The available disk
space on a file system? (I would be surprised.)

> > Say, I am COPYing 100 TB of data and the bad records are close
> > to the end of the feed -- how will this all error out?
>
> The rows will all be in the table, but not visible to any other
> transaction.

I see. How much data can I fit there while doing COPY? Not 1 TB?

-- Alex

On Tue, Aug 26, 2014 at 6:33 PM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:

> Alex Goncharov <alex(dot)goncharov(dot)usa(at)gmail(dot)com> wrote:
>
> > Suppose I COPY a huge amount of data, e.g. 100 records.
> >
> > My 99 records are fine for the target, and the 100-th is not --
> > it comes with a wrong record format or a target constraint
> > violation.
> >
> > The whole thing is aborted then, and the good 99 records are not
> > making it into the target table.
>
> Right. This is one reason people often batch such copies or check
> the data very closely before copying in.
>
> > My question is: Where have these 99 records been living, on
> > the database server, while the 100th one hasn't arrived yet, and
> > the need to throw the previous data accumulation away has not
> > arisen yet?
>
> They will have been written into the table. They do not become
> visible to any other transaction until and unless the inserting
> transaction successfully commits. These slides may help:
>
> http://momjian.us/main/writings/pgsql/mvcc.pdf
>
> > There have to be some limits to the space and/or counts taken by
> > the new, uncommitted, data, while the COPY operation is still in
> > progress. What are they?
>
> Primarily disk space for the table. If you are not taking
> advantage of the "unlogged load" optimization, you will have
> written Write Ahead Log (WAL) records, too -- which (depending on
> your configuration) you may be archiving. In that case, you may
> need to be concerned about the archive space required. If you have
> foreign keys defined for the table, you may get into trouble on the
> RAM used to track pending checks for those constraints. I would
> recommend adding any FKs after you are done with the big bulk load.
>
> PostgreSQL does *not* have a "rollback log" which will impose a limit.
>
> > Say, I am COPYing 100 TB of data and the bad records are close
> > to the end of the feed -- how will this all error out?
>
> The rows will all be in the table, but not visible to any other
> transaction. Autovacuum will clean them out in the background, but
> if you want to restart your load against an empty table it might be
> a good idea to TRUNCATE that table; it will be a lot faster.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
