From: "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: NOLOGGING option, or ?
Date: 2005-06-01 17:35:30
Message-ID: BEC33FF2.50CC%agoldshuv@greenplum.com
Lists: pgsql-hackers
I have been working on improving COPY command performance, and as a result
came up with some thoughts that may be better implemented in a new command
(i.e. LOAD DATA) rather than added to the existing COPY.
The improvements I made are in the COPY parsing logic: changing char-by-char
parsing to multi-character buffered fast parsing, using a bytebuffer (which
is faster than StringInfoData) and minimizing the number of loads into the
buffer. The data conversion and insertion parts of COPY I left untouched. As
a result, parsing performance increased by about 550%, and overall COPY
performance increased by:
Around 40% for a 15-column (mixed types) table.
Around 90% for a 1-column table.
(The difference between the two is caused by data conversion overhead.)
I will post the patch and more numbers to the list later today with more
details. I'll just note for now that it only works for delimited ASCII input
data when client and server encodings are the same. CSV and encoding
conversions may be added later; this is merely to show that data can be
loaded much faster.
Here are some things that make me think a new LOAD command is a good idea:
1) There seems to be a possibility that many COPY modifications/improvements
may be problematic to incorporate into the current postgres COPY code.
Furthermore, for backwards compatibility it may be desirable to keep the
COPY command as is while also having a way to run an improved one.
2) A modified command syntax introducing direct single-row error handling.
By direct I mean: a row that is rejected within the COPY command context
does not throw an error and roll back the whole transaction. Instead the
error is caught and recorded elsewhere, perhaps in an error table, with
additional information that can be retrieved later, and the following rows
continue to be processed. This way there is barely any error-handling
overhead. Recursively isolating bad rows into smaller batches is extremely
expensive for non-small data sets; it's not an option for serious users.
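The control flow I have in mind looks roughly like this (a hypothetical
Python sketch; load, convert_row and the error-tuple layout are illustrative
names, not anything from the patch):

```python
# Sketch of direct single-row error handling: a bad row is recorded in an
# "error table" and processing continues, instead of aborting the whole load.

def load(rows, convert_row):
    loaded, errors = [], []
    for lineno, raw in enumerate(rows, start=1):
        try:
            loaded.append(convert_row(raw))
        except ValueError as e:
            # record line number, raw data, and the reason -- then keep going
            errors.append((lineno, raw, str(e)))
    return loaded, errors

loaded, errors = load(
    ["1|a", "bad", "3|c"],
    lambda r: (int(r.split("|")[0]), r.split("|")[1]),
)
assert loaded == [(1, "a"), (3, "c")]
assert errors[0][:2] == (2, "bad")
```

The point is that the error path costs nothing on the good rows, unlike
retrying ever-smaller sub-batches after a transaction abort.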
3) Maybe have an option to indicate the EOL (end of line) format in the
command syntax. The current COPY code detects the line end from the first
data line, which is problematic if the first data line is malformed, and it
also disallows any CRs in the data file when the EOL is only a linefeed,
causing extra data errors in COPY processing. A CR is a valid data
character. Specifying the line end in the command syntax would avoid all
this badness.
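A small sketch of what I mean (toy Python, mine, not the patch): with an
explicit EOL, a CR byte inside a field is plain data, whereas autodetecting
the EOL from the first line can mis-split the very same input.

```python
# Split raw input into rows using only the declared EOL; any other control
# byte (e.g. a lone CR when EOL is "\n") stays inside its field as data.

def split_rows(data: bytes, eol: bytes) -> list[bytes]:
    rows = data.split(eol)
    if rows and rows[-1] == b"":   # drop the empty chunk after the final EOL
        rows.pop()
    return rows

data = b"a\rb|1\nc|2\n"           # CR embedded in the first field
assert split_rows(data, b"\n") == [b"a\rb|1", b"c|2"]
```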
4) Data integrity and escaping improvements. My patch now treats all
characters as data (unless escaped as a delimiter or EOL), and therefore
data integrity is preserved. (Take for example the valid data field
"file:\new\bang": after COPYing it into the database, querying for it from
psql returns wrong data -- the backslashes are gone, there are 2 data lines,
and a bell rings for \b!) However, some people already used to the postgres
COPY escaping may want to keep it. They could do so by still using the old
COPY.
5) Allow an ERRORLIMIT to control aborting a load after a certain number of
errors (a prerequisite for this is point 2 above).
6) Allow LIMIT and OFFSET, for files with header rows for example (this
could be done in COPY too).
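For instance (a toy Python sketch under my own naming, not proposed syntax):
OFFSET/LIMIT applied to the input stream would skip a header row and cap the
number of data rows taken, before any parsing happens.

```python
import itertools

# Yield at most `limit` input lines, starting after the first `offset` lines.
def windowed(lines, offset=0, limit=None):
    stop = None if limit is None else offset + limit
    yield from itertools.islice(lines, offset, stop)

lines = ["col_a|col_b", "1|x", "2|y", "3|z"]
# Skip the header row, take two data rows.
assert list(windowed(lines, offset=1, limit=2)) == ["1|x", "2|y"]
```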
7) Allow the blocks to be directly written to the table, rather than via
the buffer cache.
8) Allow a bulk index insertion operation at the end of the LOAD step, if
the data has been loaded in sorted order. Use something like the SORTED
INDEXES statement on Oracle sql*loader to specify the sort order of the
incoming data, so that the index build step can bypass another external
sort before loading directly into the index.
9) Allow for Simon's WAL bypass.
I have surely missed some problems that hide behind the idea, but these
points make me believe that LOAD DATA is a good idea.
Alon.
On 5/31/05 7:47 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>> Recent test results have shown a substantial performance improvement
>> (+25%) if WAL logging is disabled for large COPY statements.
>
> How much of that is left after we fix the 64-bit-CRC issue?
>
>> Now, I would like to discuss adding an enable_logging USERSET GUC,
>
> [ fear and loathing ... ]
>
> I don't like the idea of a GUC at all, and USERSET is right out.
> I think it would have to be system-wide (cf fsync) to be even
> implementable let alone somewhat predictable. Even if it could
> be done per-backend with reasonable semantics, random users should
> not get to make that decision --- it should be the DBA's call,
> which means it needs at least SUSET permissions.
>
> BTW, I'm sure you are the last one who needs to be reminded that
> any such thing breaks PITR completely. Which is surely sufficient
> reason not to let it be USERSET.
>
> regards, tom lane
>