Re: Bug in UTF8-Validation Code?

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-17 20:28:53
Message-ID: 45FC4F85.7090804@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane wrote:
> I wrote:
>
>> Actually, I have to take back that objection: on closer look, COPY
>> validates the data only once and does so before applying its own
>> backslash-escaping rules. So there is a risk in that path too.
>>
>
>
>> It's still pretty annoying to be validating the data twice in the
>> common case where no backslash reduction occurred, but I'm not sure
>> I see any good way to avoid it.
>>
>
> Further thought here: if we put encoding verification into textin()
> and related functions, could we *remove* it from COPY IN, in the common
> case where client and server encodings are the same? Currently, copy.c
> forces a trip through pg_client_to_server for multibyte encodings
> even when the encodings are the same, so as to perform validation.
> But I'm wondering whether we'd still need that. There's no risk of
> SQL injection in COPY data. Bogus input encoding could possibly
> make for confusion about where the field boundaries are, but bad
> data is bad data in any case.
>
> regards, tom lane
>
>

Here are some timing tests in 1m rows of random utf8 encoded 100 char
data. It doesn't look to me like the saving you're suggesting is worth
the trouble.

baseline:

Time: 28228.325 ms
Time: 25987.740 ms
Time: 25950.707 ms
Time: 25756.371 ms
Time: 27589.719 ms
Time: 25774.417 ms

after adding suggested extra test to textin():

Time: 26722.376 ms
Time: 28343.226 ms
Time: 26529.364 ms
Time: 28020.140 ms
Time: 24836.853 ms
Time: 24860.530 ms

Script is:

\timing
create table xyz (x text);
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
drop table xyz;

Test platform: FC6, Athlon64.

cheers

andrew

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Florian G. Pflug 2007-03-17 20:44:29 Re: Project suggestion: benchmark utility for PostgreSQL
Previous Message Hiroshi Saito 2007-03-17 19:23:58 Re: Bison 2.1 on win32