Re: varchar as primary key

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Staubo <alex(at)purefiction(dot)net>, Matthew Hixson <hixson(at)poindextrose(dot)org>, Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: varchar as primary key
Date: 2007-05-04 17:46:51
Message-ID: 1178300811.12000.41.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Thu, 2007-05-03 at 23:08 -0400, Tom Lane wrote:
> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> > If you're using a non-C locale, it's slower than strcmp() too.
> > PostgreSQL has to do an extra memcpy() in order to use strcoll(),
> > because strings in postgresql aren't necessarily NULL-terminated and
> > there's no such thing as strncoll(), unfortunately (a comment in the
> > code points this out).
>
> The memcpy is the least of the problem --- in many non-C locales,
> strcoll() is simply a dog, because the collation rules are ridiculously
> complex.
>

I was going by:

src/backend/utils/adt/varlena.c

/*
* Unfortunately, there is no strncoll(), so in the non-C locale case we
* have to do some memory copying. This turns out to be significantly
* slower, so we optimize the case where LC_COLLATE is C. We also try
to
* optimize relatively-short strings by avoiding palloc/pfree overhead.
*/

I tried with some simple C code, and it looks like the memcpy() does
account for a significant slowdown -- at least in my case (en_US.UTF-8).

In my tests, I was just comparing two strings:
"abcdefghijklmnop1"
"abcdefghijklmnop2"
(which seem to be "normal" length for a string that might be sorted)

I used strcmp() and strcoll() in a tight loop, and the result was
indistinguishable. However, when I added in two memcpy()s -- which are
necessary for any non-C locale -- it slowed down drastically (an order
of magnitude).

I also made two test data directories on my workstation, one C and one
UTF-8, and then I made a table in each consisting of 1M records of md5
(random()). The one with locale C took about 9 seconds to sort, and the
one with en_US.UTF-8 took about 16 seconds to sort.

So, I think for some locales memcpy() is the problem and in others the
memcpy() overhead may be overshadowed by strcoll().

When memcpy() is the problem it is disappointing because it would be
completely unnecessary if only there existed a strncoll() :(

Regards,
Jeff Davis

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2007-05-04 17:52:48 Re: varchar as primary key
Previous Message Merlin Moncure 2007-05-04 17:23:38 Re: varchar as primary key