
Re: Reducing the overhead of NUMERIC data

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	mark(at)mark(dot)mielke(dot)cc, Gregory Maxwell <gmaxwell(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Reducing the overhead of NUMERIC data
Date:	2005-11-04 19:11:27
Message-ID:	20051104191127.GE13966@svana.org
Lists:	pgsql-hackers pgsql-patches

On Fri, Nov 04, 2005 at 01:54:04PM -0500, Tom Lane wrote:
> mark(at)mark(dot)mielke(dot)cc writes:
> > I read "the backend is by and large an ASCII, null-terminated-string
> > engine" with "we use UTF-8 [for varlena strings?]" as, a lot of the
> > code assumes varlena strings are '\0' terminated, and an assumption
> > on my part, that the varlena strings are not stored in the backend
> > with a '\0' terminator, therefore, they require being copied out,
> > terminated with a '\0', before they can be used?
>
> There are places where we have to do that, the worst from a performance
> viewpoint being in string comparison --- we have to null-terminate both
> values before we can pass them to strcoll().
>
> One of the large bits that would have to be done before we could even
> contemplate using UCS2/UCS4 is getting rid of our dependence on strcoll,
> since its API is null-terminated-string.
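
To make the cost described above concrete, here is a minimal sketch of the copy-then-compare pattern. The helper name compare_counted is hypothetical and this is not the actual backend code; it only assumes the inputs are length-counted byte strings with no trailing '\0', which is why each one has to be copied and terminated before strcoll() can see it.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not PostgreSQL source): compare two length-counted
 * byte strings that carry no trailing '\0'.  strcoll() only accepts
 * null-terminated strings, so each input is copied into a scratch buffer
 * and terminated first.  The two allocations and copies are the per-call
 * overhead discussed in the thread.
 */
int
compare_counted(const char *a, size_t alen, const char *b, size_t blen)
{
    char *abuf = malloc(alen + 1);
    char *bbuf = malloc(blen + 1);
    int   result;

    if (abuf == NULL || bbuf == NULL)
    {
        free(abuf);
        free(bbuf);
        abort();                /* out of memory: give up in this sketch */
    }

    memcpy(abuf, a, alen);
    abuf[alen] = '\0';
    memcpy(bbuf, b, blen);
    bbuf[blen] = '\0';

    /* Collation follows the current LC_COLLATE locale setting. */
    result = strcoll(abuf, bbuf);

    free(abuf);
    free(bbuf);
    return result;
}
```

A caller would pass each value's data pointer and byte length. Only the sign of the result is meaningful.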

Yeah, and while one way of removing that dependence is to use ICU, that
library wants everything in UTF-16. So we replace "copying to add a NUL
to the string" with "converting UTF-8 to UTF-16 on each call". Ugh! The
argument for UTF-16 is that if you're using a language that doesn't use
ASCII at all, UTF-8 gets inefficient pretty quickly.
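
As a rough illustration of that storage argument, the sketch below counts how many bytes a UTF-8 string would occupy if it were re-encoded as UTF-16. The function name utf16_bytes_for_utf8 is made up for this example, and it assumes the input is already well-formed UTF-8 (no validation is done).

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of bytes needed to store the given UTF-8
 * text as UTF-16 (without a byte-order mark).  Assumes well-formed input.
 *
 * Each code point is counted once, at its lead byte.  Code points encoded
 * as 4 bytes in UTF-8 (lead byte >= 0xF0) lie outside the BMP and need a
 * surrogate pair, i.e. two 16-bit units; everything else needs one unit.
 */
size_t
utf16_bytes_for_utf8(const unsigned char *s, size_t len)
{
    size_t units = 0;
    size_t i;

    for (i = 0; i < len; i++)
    {
        unsigned char c = s[i];

        if ((c & 0xC0) == 0x80)
            continue;           /* continuation byte, already counted */

        units += (c >= 0xF0) ? 2 : 1;
    }
    return units * 2;           /* two bytes per 16-bit code unit */
}
```

For three ASCII characters this gives 6 bytes against 3 in UTF-8, while three CJK characters take 9 bytes in UTF-8 and 6 in UTF-16. Scripts such as Cyrillic or Greek take 2 bytes per character in either encoding, so the saving depends on the language.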

Locale-sensitive, efficient storage, fast comparisons: pick any two!

My guess is that in the long run there would be two basic string
datatypes: one a UTF-8, null-terminated string used in the backend code
as a standard C string, with strcmp as the default collation; the other
UTF-16, for user data that wants to be able to collate in a
locale-dependent way.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

In response to

Re: Reducing the overhead of NUMERIC data at 2005-11-04 18:54:04 from Tom Lane

Responses

Re: Reducing the overhead of NUMERIC data at 2005-11-04 19:43:16 from Tom Lane
Re: Reducing the overhead of NUMERIC data at 2005-11-04 19:49:27 from Gregory Maxwell
