Re: Reducing the overhead of NUMERIC data

From: Gregory Maxwell <gmaxwell(at)gmail(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, mark(at)mark(dot)mielke(dot)cc, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reducing the overhead of NUMERIC data
Date: 2005-11-04 19:49:27
Message-ID: e692861c0511041149n6fe36345oba7c43d1d48bef3d@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

On 11/4/05, Martijn van Oosterhout <kleptog(at)svana(dot)org> wrote:
> Yeah, and while one way of removing that dependance is to use ICU, that
> library wants everything in UTF-16. So we replace "copying to add NULL
> to string" with "converting UTF-8 to UTF-16 on each call. Ugh! The
> argument for UTF-16 is that if you're using a language that doesn't use
> ASCII at all, UTF-8 gets inefficient pretty quickly.

Is this really the case? Only Unicode values U+0800 through U+FFFF are
smaller in UTF-16 than in UTF-8, and even there it's three bytes vs.
two. Cyrillic, Arabic, Greek, Latin, etc. are all two bytes in both.

So, yes, in some cases UTF-8 will use three bytes where UTF-16 would
use two, but that's less of an inefficiency than the one UTF-16 imposes
on ASCII (two bytes where UTF-8 needs one), and many people find that
acceptable.
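
To make the arithmetic concrete, here's a minimal sketch (plain C, not
backend code; the sample code points are just ones I picked) of the
per-code-point sizes I'm talking about:

#include <stdio.h>

/* Bytes needed to encode a single code point in UTF-8. */
static int utf8_len(unsigned int cp)
{
    if (cp <= 0x7F)   return 1;   /* ASCII */
    if (cp <= 0x7FF)  return 2;   /* Latin supplements, Greek, Cyrillic, Arabic, ... */
    if (cp <= 0xFFFF) return 3;   /* rest of the BMP (CJK, Indic scripts, ...) */
    return 4;                     /* supplementary planes */
}

/* Bytes needed to encode a single code point in UTF-16. */
static int utf16_len(unsigned int cp)
{
    return (cp <= 0xFFFF) ? 2 : 4;   /* BMP vs. surrogate pair */
}

int main(void)
{
    /* 'A', Cyrillic Zhe, a CJK ideograph, a supplementary-plane character */
    unsigned int samples[] = { 0x41, 0x416, 0x4E2D, 0x1F600 };

    for (int i = 0; i < 4; i++)
        printf("U+%04X: UTF-8 %d bytes, UTF-16 %d bytes\n",
               samples[i], utf8_len(samples[i]), utf16_len(samples[i]));
    return 0;
}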

> Locale sensetive, efficient storage, fast comparisons, pick any two!

I don't know that the choices are that limited. As I indicated earlier
in the thread, I think it's useful to think of all of these encodings
as just different compression algorithms. If our desire was to have
all three, the backend could be made null-safe and we could use the
locale-sensitive and fast representation (probably UTF-16 or UTF-32)
in memory, and store on disk whatever is most efficient for storage:
LZ-compressed UTF-whatever for fat fields, UTF-8 for mostly-ASCII
small fields, SCSU (http://www.unicode.org/reports/tr6/) for non-ASCII
short fields, and so on.
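
Roughly what I have in mind, as a hypothetical sketch only (none of
these identifiers exist in the backend, and the thresholds are made
up), is a per-value dispatch like this:

/*
 * "Encodings as compression": pick the cheapest on-disk representation
 * for each value.  Purely illustrative.
 */
typedef enum
{
    STORAGE_UTF8,       /* mostly-ASCII short fields              */
    STORAGE_SCSU,       /* non-ASCII short fields (Unicode TR #6) */
    STORAGE_LZ_UTF8     /* fat fields: generic LZ compression     */
} TextStorageKind;

static TextStorageKind
choose_storage(const unsigned char *utf8, int len, int ascii_bytes)
{
    if (len > 1024)
        return STORAGE_LZ_UTF8;          /* big enough for LZ to pay off */
    if (ascii_bytes * 10 >= len * 9)     /* at least 90% ASCII bytes     */
        return STORAGE_UTF8;
    return STORAGE_SCSU;
}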

> My guess is that in the long run there would be two basic string
> datatypes, one UTF-8, null terminated string used in the backend code
> as a standard C string, default collation strcmp. The other UTF-16 for
> user data that wants to be able to collate in a locale dependant way.

So if we need locale-dependent collation we suffer 2x inflation for
many texts, and multibyte complexity is still required if we are to
collate correctly when there are characters outside the BMP. Yuck.
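
To spell out the "multibyte complexity still required" part: UTF-16 is
not fixed-width either, since anything outside the BMP becomes a
surrogate pair. A minimal decoder sketch (not backend code, and it skips
validation of the low surrogate) looks like:

/* Decode one code point from UTF-16; *advance gets the number of
 * 16-bit units consumed (1 for BMP, 2 for a surrogate pair). */
static unsigned int
utf16_decode(const unsigned short *s, int *advance)
{
    if (s[0] >= 0xD800 && s[0] <= 0xDBFF)   /* high surrogate */
    {
        *advance = 2;
        return 0x10000 + ((s[0] - 0xD800) << 10) + (s[1] - 0xDC00);
    }
    *advance = 1;
    return s[0];
}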

Disk storage type, memory storage type, user API type, and collation
should be decoupled.
