From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: lockhart(at)alumni(dot)caltech(dot)edu
Cc: t-ishii(at)sra(dot)co(dot)jp, Inoue(at)tpf(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Re: LIKE gripes
Date: 2000-08-11 08:13:47
Message-ID: 20000811171347P.t-ishii@sra.co.jp
Lists: pgsql-hackers
> To get the length I'm now just running through the output string looking
> for a zero value. This should be more efficient than reading the
> original string twice; it might be nice if the conversion routines
> (which now return nothing) returned the actual number of pg_wchars in
> the output.
Sounds reasonable. I'm going to enhance them as you suggested.
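Something along these lines is what I have in mind; a rough sketch only, where pg_next_mblen() and pg_char_to_wchar() are illustrative stand-ins rather than the actual backend routines:

    /*
     * Rough sketch, not backend source: convert a multibyte string into a
     * pg_wchar array and return how many pg_wchars were written, so the
     * caller no longer has to re-scan the output for the terminating zero.
     */
    typedef unsigned int pg_wchar;      /* 32-bit internal character */

    extern int      pg_next_mblen(const unsigned char *mbstr);           /* assumed */
    extern pg_wchar pg_char_to_wchar(const unsigned char *mbstr, int l); /* assumed */

    static int
    mb2wchar_with_count(const unsigned char *from, pg_wchar *to, int len)
    {
        int     nchars = 0;

        while (len > 0 && *from)
        {
            int     mblen = pg_next_mblen(from);

            *to++ = pg_char_to_wchar(from, mblen);
            from += mblen;
            len -= mblen;
            nchars++;
        }
        *to = 0;                /* keep the existing zero terminator */
        return nchars;          /* number of pg_wchars produced */
    }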
> The original like() code allocates a pg_wchar array dimensioned by the
> number of bytes in the input string (which happens to be the absolute
> upper limit for the size of the 32-bit-encoded string). Worst case, this
> results in a 4-1 expansion of memory, and always requires a
> palloc()/pfree() for each call to the comparison routines.
Right.
There would be another approach that avoids using that extra memory
space. However, I am not sure it is worth implementing right now.
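For reference, the pattern you describe looks roughly like the sketch below (plain C, with malloc/free standing in for palloc/pfree and mb2wchar()/wchar_like_match() as assumed names, not the committed code). Sizing the buffers by byte length is always safe, but it is exactly where the worst-case 4-1 expansion comes from:

    #include <stdbool.h>
    #include <stdlib.h>

    typedef unsigned int pg_wchar;

    extern int  mb2wchar(const unsigned char *from, pg_wchar *to, int len); /* assumed */
    extern bool wchar_like_match(const pg_wchar *s, const pg_wchar *p);     /* assumed */

    static bool
    mb_like(const unsigned char *str, int slen,
            const unsigned char *pat, int plen)
    {
        /* byte length == worst-case character count: up to 4-1 expansion */
        pg_wchar   *wstr = malloc((slen + 1) * sizeof(pg_wchar));
        pg_wchar   *wpat = malloc((plen + 1) * sizeof(pg_wchar));
        bool        result;

        mb2wchar(str, wstr, slen);
        mb2wchar(pat, wpat, plen);
        result = wchar_like_match(wstr, wpat);

        free(wstr);
        free(wpat);
        return result;
    }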
> I think I have a solution for the current code; could someone test its
> behavior with MB enabled? It is now committed to the source tree; I know
> it compiles, but afaik am not equipped to test it :(
It passed the MB test, but fails the string test. Yes, I know it fails
because ILIKE for MB is not implemented (yet). I'm looking forward to
implementing the missing part. Is that ok with you, Thomas?
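One plausible way to fill that gap, just as a sketch under my own assumptions (fold only the ASCII range, pass everything else through unchanged, then reuse the ordinary pg_wchar matcher; wchar_like_match() is again an assumed name):

    #include <ctype.h>
    #include <stdbool.h>

    typedef unsigned int pg_wchar;

    extern bool wchar_like_match(const pg_wchar *s, const pg_wchar *p); /* assumed */

    /* lower-case only ASCII-range pg_wchars; leave other characters alone */
    static void
    wchar_fold_ascii(pg_wchar *s)
    {
        for (; *s; s++)
        {
            if (*s < 0x80)
                *s = (pg_wchar) tolower((int) *s);
        }
    }

    static bool
    wchar_ilike_match(pg_wchar *s, pg_wchar *p)
    {
        /* fold in place; callers pass freshly converted, private copies */
        wchar_fold_ascii(s);
        wchar_fold_ascii(p);
        return wchar_like_match(s, p);
    }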
> I am not planning on converting everything to UniCode for disk storage.
Glad to hear that.
> What I would *like* to do is the following:
>
> 1) support each encoding "natively", using Postgres' type system to
> distinguish between them. This would allow strings with the same
> encodings to be used without conversion, and would both minimize storage
> requirements *and* run-time conversion costs.
>
> 2) support conversions between encodings, again using Postgres' type
> system to suggest the appropriate conversion routines. This would allow
> strings with different but compatible encodings to be mixed, but
> requires internal conversions *only* if someone is mixing encodings
> inside their database.
>
> 3) one of the supported encodings might be Unicode, and if one chooses,
> that could be used for on-disk storage. Same with the other existing
> encodings.
>
> 4) this different approach to encoding support can coexist with the
> existing MB support since (1) - (3) is done without mention of existing
> MB internal features. So you can choose which scheme to use, and can
> test the new scheme without breaking the existing one.
>
> imho this comes closer to one of the important goals of maximizing
> performance for internal operations (since there is less internal string
> copying/conversion required), even at the expense of extra conversion
> cost when doing input/output (a good trade since *usually* there are
> lots of internal operations to a few i/o operations).
>
> Comments?
Please note that the existing MB implementation does not incur such
extra conversion costs except in some MB-aware functions (text_length
etc.), regex, LIKE, and the input/output stage. Also, MB stores native
encodings 'as is' on disk.
Anyway, it looks like MB would eventually be merged into, or deprecated
by, your new implementation of multiple-encoding support.
BTW, Thomas, do you have a plan to support collation functions?
--
Tatsuo Ishii