| From: | Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> | 
|---|---|
| To: | pgsql-hackers(at)postgresql(dot)org | 
| Cc: | "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp>, Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp> | 
| Subject: | Patch for server-side encoding issues | 
| Date: | 2009-04-15 05:14:00 | 
| Message-ID: | 20090415135043.C439.52131E4D@oss.ntt.co.jp | 
| Lists: | pgsql-hackers | 
Here is a WIP patch to solve server-side encoding issues.
It includes the "Solution of the file name problem of copy on windows" patch:
    http://archives.postgresql.org/message-id/20090413184335.39BE.52131E4D@oss.ntt.co.jp
It could solve the following issues. They are neither Windows-specific nor
Japan-specific: they can also occur on POSIX platforms if you use databases
with multiple encodings, or a database whose encoding is not the platform's
native one.
<1> Non-ASCII file paths in a database whose encoding differs from the
    platform's encoding (which comes from $LANG or the Windows codepage),
    especially for COPY TO/FROM.
<2> Use the appropriate encoding for the non-text server log (console, syslog
    and eventlog). The encoding is the same as in <1>.
<3> Use the appropriate encoding for the text server log (stderr and csvlog),
    especially when the database cluster has databases with a variety of
    encodings. A new GUC parameter 'log_encoding' specifies the encoding of
    the server log.
<4> (incomplete) Avoid encoding conversion errors when printing the server log
    and messages for the client. Instead of raising an error, print '?' if
    there is no equivalent character in the target encoding.
For <4>, I use PG_TRY and PG_CATCH for now, but that is probably bad practice.
Instead, I'm thinking that the conversion procedures could take an optional
argument specifying whether they should raise an error or not. However, we
would need to modify all of the conversion functions to do so.
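To illustrate the idea for <4>, here is a minimal standalone sketch of a
conversion routine that takes such a flag. It is not code from the patch and
not an existing PostgreSQL function: the name utf8_to_latin1, its signature,
and the no_error parameter are assumptions made up for this example, and it
returns -1 where a real conversion function would raise an error. It also
skips checks a real converter needs, such as rejecting overlong sequences.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of PostgreSQL): convert a NUL-terminated
 * UTF-8 string to LATIN1.
 *
 * If no_error is true, characters with no LATIN1 equivalent (and malformed
 * sequences) are written as '?'. If it is false, the function fails on the
 * first such character, standing in for "raise an error".
 *
 * Returns the number of bytes written (excluding the NUL), or -1 on failure
 * or when dst is too small.
 */
int
utf8_to_latin1(const unsigned char *src, unsigned char *dst, size_t dstlen,
               bool no_error)
{
    size_t      out = 0;

    if (dstlen == 0)
        return -1;

    while (*src != '\0')
    {
        unsigned int cp;
        int         len;

        /* The lead byte determines the sequence length. */
        if (*src < 0x80)                { cp = *src;        len = 1; }
        else if ((*src & 0xE0) == 0xC0) { cp = *src & 0x1F; len = 2; }
        else if ((*src & 0xF0) == 0xE0) { cp = *src & 0x0F; len = 3; }
        else if ((*src & 0xF8) == 0xF0) { cp = *src & 0x07; len = 4; }
        else                            { cp = 0xFFFFFFFF;  len = 1; }

        /* Fold in continuation bytes; stop early on a malformed sequence. */
        for (int i = 1; i < len; i++)
        {
            if ((src[i] & 0xC0) != 0x80)
            {
                cp = 0xFFFFFFFF;    /* mark as unconvertible */
                len = i;            /* consume only the bytes seen so far */
                break;
            }
            cp = (cp << 6) | (src[i] & 0x3F);
        }

        if (out + 1 >= dstlen)
            return -1;              /* no room for this byte plus the NUL */

        if (cp <= 0xFF)
            dst[out++] = (unsigned char) cp;
        else if (no_error)
            dst[out++] = '?';       /* substitute instead of failing */
        else
            return -1;              /* caller asked for strict conversion */

        src += len;
    }

    dst[out] = '\0';
    return (int) out;
}
```

With no_error set, the UTF-8 bytes of '日本語' come out as "???"; without it,
the call fails on the first character. A logging path could pass
no_error = true, while ordinary query processing keeps the strict behavior.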
More research is needed for the following situations:
  - NLS messages
  - Module path for LOAD
  - Arguments for system(), including archive_command and restore_command
  - Query texts for other databases in pg_stat_activity and pg_stat_statements
Comments welcome. Please let me know if I'm missing something.
Here is some sample code to test the patch.
(client_encoding = sjis / system encoding = sjis)
----
C:\home\>createdb utfdb --encoding=utf8 --locale=C
C:\home\>createdb eucdb --encoding=eucjp --locale=C
C:\home\>psql utfdb -c "COPY (SELECT 1) TO 'C:/home/日本語ファイル.txt'"
C:\home\>psql utfdb -c "SELECT '日本語' WITH ERROR"
ERROR:  syntax error at or near "WITH ERROR"
LINE 1: SELECT '日本語' WITH ERROR
                        ^
C:\home\>psql eucdb -c "COPY (SELECT 1) TO 'C:/home/日本語ファイル.txt'"
C:\home\>psql eucdb -c "SELECT '日本語' WITH ERROR"
ERROR:  syntax error at or near "WITH ERROR"
LINE 1: SELECT '日本語' WITH ERROR
                        ^
----
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
| Attachment | Content-Type | Size | 
|---|---|---|
| server-side_encoding_issues_20090415.patch | application/octet-stream | 28.0 KB | 