Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: Re: [PATCH] json_lex_string: don't overread on bad UTF8
Date: 2024-05-02 23:29:18
Message-ID: CAOYmi+k_U2zuXhCswwBbo8fH743n4Gzr=hsYH+eyBa+mEOs7Rg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, May 1, 2024 at 8:40 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Thu, May 02, 2024 at 11:23:13AM +0900, Michael Paquier wrote:
> > About the fact that we may finish by printing unfinished UTF-8
> > sequences, I'd be curious to hear your thoughts. Now, the information
> > provided about the partial byte sequences can be also useful for
> > debugging on top of having the error code, no?

Yes, but which information do you want? Do you want to know the bad
byte sequence, or see the glyph that corresponds to it (which is
probably �)? The glyph is better as long as it's complete; if it's a
bad sequence, then maybe you'd prefer to know the particular byte, but
that assumes a lot of technical knowledge on the part of whoever's
reading the message.

> By the way, as long as I have that in mind.. I am not sure that it is
> worth spending cycles in detecting the unfinished sequences and make
> these printable. Wouldn't it be enough for more cases to adjust
> token_error() to truncate the byte sequences we cannot print?

Maybe. I'm beginning to wonder if I'm overthinking this particular
problem, and if we should just go ahead and print the bad sequence. At
least for the case of UTF-8 console encoding, replacement glyphs will
show up as needed.

There is the matter of a client that's not using UTF-8, though. Do we
deal with that correctly today? (I understand why it was done the way
it was, at least on the server side, but it's still really weird to
have code that parses "JSON" that isn't actually Unicode.)

> Another thing that I think would be nice would be to calculate the
> location of what we're parsing on a given line, and provide that in
> the error context. That would not be backpatchable as it requires a
> change in JsonLexContext, unfortunately, but it would help in making
> more sense with an error if the incomplete byte sequence is at the
> beginning of a token or after an expected character.

+1, at least that way you can skip directly to the broken spot during
a postmortem.

Thanks,
--Jacob

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Devrim Gündüz 2024-05-02 23:36:33 Weird "null" errors during DROP TYPE (pg_upgrade)
Previous Message Tom Lane 2024-05-02 23:19:34 Re: Removing unneeded self joins