Patch: [BUGS] BUG #12320: json parsing with embedded double quotes

From: Aaron Botsis <aaron(at)bt-r(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Francisco Olarte <folarte(at)peoplecall(dot)com>
Subject: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes
Date: 2015-01-07 13:25:34
Message-ID: 87B56448-146A-4ECD-96D3-8B09B89D4EEA@bt-r.com
Lists: pgsql-bugs pgsql-hackers

Hi folks, I was having a problem importing json data with COPY. Lots of things export data nicely as one json blob per line. This is excellent for directly importing into a JSON/JSONB column for analysis.

...Except when there’s an embedded double quote, or anything else that’s escaped. COPY handles this, but by the time the escaped char hits the JSON parser, it’s not escaped anymore. This breaks the JSON parsing, which means I need to manipulate the input data to double-escape it. See bug #12320 for an example. Yuck.
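
To illustrate (hypothetical data, consistent with the error context below): a file line such as

{"query":"user:\"root\""}

is de-escaped by COPY’s text format into

{"query":"user:"root""}

before it ever reaches the json input function, which then rejects the now-bare inner quotes. Double-escaping means rewriting every backslash in the file first, e.g. with something like sed 's/\\/\\\\/g', so that COPY’s de-escaping restores the original text.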

I propose this small patch that simply allows specifying COPY … ESCAPE without requiring the CSV parser. It will make it much easier for folks to use JSON-formatted export data directly going forward. This seemed like the simplest route.

Usage is simply:

postgres=# copy t1 from '/Users/nok/Desktop/queries.json';
ERROR: invalid input syntax for type json
DETAIL: Token "root" is invalid.
CONTEXT: JSON data, line 1: ...1418066241619 AND <=1418671041621) AND user:"root...
COPY t1, line 3, column bleh: "{"timestamp":"2014-12-15T19:17:32.505Z","duration":7.947,"query":{"query":{"filtered":{"filter":{"qu..."
postgres=# copy t1 from '/Users/nok/Desktop/queries.json' escape '';
COPY 1966
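
(For reference, the target table here is assumed to be a single json column, something like "create table t1 (bleh json);", inferred from the column name and error in the session above.)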

I’ve included regression tests, and all existing tests pass. This is my first contribution, so be kind to me. :)

> Begin forwarded message:
>
> From: Francisco Olarte <folarte(at)peoplecall(dot)com>
> Date: January 6, 2015 at 1:52:28 PM EST
> Subject: Re: [BUGS] BUG #12320: json parsing with embedded double quotes
> To: Aaron Botsis <aaron(at)bt-r(dot)com>
> Cc: postgres(at)bt-r(dot)com, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
>
> Hi Aaron:
>
> On Tue, Jan 6, 2015 at 7:06 PM, Aaron Botsis <aaron(at)bt-r(dot)com <mailto:aaron(at)bt-r(dot)com>> wrote:
> Hi Francisco, I’m aware, but still consider this to be a bug, or at least a great opportunity for an enhancement. :)
>
> Maybe, but you are going to have a problem.
>
> This had bitten me for the third time while trying to import some json data. It’d be great to bypass the COPY escaping (and possibly other metacharacter handling) when the column type is json or jsonb. I’d be happy to try and write it and submit a patch if folks believe this is an acceptable way to go… That said, I should probably read up on what the process is for this kind of thing :)
>
> Reading this, you are talking about 'the column being json'. COPY needs to do the escaping at the same time it's constructing the columns. The present way is easy to do: read char by char; if it's an escape, process the next char, accumulating into the current field; otherwise see whether it is a field or record separator and act accordingly. It's also layered: when you construct the records for COPY you get all the field data, turn them into escaped strings, join them with the field separator, and spit them out followed by a record separator ( practical implementations may do this virtually ).
>
> Intermixing this with 'I'm in a json column' would require passing information down from the upper layer, which makes it more difficult and, especially, error prone. What do you do if ( using the standard delimiters ) your json value has embedded newlines and tabs ( which, IIRC, are legal in several places inside the json )? And all this to make some incorrectly formatted files readable ( when they can be correctly formatted with a perl one-liner or something similar ). I'm not the one to decide, but I will vote against including that ( though do not trust me too much; I would also vote against including 'csv', which I consider the root of many evils ).
>
> Francisco Olarte.
>
>
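
To make Francisco’s whitespace point concrete: json permits literal newlines between tokens, so a perfectly valid pretty-printed value like

{
  "timestamp": "2014-12-15T19:17:32.505Z"
}

spans multiple input lines, and COPY’s text format reads each newline as a record separator. An escape-free passthrough can only ever work for one-object-per-line data like the exports described above.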
