| From: | Jim Nasby <jim(at)nasby(dot)net> | 
|---|---|
| To: | "Jonathan S(dot) Katz" <jonathan(dot)katz(at)excoventures(dot)com> | 
| Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Karol Trzcionka <karlikt(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org | 
| Subject: | Re: RETURNING syntax for COPY | 
| Date: | 2013-05-08 19:23:26 | 
| Message-ID: | 518AA62E.8080708@nasby.net | 
| Lists: | pgsql-hackers | 
On 5/8/13 12:54 PM, Jonathan S. Katz wrote:
> On May 8, 2013, at 1:16 PM, Tom Lane wrote:
>
>> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
>>> On 08.05.2013 19:44, Tom Lane wrote:
>>>> No there isn't; what you suggest would require FE/BE protocol
>>>> extensions, making it several orders of magnitude more work than the
>>>> other thing.
>>
>>> I'd imagine that the flow would go something like this:
>>
>>> BE	FE
>>
>>> CopyInResponse
>>> 	CopyData
>>> 	CopyData
>>> 	...
>>> 	CopyDone
>>> RowDescription
>>> DataRow
>>> DataRow
>>> CommandComplete
>>
>> That would require the backend to buffer the entire query response,
>> which isn't a great idea.  I would expect that such an operation would
>> need to interleave CopyData to the backend with DataRow responses.  Such
>> a thing could possibly be built on COPY_BOTH mode, but it would be a lot
>> of work (at both ends) for extremely debatable value.
>>
>> The general idea of COPY is to load data as fast as possible, so weighing
>> it down with processing options seems like a pretty dubious idea even if
>> the implementation were easy.
>
> There are cases where I indeed want to load data very quickly, but then I want to perform an operation on it immediately afterward, e.g. removing bad data that was just added by that COPY.  For instance, I do have this scenario:
>
> WITH new_data AS (
> 	COPY FROM ...
> 	RETURNING id, field_to_check
> )
> DELETE FROM table
> USING new_data
> WHERE
> 	table.id = new_data.id AND
> 	new_data.field_to_check ~* 'bad data';
>
> Now I can take care of all of that in one step, and I know I'm only removing rows I just added.  This comes up when I am importing external files from other sources where I may not want all of the rows, or where some of the rows contain bad data.
>
> This also presumes that COPY works in a CTE, which I'm not sure it does (and I will do the TIAS test after I hit send on this message).
What you're really asking for here is some kind of stream processing capability. There are spin-offs of Postgres that provide that capability (I know Neil Conway worked on some). Those are geared completely around stream processing, but I think it would be extremely interesting to provide some minimal support for that in community Postgres.
Using your use case as an example, something like this would be very interesting:
COPY table FROM ...
   WHERE field_to_check !~* 'bad data'
;
In this case we're just applying a simple WHERE clause against each incoming row.
Perhaps what I'm suggesting could be implemented with a CTE, but I'm not sure it makes sense to do it the way you propose, at least not initially. A CTE would provide so much flexibility that it would be difficult for the optimizer to handle efficiently. Something like a WHERE clause directly on COPY would be a lot easier to deal with. As someone mentioned, an FDW might be another option there.
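In the meantime, you can get the same effect with a staging table (a sketch only; `staging`, `target`, and `field_to_check` are made-up names standing in for your real schema):

```sql
BEGIN;
-- Load into a temp table first, with no per-row filtering...
CREATE TEMP TABLE staging (LIKE target INCLUDING DEFAULTS);
COPY staging FROM '/path/to/file';
-- ...then apply the WHERE while moving the rows into the real table.
INSERT INTO target
SELECT * FROM staging
WHERE field_to_check !~* 'bad data';
COMMIT;
```

Of course, that's two passes over the data instead of one, which is exactly the overhead a WHERE clause on COPY itself would avoid.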
-- 
Jim C. Nasby, Data Architect                       jim(at)nasby(dot)net
512.569.9461 (cell)                         http://jim.nasby.net