Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joel Jacobson <joel(at)compiler(dot)org>
Cc: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, Isaac Morland <isaac(dot)morland(at)gmail(dot)com>, Chapman Flack <chap(at)anastigmatix(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'
Date: 2021-05-05 19:22:17
Message-ID: edcb24cf-e38f-bd08-2807-2a98242f95a0@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 5/5/21 2:45 PM, Tom Lane wrote:
> "Joel Jacobson" <joel(at)compiler(dot)org> writes:
>> I think you misunderstood the problem.
>> I don't want the entire file to be considered a single value.
>> I want each line to become its own row, just a row with a single column.
>> So I actually think COPY seems like a perfect match for the job,
>> since it does precisely that, except there is no delimiter in this case.
> Well, there's more to it than just the column delimiter.
>
> * What about \N being converted to NULL?
> * What about \. being treated as EOF?
> * Do you want to turn off the special behavior of backslash (ESCAPE)
> altogether?
> * What about newline conversions (\r\n being seen as just \n, etc)?
>
> I'm inclined to think that "use pg_read_file and then split at newlines"
> might be a saner answer than delving into all these fine points.
> Not least because people yell when you add cycles to the COPY
> inner loops.

+1

Also we have generally been resistant to supporting odd formats. FDWs
can help here (e.g. file_text_array), but they can't use STDIN IIRC.

>
>> I'm currently using the pg_read_file()-hack in a project,
>> and even though it can read files up to 1GB,
>> using e.g. regexp_split_to_table() to split on E'\n'
>> seems to need 4x as much memory, so it only
>> works with files less than ~256MB.
> Yeah, that's because of the conversion to "chr". But a regexp
> is overkill for that anyway. Don't we have something that will
> split on simple substring matches?
>
>

Not that I know of. There is split_part but I don't think that's fit for
purpose here. Do we need one, or have I missed something?

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2021-05-05 19:26:53 Re: MaxOffsetNumber for Table AMs
Previous Message Jeff Davis 2021-05-05 19:09:17 Re: MaxOffsetNumber for Table AMs