From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: copy vs. C function
Date: 2011-12-13 14:29:52
Message-ID: CAKuK5J0EzKnvAfNk=JKd2Mw3q25fguHKYwTQCkYvxPYA+TDKPw@mail.gmail.com
Lists: pgsql-performance
On Mon, Dec 12, 2011 at 10:38 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Sat, Dec 10, 2011 at 7:27 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
>> I was experimenting with a few different methods of taking a line of
>> text, parsing it, into a set of fields, and then getting that info
>> into a table.
>>
>> The first method involved writing a C program to parse a file, parse
>> the lines and output newly-formatted lines in a format that
>> postgresql's COPY function can use.
>> End-to-end, this takes 15 seconds for about 250MB (read 250MB, parse,
>> output new data to new file -- 4 seconds, COPY new file -- 10
>> seconds).
>>
>> The next approach I took was to write a C function in postgresql to
>> parse a single TEXT datum into an array of C strings, and then use
>> BuildTupleFromCStrings. There are 8 columns involved.
>> Eliding the time it takes to COPY the (raw) file into a temporary
>> table, this method took 120 seconds, give or take.
>>
>> The difference was /quite/ a surprise to me. What is the probability
>> that I am doing something very, very wrong?
>>
>> NOTE: the code that does the parsing is actually the same,
>> line-for-line, the only difference is whether the routine is called by
>> a postgresql function or by a C program via main, so obviously the
>> overhead is elsewhere.
>> NOTE #2: We are talking about approximately 2.6 million lines.
>
>
> Let me throw out an interesting third method I've been using to parse
> delimited text files that might be useful in your case. This is
> useful when parsing bad CSV: values that are not escaped,
> incomplete and/or missing records, or a huge number of columns
> that you want to rotate into a more normalized
> structure based on column position.
>
> 1. Import the data into a single column (containing the entire line)
> staging table, feeding the COPY parser a bogus delimiter
> 2. 'Parse' the record with regexp_split_to_array (maybe in plpgsql function).
> 3. Either loop the array (in 9.1, use the FOREACH ... IN ARRAY construct), or, if
> you can work it into your problem, INSERT/SELECT, expanding the array
> with a trick like used in information_schema._pg_expandarray so you
> can hook logic on the array (column position).
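In rough SQL, the three quoted steps might look like this (the file path, table, and column names are made up for illustration; a real file would have more columns):

```sql
-- 1. load each raw line into a one-column staging table, giving the
--    COPY parser a delimiter that never appears in the data
CREATE TEMP TABLE staging (line text);
COPY staging FROM '/tmp/input.txt' WITH (DELIMITER E'\x01');

-- 2 + 3. split each line into an array and expand it by column position
INSERT INTO target (col1, col2)
SELECT parts[1], parts[2]
FROM (SELECT regexp_split_to_array(line, E'\t') AS parts
      FROM staging) s;
```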
If you replace [2] with my C function (which can process all of the
data, *postgresql overhead not included*, in about 1 second) then
that's what I did. It returns a composite type, making [3] unnecessary.
Since I know the parsing itself isn't the problem, I started with a
time-honored debugging approach: returning early.
Is the function-call overhead that high? That would be outrageously
high. What else could it be? Is returning a composite type outrageously
expensive?
So here is what I did: I modified the code so that it immediately returns NULL.
Result: 2 seconds.
Extract arguments, allocate temporary work buffer: another 0.5 seconds.
Add parsing: another 1.5 seconds [total: 4.1 seconds]
and so on...
Two of the items require base conversion: calling strtol (twice) and
snprintf (twice) adds ~6 seconds, and formatting one of the items as
an array (a strcpy and a strcat) adds another 1.5 seconds, for a
total of 11.5.
The only things I have left are these calls:
get_call_result_type
TupleDescGetAttInMetadata
BuildTupleFromCStrings
HeapTupleGetDatum
and finally PG_RETURN_DATUM
It turns out that:
get_call_result_type adds 43 seconds [total: 54],
TupleDescGetAttInMetadata adds 19 seconds [total: 73],
BuildTupleFromCStrings accounts for 43 seconds [total: 116].
So those three functions account for 90% of the total time spent.
What alternatives exist? Do I have to call get_call_result_type /every
time/ through the function?
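For reference, the usual workaround in the backend and contrib code is to do the type-lookup work once per query rather than once per row, caching the AttInMetadata in fcinfo->flinfo->fn_extra (allocated in fn_mcxt so it survives across per-row calls). A rough sketch, with the function name and the elided parsing as placeholders rather than the actual code from this thread:

```c
#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"

PG_FUNCTION_INFO_V1(parse_line);

Datum
parse_line(PG_FUNCTION_ARGS)
{
    AttInMetadata *attinmeta = fcinfo->flinfo->fn_extra;
    char      *values[8];
    HeapTuple  tuple;

    if (attinmeta == NULL)
    {
        /* First call in this query: do the expensive lookups once
         * and stash the result in fn_extra, allocated in fn_mcxt so
         * it outlives the per-call memory context. */
        MemoryContext oldcxt;
        TupleDesc     tupdesc;

        if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
            elog(ERROR, "function must return a composite type");

        oldcxt = MemoryContextSwitchTo(fcinfo->flinfo->fn_mcxt);
        attinmeta = TupleDescGetAttInMetadata(tupdesc);
        MemoryContextSwitchTo(oldcxt);

        fcinfo->flinfo->fn_extra = attinmeta;
    }

    /* ... parse the input text into values[0..7] here ... */

    tuple = BuildTupleFromCStrings(attinmeta, values);
    PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
}
```

With this shape, get_call_result_type and TupleDescGetAttInMetadata run only on the first call, leaving only BuildTupleFromCStrings in the per-row path.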
--
Jon