From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: copy vs. C function
Date: 2011-12-13 14:29:52
Message-ID: CAKuK5J0EzKnvAfNk=JKd2Mw3q25fguHKYwTQCkYvxPYA+TDKPw@mail.gmail.com
Lists: pgsql-performance
On Mon, Dec 12, 2011 at 10:38 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Sat, Dec 10, 2011 at 7:27 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
>> I was experimenting with a few different methods of taking a line of
>> text, parsing it, into a set of fields, and then getting that info
>> into a table.
>>
>> The first method involved writing a C program to parse a file, parse
>> the lines and output newly-formatted lines in a format that
>> postgresql's COPY function can use.
>> End-to-end, this takes 15 seconds for about 250MB (read 250MB, parse,
>> output new data to new file -- 4 seconds, COPY new file -- 10
>> seconds).
>>
>> The next approach I took was to write a C function in postgresql to
>> parse a single TEXT datum into an array of C strings, and then use
>> BuildTupleFromCStrings. There are 8 columns involved.
>> Eliding the time it takes to COPY the (raw) file into a temporary
>> table, this method took 120 seconds, give or take.
>>
>> The difference was /quite/ a surprise to me. What is the probability
>> that I am doing something very, very wrong?
>>
>> NOTE: the code that does the parsing is actually the same,
>> line-for-line, the only difference is whether the routine is called by
>> a postgresql function or by a C program via main, so obviously the
>> overhead is elsewhere.
>> NOTE #2: We are talking about approximately 2.6 million lines.
>
>
> Let me throw out an interesting third method I've been using to parse
> delimited text files that might be useful in your case. This is
> useful when parsing bad CSV: values that are not escaped,
> incomplete and/or missing records, or a huge number of columns
> that you want to rotate into a more normalized
> structure based on column position.
>
> 1. Import the data into a single column (containing the entire line)
> staging table, feeding the COPY parser a bogus delimiter
> 2. 'Parse' the record with regexp_split_to_array (maybe in plpgsql function).
> 3. Either loop the array (in 9.1, use the FOREACH ... IN ARRAY construct), or, if
> you can work it into your problem, INSERT/SELECT, expanding the array
> with a trick like used in information_schema._pg_expandarray so you
> can hook logic on the array (column position).
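In rough SQL, the three quoted steps might look like this (the file path, table, and column names are made up for illustration; a real file would have more columns):

```sql
-- 1. load each raw line into a one-column staging table, giving the
--    COPY parser a delimiter that never appears in the data
CREATE TEMP TABLE staging (line text);
COPY staging FROM '/tmp/input.txt' WITH (DELIMITER E'\x01');

-- 2 + 3. split each line into an array and expand it by column position
INSERT INTO target (col1, col2)
SELECT parts[1], parts[2]
FROM (SELECT regexp_split_to_array(line, E'\t') AS parts
      FROM staging) s;
```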
If you replace [2] with my C function (which can process all of the
data, *postgresql overhead not included*, in about 1 second) then
that's what I did. It returns a composite type, making [3] unnecessary.
Since I know the parsing itself isn't the problem, I started with a
time-honored debugging approach: returning early.
Is the function-call overhead that high? That would be outrageously
high. What else could it be? Is returning a composite type outrageously
expensive?
So here is what I did: I modified the code so that it immediately returns NULL.
Result: 2 seconds.
Extract arguments, allocate temporary work buffer: another 0.5 seconds.
Add parsing: another 1.5 seconds [total: 4.1 seconds]
and so on...
Two of the items require base conversion: calling strtol (twice) and
snprintf (twice) adds ~6 seconds, and formatting one of the items as
an array (a strcpy and a strcat) adds another 1.5 seconds, for a
total of 11.5.
The only things I have left are these calls:
get_call_result_type
TupleDescGetAttInMetadata
BuildTupleFromCStrings
HeapTupleGetDatum
and finally PG_RETURN_DATUM
It turns out that:
get_call_result_type adds 43 seconds [total: 54],
TupleDescGetAttInMetadata adds 19 seconds [total: 73],
BuildTupleFromCStrings accounts for 43 seconds [total: 116].
So those three functions account for 90% of the total time spent.
What alternatives exist? Do I have to call get_call_result_type /every
time/ through the function?
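For reference, the usual workaround in the backend and contrib code is to do the type-lookup work once per query rather than once per row, caching the AttInMetadata in fcinfo->flinfo->fn_extra (allocated in fn_mcxt so it survives across per-row calls). A rough sketch, with the function name and the elided parsing as placeholders rather than the actual code from this thread:

```c
#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"

PG_FUNCTION_INFO_V1(parse_line);

Datum
parse_line(PG_FUNCTION_ARGS)
{
    AttInMetadata *attinmeta = fcinfo->flinfo->fn_extra;
    char      *values[8];
    HeapTuple  tuple;

    if (attinmeta == NULL)
    {
        /* First call in this query: do the expensive lookups once
         * and stash the result in fn_extra, allocated in fn_mcxt so
         * it outlives the per-call memory context. */
        MemoryContext oldcxt;
        TupleDesc     tupdesc;

        if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
            elog(ERROR, "function must return a composite type");

        oldcxt = MemoryContextSwitchTo(fcinfo->flinfo->fn_mcxt);
        attinmeta = TupleDescGetAttInMetadata(tupdesc);
        MemoryContextSwitchTo(oldcxt);

        fcinfo->flinfo->fn_extra = attinmeta;
    }

    /* ... parse the input text into values[0..7] here ... */

    tuple = BuildTupleFromCStrings(attinmeta, values);
    PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
}
```

With this shape, get_call_result_type and TupleDescGetAttInMetadata run only on the first call, leaving only BuildTupleFromCStrings in the per-row path.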
--
Jon