From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | John Naylor <john(dot)naylor(at)enterprisedb(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Force lookahead in COPY FROM parsing |
Date: | 2021-04-01 20:47:37 |
Message-ID: | bf73c57e-b2e3-9df6-4412-6ca552bc64a6@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 18/03/2021 17:09, John Naylor wrote:
> The cfbot thinks this patch no longer applies, but it works for me, so
> still set to RFC. It looks like it's because the thread to remove the v2
> FE/BE protocol was still attached to the commitfest entry. I've deleted
> that, so let's see if that helps.
It was now truly bitrotted by commit f82de5c46b (the encoding conversion
changes). Rebase attached.
> To answer the side question of whether it makes any difference in
> performance, I used the blackhole AM [1] to isolate the copy code path
> as much as possible. Forcing lookahead seems to not make a noticeable
> difference (min of 5 runs):
>
> master:
> 306ms
>
> force lookahead:
> 304ms
>
> I forgot to mention the small detail of what the test was:
>
> create extension blackhole_am;
> create table blackhole_tab (a text) using blackhole_am ;
> copy blackhole_tab from program 'for i in {1..100}; do cat /path/to/UTF-8\ Sampler.htm ; done;' ;
>
> ...where the .htm file is at http://kermitproject.org/utf8.html
Ok, I wouldn't expect to see much difference in that test, it gets
drowned in all the other parsing overhead. I tested this now with this:
copy (select repeat('x', 10000) from generate_series(1, 100000)) to
'/tmp/copydata-x.txt'
create table blackhole_tab (a text);
copy blackhole_tab from '/tmp/copydata-x.txt' where false;
I took the min of 5 runs of that COPY FROM statement:
master:
4107 ms
v3-0001-Simplify-COPY-FROM-parsing-by-forcing-lookahead.patch:
3172 ms
I was actually surprised it was so effective on that test, I expected a
small but noticeable gain. But I'll take it.
Replying to your earlier comments (sorry for the delay):
> Looks good to me. Just a couple minor things:
>
> + * Look at the next character. If we're at EOF, c2 will wind
> + * up as '\0' because of the guaranteed pad of raw_buf.
> */
> - IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
> -
> - /* get next char */
> c = copy_raw_buf[raw_buf_ptr];
>
> The new comment seems copy-pasted from the c2 statements further down.
Fixed.
> - if (raw_buf_ptr >= copy_buf_len || need_data)
> +#define COPY_READ_LINE_LOOKAHEAD 4
> + if (raw_buf_ptr + COPY_READ_LINE_LOOKAHEAD >= copy_buf_len)
>
> Is this #define deliberately put down here rather than at the top of the file?
Yeah, it's only used right here locally. Matter of taste, but I'd prefer
to keep it here.
> - * of the buffer and then we load more data after that. This case occurs only
> - * when a multibyte character crosses a bufferload boundary.
> + * of the buffer and then we load more data after that.
>
> Is the removed comment really invalidated by this patch? I figured it was something not affected until the patch to do the encoding conversion in larger chunks.
Not sure anymore, but this is moot now, since the other patch was committed.
Thanks for the review and the testing!
- Heikki
Attachment | Content-Type | Size |
---|---|---|
v3-0001-Simplify-COPY-FROM-parsing-by-forcing-lookahead.patch | text/x-patch | 8.7 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Thomas Munro | 2021-04-01 21:15:20 | Re: OpenBSD versus semaphores |
Previous Message | Robert Haas | 2021-04-01 20:08:56 | Re: pg_amcheck contrib application |