Quick Links

Re: should INSERT SELECT use a BulkInsertState?

From:	Justin Pryzby <pryzby(at)telsasoft(dot)com>
To:	Luc Vlaming <luc(at)swarm64(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject:	Re: should INSERT SELECT use a BulkInsertState?
Date:	2020-11-24 02:00:20
Message-ID:	20201124020020.GK24052@telsasoft.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, Nov 02, 2020 at 12:45:51PM -0600, Justin Pryzby wrote:
> On Mon, Nov 02, 2020 at 07:53:45AM +0100, Luc Vlaming wrote:
> > On 30.10.20 05:51, Justin Pryzby wrote:
> > > On Thu, Oct 22, 2020 at 01:29:53PM +0100, Simon Riggs wrote:
> > > > On Fri, 16 Oct 2020 at 22:05, Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > >
> > > > > > > I made this conditional on BEGIN BULK/SET bulk, so I'll solicit comments on that.
> > > >
> > > > I think it would be better if this was self-tuning. So that we don't
> > > > allocate a bulkinsert state until we've done say 100 (?) rows
> > > > inserted.
> > >
> > > I made it an optional, non-default behavior in response to the legitimate
> > > concern for performance regression for the cases where a loader needs to be as
> > > fast as possible - as compared with our case, where we want instead to optimize
> > > for our reports by making the loaders responsible for their own writes, rather
> > > than leaving behind many dirty pages, and clobbering the cache, too.
> > >
> > > Also, INSERT SELECT doesn't immediately help us (telsasoft), since we use
> > > INSERT .. VALUES () .. ON CONFLICT. This would handle that case, which is
> > > great, even though that wasn't a design goal. It could also be an integer GUC
> > > to allow configuring the size of the ring buffer.
> > >
> > > > You should also use table_multi_insert() since that will give further
> > > > performance gains by reducing block access overheads. Switching from
> > > > single row to multi-row should also only happen once we've loaded a
> > > > few rows, so we don't introduce overahads for smaller SQL statements.
> > >
> > > Good idea...multi_insert (which reduces the overhead of individual inserts) is
> > > mostly independent from BulkInsert state (which uses a ring-buffer to avoid
> > > dirtying the cache). I made this 0002.
> > >
> > > This makes INSERT SELECT several times faster, and not clobber the cache too.

- Rebased on Heikki's copy.c split;
- Rename structures without "Copy" prefix;
- Move MultiInsert* from copyfrom.c to (tentatively) nodeModifyTable.h;
- Move cur_lineno and transition_capture into MultiInsertInfo;

This switches to multi insert after a configurable number of tuples.
If set to -1, that provides the historic behavior that bulk inserts
can leave behind many dirty buffers. Perhaps that should be the default.

I guess this shouldn't be in copy.h or in commands/* at all.
It'll be included by both: commands/copyfrom_internal.h and
executor/nodeModifyTable.h. Maybe it should go in util or lib...
I don't know how to do it without including executor.h, which seems
to be undesirable.

--
Justin

Attachment	Content-Type	Size
v6-0001-Allow-INSERT-SELECT-to-use-a-BulkInsertState.patch	text/x-diff	9.0 KB
v6-0002-Make-INSERT-SELECT-use-multi_insert.patch	text/x-diff	41.0 KB
v6-0003-Dynamically-switch-to-multi-insert-mode.patch	text/x-diff	7.2 KB

In response to

Re: should INSERT SELECT use a BulkInsertState? at 2020-11-02 18:45:51 from Justin Pryzby

Responses

Re: should INSERT SELECT use a BulkInsertState? at 2020-11-30 01:11:46 from Justin Pryzby

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Alvaro Herrera	2020-11-24 02:03:32	Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)
Previous Message	Michael Paquier	2020-11-24 01:57:40	Re: abstract Unix-domain sockets