Re: PATCH: Batch/pipelining support for libpq

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Shay Rojansky <roji(at)roji(dot)org>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Manuel Kniep <m(dot)kniep(at)web(dot)de>, "fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: PATCH: Batch/pipelining support for libpq
Date: 2016-10-14 05:02:07
Message-ID: CAMsr+YEgnJ8ZAWPLx5=BCbYYq9SNTdwbwvUcb7V-vYm5d5uhbQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12 October 2016 at 19:51, Shay Rojansky <roji(at)roji(dot)org> wrote:
> Hi all. I thought I'd share some experience from Npgsql regarding
> batching/pipelining - hope this isn't off-topic.

Not at all.

> Npgsql has supported batching for quite a while, similar to what this patch
> proposes - with a single Sync message is sent at the end.

PgJDBC does too. The benefits of it there are what prompted me to do
this, not only so libpq users can use it directly but so psqlODBC,
psycopg2, etc can benefit from it if they choose to expose it.

> It has recently come to my attention that this implementation is problematic
> because it forces the batch to occur within a transaction; in other words,
> there's no option for a non-transactional batch.

That's not strictly the case. If you explicitly BEGIN and COMMIT,
those operations are honoured within the batch.

What's missing is implicit transactions. Usually if you send a series
of messages they will each get their own implicit transaction. If you
batch them, the whole lot gets an implicit transaction. This is
because the PostgreSQL v3 protocol conflates transaction delineation
with protocol error recovery into a single Sync message type.

It's very similar to the behaviour of multi-statements, where:

psql -c "CREATE TABLE fred(x integer); DROP TABLE notexists;"

doesn't create "fred", but

psql -c "BEGIN; CREATE TABLE fred(x integer); COMMIT; BEGIN; DROP
TABLE notexists; COMMMIT;"

... does. And in fact it's for almost the same reason. They're sent as
a single SimpleQuery message by psql and split up client side, but the
effect is the same as two separate Query messages followed by a Sync.

It isn't simple to manage this client-side, because libpq doesn't know
whether a given command string may contain transaction delimiting
statements or not and can't reliably look for them without client-side
parsing of the SQL. So it can't dispatch its own BEGIN/COMMIT around
statements in a batch that it thinks might be intended to be
autocommit, and anyway that'd result in sending 3 queries for every 1
client query, which would suck.

If the mythical v4 protocol ever happens I'd want to split Sync into
two messages, one which is a protocol synchronisation message and
another that is a transaction delimiter. Or give it flags or whatever.

In the mean time:

> This can be a problem for
> several reasons: users may want to sent off a batch of inserts, not caring
> whether one of them fails (e.g. because of a unique constraint violation).
> In other words, in some scenarios it may be appropriate for later batched
> statements to be executed when an earlier batched statement raised an error.
> If Sync is only sent at the very end, this isn't possible.

Right, and that remains the case even with explicit transaction
delineation, because the first ERROR causes processing of all
subsequent messages to be skipped.

The design I have in libpq allows for this by allowing the client to
delimit batches without ending batch mode, concurrently consuming a
stream of multiple batches. Each endbatch is a Sync. So a client that
wants autocommit-like behavour can send a series of 1-query batches.

I think I'll need to document this a bit better since it's more subtle
than I properly explained.

> Another example
> of a problem (which actually happened) is that transactions acquire
> row-level locks, and so may trigger deadlocks if two different batches
> update the same rows in reverse order. Both of these issues wouldn't occur
> if the batch weren't implicitly batched.

Same solution as above.

> My current plan is to modify the batch implementation based on whether we're
> in an (explicit) transaction or not. If we're in a transaction, then it
> makes perfect sense to send a single Sync at the end as is being proposed
> here - any failure would cause the transaction to fail anyway, so skipping
> all subsequent statements until the batch's end makes sense. However, if
> we're not in an explicit transaction, I plan to insert a Sync message after
> each individual Execute, making non-transactional batched statements more or
> less identical in behavior to non-transactional unbatched statements. Note
> that this mean that a batch can generate multiple errors, not just one.

Yes, that's what I suggest, and basically what the libpq batch
interface does, though it expects the client to deal with the
transaction boundaries.

You will need to think hard about transaction boundaries as they
relate to multi-statements unless nPgSQL parses out each statement
from multi-statement strings like PgJDBC does. Otherwise a user can
sneak in:

somestatement; BEGIN; someotherstatement;

or

somestatement; CoMMiT; otherstatement;

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2016-10-14 05:12:26 Re: Change of extension name to new name
Previous Message Peter Eisentraut 2016-10-14 04:38:06 Re: Renaming of pg_xlog and pg_clog