Re: Slowness of extended protocol

From: Shay Rojansky <roji(at)roji(dot)org>
To: Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com>
Cc: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>, sfrost(at)snowman(dot)net, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, stark(at)mit(dot)edu, ishii(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: Slowness of extended protocol
Date: 2016-08-15 12:33:33
Message-ID: CADT4RqBtP0Yzhdhc2MB5YRSP3k4+rT1KX7yNV4rTV4xq3xp-sA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Apologies, I accidentally replied off-list, here's the response I sent.
Vladimir, I suggest you reply to this message with your own response...

On Sat, Aug 13, 2016 at 6:32 PM, Vladimir Sitnikov <
sitnikov(dot)vladimir(at)gmail(dot)com> wrote:

> Shay>To be honest, the mere idea of having an SQL parser inside my driver
> makes me shiver.
>
> Same for me.
> However I cannot wait for PostgreSQL 18 that does not need client-side
> parsing.
>

I'm glad we agree on something. For me the problem has nothing to do with
PostgreSQL or performance - it has to do with database APIs that impose
uniform parameter placeholder formats, and therefore force drivers to
rewrite user SQL queries. AFAIK rewriting a user's SQL for optimization is
totally out of the scope of the driver's work.

Shay>There's nothing your driver is doing that the application developer
> can't do themselves -
> Shay>so your driver isn't faster than other drivers. It's faster only when
> used by lazy programmers.
>
> I'm afraid you do not get the point.
> ORMs like Hibernate, EclipseLink, etc send regular "insert ... values" via
> batch API.
> For the developer the only way to make use of "multivalues" is to
> implement either "ORM fix" or "the driver fix" or "postgresql fix".
>
So the feature has very little to do with laziness of the programmers.
> Application developer just does not have full control of each SQL when
> working though ORM.
> Do you suggest "stop using ORMs"? Do you suggest fixing all the ORMs so it
> uses optimal for each DB insert statement?
> Do you suggest fixing postgresql?
>
> Once again "multivalues rewrite at pgjdbc level" enables the feature
> transparently for all the users. If PostgreSQL 10/11 would improve
> bind/exec performance, we could even drop that rewrite at pgjdbc level and
> revert to the regular flow. That would again be transparent to the
> application.
>
>
I do get the point, and in fact I myself mentioned the ORM case above as an
advantage of implicit query preparation.

First, there's nothing stopping an ORM from optimizing multiple inserts
into a single multivalue insert. I do admit I'm not aware of any who do
this, but it's a good idea for an optimization - I happen to maintain the
Entity Framework Core provider for Npgsql, I might take a look at this
optimization (so again thanks for the idea).

Second, it's well-known that using an ORM almost always implies a
performance sacrifice - it's a tradeoff that's chosen when going with an
ORM. It's great that you optimize multiple inserts, but there are a myriad
of other cases where an ORM generates less efficient SQL that what would be
possible - but I don't think it makes sense for the driver to actually
contain an SQL optimizer. Slightly worse performance isn't in itself a
reason to drop ORMs: it's frequent practice to drop down to raw SQL for
performance-critical operations, etc.

But all that isn't really important - I'm going to repeat what I said
before and it would be good to get some reaction to this. Every software
component in the stack has a role, and maintaining those separations is
what keeps things simple and sane. Just for fun, we could imagine a kernel
network-stack feature which analyzes outgoing messages and optimizes them;
we could even implement your multiple insert -> multivalue insert
optimization there. This would have the advantage of working out-of-the-box
for every driver and every language out there (just like your driver does
provides it transparently for all ORMs) But nobody in their right mind
would think of doing something like this, and for good reason.

The programmer's task is to write SQL, the driver's task is to communicate
that SQL via the database-specific protocol, the kernel's networking
stack's task is to transmit that protocol via TCP, etc. etc. If an ORM is
used, the programmer effectively outsources the task of writing SQL to
another component, which is supposed to do a good job about it. Once you go
about blurring all the lines here, everything becomes more complicated,
brittle and hard to debug.

For what it's worth, I can definitely imagine your kind of optimizations
occurring at some additional layer which the user would choose to use - an
intermediate SQL optimizer between application code (or ORM) and the
driver. This "SQL optimizer" layer would keep the driver itself lean and
simple (for users who *don't* want all the magic), while allowing for
transparent optimizations for ORMs. Or if the magic is implemented at the
driver leve, it should be opt-in, or at least easy to disable entirely.

>

> Shay>are you sure there aren't "hidden" costs on the PostgreSQL side for
> generating so many implicit savepoints?
>
> Technically speaking I use the same savepoint name through bind/exec
> message.
>

Out of curiosity, I whipped up a quick benchmark (voltmeter) of the impact
of adding a savepoint before every command. Each iteration creates a
transaction, sends one Bind/Execute for "SELECT 1" (which was prepared
during setup), then sends one Bind/Execute for "SAVEPOINT x; SELECT 1",
then commits. The baseline simply sends "SELECT 1" twice (two Bind/Execute
roundtrips) in the transaction. Here are the results against localhost (PG
9.5):

Method | Median | StdDev | Scaled | Op/s |
-------- |------------ |---------- |------- |-------- |
With | 268.0387 us | 9.4800 us | 1.06 | 3678.32 |
Without | 252.9685 us | 8.3573 us | 1.00 | 3896.9 |

The benchmark source code is https://gist.github.com/roji/
f5bf39bfc759503392e50585ececcfc0. So the several-microsecond difference
translates to a 6% performance degradation. This is probably due to the
slowness of processing the extra Bind/Execute pair for the savepoint, which
maybe can be improved in PostgreSQL. Regardless, you may say there's no
business case or that SELECT 1 against localhost means nothing, or that
posting data is somehow wrong, but I don't find this very acceptable.

Shay>What you're doing is optimizing developer code, with the assumption
> that developers can't be trusted to code efficiently - they're going to
> write bad SQL and forget to prepare their statements
>
> Please, be careful. "you are completely wrong here" he-he. Well, you list
> the wrong assumption. Why do you think my main assumption is "developers
> can't be trusted"?
>

Because you said so above (Wed, Aug 10, 2016 at 8:37 PM):

> My experience shows, that people are very bad at predicting where the
performance problem would be.
> For 80% (or even 99%) of the cases, they just do not care thinking if a
particular statement should be server-prepared or not.
> They have absolutely no idea how much resources it would take and so on.

Maybe you're even right saying these things, I don't know. But that doesn't
mean I as a driver should solve their problems for them. And I also get
that you have an additional argument here besides programmer
laziness/stupidity - the ORM argument - which makes more sense.

The proper assumption is: I follow Java database API specification, and I
> optimize pgjdbc for the common use case (e.g. ORM or ORM-like).
>

Do you have any evidence that ORM or ORM-like is the common use case?

> For instance, if Java application wants to use bind values (e.g. to
> prevent security issues), then the only way is to go through
> java.sql.PreparedStatement.
>
> Here's the documentation: https://docs.oracle.com/javase/8/docs/api/jav
> a/sql/Connection.html#prepareStatement-java.lang.String-
>
> Here's a quote:
> Javadoc> *Note:* This method is optimized for handling parametric SQL
> statements that benefit from precompilation. If the driver supports
> precompilation, the methodprepareStatement will send the statement to the
> database for precompilation. Some drivers may not support precompilation.
> In this case, the statement may not be sent to the database until the
> PreparedStatement object is executed. This has no direct effect on users;
> however, it does affect which methods throw certainSQLException objects.
>
> The most important part is "if the driver supports precompilation..."
> There's no API to enable/disable precompilation at all.
> So, when using Java, there is no such thing as
> "statement.enableServerPrepare=true".
>
> It is expected, that "driver" would "optimize" the handling somehow in the
> best possible way.
>

What? I really didn't understand your point here. All the doc is saying is
that if the driver doesn't support prepared statements, then using them
wouldn't provide any benefit etc. The very fact that the Java API *has* a
PreparedStatement class means that statement preparation is exposed to the
developer - the API provides methods to the developer, allowing them to
choose *themselves* what to prepare. You chose to implicitly prepare
statements even if PreparedStatement *isn't* used, you really can't claim
that this is expected behavior (and indeed, most drivers I know don't do
it).

> It is Java API specification that enables me (as a driver developer) to be
> flexible, and leverage database features so end user gets best experience.
>

In my opinion, the API neither enables nor prohibits what you're doing. But
again, the very fact that the API includes PreparedStatement shows that the
JDBC considers preparation a programmer concern/decision and not an
implicit driver decision.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2016-08-15 12:42:51 Re: WIP: Barriers
Previous Message Robert Haas 2016-08-15 12:32:48 Re: [parallel query] random server crash while running tpc-h query on power2