Re: Improve handling of pg_stat_statements handling of bind "IN" variables

From: Pavel Trukhanov <pavel(dot)trukhanov(at)gmail(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve handling of pg_stat_statements handling of bind "IN" variables
Date: 2020-08-07 17:42:42
Message-ID: CAF42k=JoQysWd4FKkbUzkFJ117uWh6avW6qVBTsMzW8WztLNPg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hey, let me know if there's any way I can help.

I would argue that making even a small improvement here would be beneficial
to many.

On Tue, Jul 21, 2020 at 11:59 AM Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
wrote:

> > On Thu, Oct 3, 2019 at 3:33 AM Pavel Trukhanov <
> pavel(dot)trukhanov(at)gmail(dot)com> wrote:
> >
> >> On Wed, Jun 26, 2019 at 11:10 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >>
> >> Greg Stark <stark(at)mit(dot)edu> writes:
> >> > Actually thinking about this for two more seconds the question is
> what it
> >> > would do with a query like
> >> > WHERE col = ANY '1,2,3'::integer[]
> >> > Or
> >> > WHERE col = ANY ARRAY[1,2,3]
> >> > Whichever the language binding that is failing to do parameterized
> queries
> >> > is doing to emulate them.
> >>
> >> Yeah, one interesting question is whether this is actually modeling
> >> what happens with real-world applications --- are they sending Consts,
> >> or Params?
> >>
> >> I resolutely dislike the idea of marking arrays that came from IN
> >> differently from other ones; that's just a hack and it's going to give
> >> rise to unexplainable behavioral differences for logically-equivalent
> >> queries.
> >
> > Thanks for your input.
> >
> > As for real-world applications – being a founder of a server monitoring
> saas
> > (okmeter) I have access to stats on hundreds of postgres installations.
> >
> > It shows that IN with a variable number of params is ~7 times more used
> than
> > ANY(array).
>
> Hi,
>
> I would like to do some archaeology and inquire about this thread, since
> unfortunately there was no patch presented as far as I see.
>
> IIUC the ideas suggested in this thread are evolving mostly about modifying
> parser:
>
> > On Fri, Jun 14, 2019 at 2:46 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> > I do not think you need new expression infrastructure. IMO what's going
> on
> > here is that we're indulging in premature optimization in the parser. It
> > would be better from a structural standpoint if the output of parse
> analysis
> > were closer to what the user wrote, and then the business of separating
> Vars
> > from Consts and reducing the Consts to an array were handled in the
> planner's
> > expression preprocessing phase.
> >
> > So maybe what you should be thinking about is a preliminary patch that's
> > mostly in the nature of refactoring, to move that processing to where it
> > should be.
> >
> > Of course, life is never quite that simple; there are at least two
> > issues you'd have to think about.
> >
> > * The parsing phase is responsible for determining the semantics of
> > the query, in particular resolving the data types of the IN-list items
> > and choosing the comparison operators that will be used. The planner
> > is not allowed to rethink that. What I'm not clear about offhand is
> > whether the existing coding in parse analysis might lead to different
> > choices of data types/operators than a more straightforward approach
> > does. If so, we'd have to decide whether that's okay.
> >
> > * Postponing this work might make things slower overall, which wouldn't
> > matter too much for short IN-lists, but you can bet that people who
> > throw ten-thousand-entry IN-lists at us will notice. So you'd need to
> > keep an eye on efficiency and make sure you don't end up repeating
> > similar processing over and over.
>
> This puzzles me, since the original issue sounds like a "representation"
> problem, when we want to calculate jumble hash in a way that obviously
> repeating parameters or constants are hashed into one value. I see the
> point in
> ideas like this:
>
> >> One idea that comes to me after looking at the code involved is that
> >> it might be an improvement across-the-board if transformAExprIn were to
> >> reduce the generated ArrayExpr to an array Const immediately, in cases
> >> where all the inputs are Consts. That is going to happen anyway come
> >> plan time, so it'd have zero impact on semantics or query performance.
> >> Doing it earlier would cost nothing, and could even be a net win, by
> >> reducing per-parse-node overhead in places like the rewriter.
> >>
> >> The advantage for the problem at hand is that a Const that's an array
> >> of 2 elements is going to look the same as a Const that's any other
> >> number of elements so far as pg_stat_statements is concerned.
> >>
> >> This doesn't help of course in cases where the values aren't all
> >> Consts. Since we eliminated Vars already, the main practical case
> >> would be that they're Params, leading us back to the previous
> >> question of whether apps are binding queries with different numbers
> >> of parameter markers in an IN, and how hard pg_stat_statements should
> >> try to fuzz that if they are.
> >>
> >> Then, to Greg's point, there's a question of whether transformArrayExpr
> >> should do likewise, ie try to produce an array Const immediately.
> >> I'm a bit less excited about that, but consistency suggests that
> >> we should have it act the same as the IN case.
>
> Interestingly enough, something similar was already mentioned in [1]. But
> no
> one jumped into this, probably due to its relative complexity, lack of
> personal
> time resources or not clear way to handle Params (I'm actually not sure
> about
> the statistics for Consts vs Params myself and need to check this, but can
> easily imagine both could be an often problem).
>
> Another idea also was mentioned in [1]:
>
> > I wonder whether we could improve this by arranging things so that both
> > Consts and Params contribute zero to the jumble hash, and a list of these
> > things also contributes zero, regardless of the length of the list.
>
> Taking everything into account, is there anything particularly wrong about
> approach of squashing down lists of constants/parameters in
> pg_stat_statements
> itself? This sounds simpler, and judging from my experiments even
> preventing
> jumbling of ArrayExpr and rte values constants of the same type with a
> position
> index above some threshold will already help a lot in many cases that I
> observe.
>
> [1]:
> https://www.postgresql.org/message-id/flat/CAM3SWZSpdPB3uErnXWMt3q74y0r%2B84ZsOt2U3HKKes_V7O%2B0Qg%40mail.gmail.com
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-08-07 18:00:34 Re: Parallel worker hangs while handling errors.
Previous Message Robert Haas 2020-08-07 17:39:21 Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks