Re: pg_stat_statements oddity with track = all

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: Sergei Kornilov <sk(at)zsrv(dot)org>, legrand legrand <legrand_legrand(at)hotmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_stat_statements oddity with track = all
Date: 2021-03-16 11:55:45
Message-ID: CABUevEzh_cpce8ptrOeAxakEKbiQHwwGAXYGMpezfEuK2XQ2Yg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 9, 2021 at 3:39 AM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
>
> On Mon, Mar 08, 2021 at 06:03:59PM +0100, Magnus Hagander wrote:
> > On Sun, Mar 7, 2021 at 8:39 AM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
> > >
> > > Yes, I was a bit worried about the possible extra text entry. I kept things
> > > simple until now as the general opinion was that entries existing for both top
> > > level and nested level should be very rare so adding extra code and overhead to
> > > spare a few query texts wouldn't be worth it.
> > >
> > > I think that using a flag might be a bit expensive, as we would have to make
> > > sure that at least one of the possible two entries has it. So if there are 2
> > > entries and the one with the flag is evicted, we would have to transfer the
> > > flag to the other one, and check the existence of the flag when allocating a new
> > > entry. And all of that has to be done holding an exclusive lock on pgss->lock.
> >
> > Yeah, we'd certainly want to minimize things. But what if they both
> > have the flag at that point? Then we wouldn't have to care about
> > eviction? But yes, for new allocations we'd have to look up if the
> > query existed with the other value of the flag, and copy it over in
> > that case.
>
> I think that we might be able to handle that without a flag. The only thing
> that would need to be done is when creating an entry, look for an existing
> entry with the opposite flag, and if there is one, simply use the same
> (query_offset, query_len) info. This doesn't sound that expensive.

That's basically what I was trying to say :)
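
To spell it out, a minimal sketch of that lookup (assuming the patch's
pgssHashKey.toplevel field and the existing pgss_hash / pgssEntry
structures; the helper itself is hypothetical) could be something like:

static bool
find_sibling_qtext(pgssHashKey *key, Size *query_offset, int *query_len)
{
    pgssHashKey sibling_key = *key;
    pgssEntry  *sibling;

    /* Caller must already hold pgss->lock exclusively */
    sibling_key.toplevel = !key->toplevel;

    sibling = (pgssEntry *) hash_search(pgss_hash, &sibling_key,
                                        HASH_FIND, NULL);
    if (sibling == NULL)
        return false;

    /* Reuse the stored text instead of appending a duplicate */
    *query_offset = sibling->query_offset;
    *query_len = sibling->query_len;
    return true;
}

So just one extra hash_search() at entry creation, under a lock we're
already holding at that point.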

> The real pain point will be that the garbage collection phase
> will become way more expensive as it will now have to somehow maintain that
> knowledge, which will require additional lookups for each entry. I'm a bit
> concerned about that, especially with the current heuristic to schedule garbage
> collection. For now, need_gc_qtexts() says that we have to do it if the extent is
> more than 512 bytes * pgss_max. This probably doesn't work well for people using
> ORMs, as they tend to generate gigantic SQL queries.

Right, the cost would be mostly on the GC side. I've never done any
profiling to see how big of a thing that is in systems today -- have
you?
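
(For anyone following along, the scheduling test in question is roughly
the following -- paraphrased from need_gc_qtexts(), not quoted verbatim:)

    /* skip gc until the text file exceeds 512 bytes per possible entry */
    if (extent < 512 * pgss_max)
        return false;

    /* ... and until it looks ~50% bloated vs. the mean query length */
    if (extent < pgss->mean_query_len * pgss_max * 2)
        return false;

    return true;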

> If we implement query text deduplication, should we add another GUC for that
> "512" magic value so that people can minimize the gc overhead if they know they
> have gigantic queries, or simply don't mind a bigger qtext file?
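
If we do end up adding one, it's presumably just one more
DefineCustomIntVariable() call in _PG_init(), something like the below
(the GUC name, bounds and context here are made up for illustration,
and pgss_gc_qtext_threshold would be a new static int):

    DefineCustomIntVariable("pg_stat_statements.gc_qtext_threshold",
                            "Per-entry query text size above which the "
                            "external query file is garbage collected.",
                            NULL,
                            &pgss_gc_qtext_threshold,
                            512,
                            64,
                            INT_MAX,
                            PGC_SIGHUP,
                            GUC_UNIT_BYTE,
                            NULL,
                            NULL,
                            NULL);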
>
> > > Maybe having a new hash table (without the toplevel flag) for those query texts
> > > might be better, or maybe pgss performance is already so terrible when you have
> > > to regularly evict entries that it wouldn't make any real difference.
> > >
> > > Should I try to add some extra code to make sure that we only store the query
> > > text once, or should I document that there might be duplicates, but we expect
> > > that to be very rare?
> >
> > If we expect it to be rare, I think it might be reasonable to just
> > document that. But do we really have a strong argument for it being
> > rare?
>
> I don't think that anyone really had a strong argument, mostly gut
> feeling. Note that pg_stat_kcache already implemented that toplevel flag, so
> if people are using that extension in a recent version they might have some
> figures to show. I'll ping some people that I know are using it.

Great -- data always wins over gut feelings :)

> One good argument would be that gigantic queries generated by ORMs should always
> be executed as top level statements.

Yes, that's true. And it probably holds in the more generic case as
well, that is, the queries that are likely to show up both top-level
and lower-level are more likely to be relatively simple ones. (Except,
for example, during the development of functions/procs, where they're
often executed top level as well for testing etc., but that's not the
most important case to optimize for.)

> I previously tried with the postgres regression tests, which clearly isn't a
> representative workload, and as far as I can see the vast majority of queries
> executed both at top level and nested level are DDL implying recursion (e.g. a
> CREATE TABLE with underlying index creation).

I think the PostgreSQL regression tests are so far from a real world
workload that the input in this case has a value of exactly zero.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
