Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core
Date: 2014-01-22 04:01:51
Message-ID: CAM3SWZSySkp2z44C_BFthC9rDr10PTF7KjL80nD-Sr=Xv-XjZA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 21, 2014 at 6:22 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> BTW, what is the value of using posix_fadvise() in warm_fcache?
>
> ISTM that if you're worried about waiting for I/O while holding the
> LWLock (as I concur you should be), this coding is entirely inadequate
> for preventing that, as the whole point of posix_fadvise is that it
> won't wait around for the I/O to happen. It strains credulity to
> the breaking point to imagine that the kernel will have gotten
> around to reading the file in the small number of nanoseconds that
> will elapse before you start needing the data. Unless of course the file
> is already in buffers, but then the whole thing was a waste of code.

Not necessarily. Under Linux for example, POSIX_FADV_SEQUENTIAL "sets
the readahead window to the default size for the backing device". That
seems like a useful thing, though I will admit that it's hard to
quantify how useful here. If it is useful, and furthermore if it's
more useful than just reading the file into a temp buffer (which, I
admit, is not clear), then why not do it where we can? Adopting
support for the non-portable posix_fadvise is a sunk cost.

> I think we should flush the posix_fadvise() code path as being neither
> useful nor portable, and just do the actual read() as in the other
> path

Perhaps.

> Actually, I think the whole thing is rather badly designed, as warm
> cache or no you're still proposing to do thousands of kernel calls
> while holding a potentially contended LWLock. I suggest what we
> do is (1) read in the whole file, (2) acquire the lock, (3)
> re-read the whole file in the same buffer, (4) work from the buffer.

fseeko() and fread() are part of the standard library, not any kernel.
I don't believe that what I've done in qtext_fetch() implies thousands
of kernel calls, or thousands of context switches.

Now, using a buffer not managed by libc might be better, but I should
note what holding that LWLock actually represents. As you know, it's a
shared lock that does not block the updating of existing entries. It
will only block the creation of entirely new ones. That ought to be
very rare. You may recall that when I worked with you on
pg_stat_statements in the past, I characterized the rate of growth as
being logarithmic. Once the system has been running for half an hour,
new entries become very infrequent indeed. Admittedly, I may be
failing to consider a worst case there, but in the average case this
works fine. Encouraging automated tools to lazily retrieve query texts
is a big part of why I thought this might be the least worst thing.

It's not obvious to me that always avoiding blocking new entries while
performing physical I/O is important enough to warrant the additional
complexity of copying to a buffer first, and mostly working off that.
Also, you must consider that there may have been a garbage collection
in the interim, so you have to get the shared lock twice (to check the
garbage collection count, to later guard against your buffer having
been invalidated by a garbage collection) instead of just getting the
shared lock once. I discussed this with Pavel, actually. Getting the
shared lock twice on every pg_stat_statements() call doesn't seem very
good either. Especially since you'll have to open the file twice,
*with* the shared lock second time around, when your scheme misses new
entries. Your scheme won't work out under exactly the same
circumstances that mine doesn't do too well, which in when there is a
new entry to consider. When that doesn't happen, holding a shared lock
is not that important.

> It might be possible to optimize the re-read away in the common case
> that the file didn't actually change since we started to read it;
> we'd need to put a bit more reporting into qtext_store probably, so
> that it's possible to tell if any space is reserved but still unwritten.

So, to be clear, you'd still be reading from the file if that proved
necessary? I think you'll have to, because it seems useful that
pg_stat_statements offers a limited form of consistency, in that you
only see entries at a point in time, even if the figures themselves
are not exactly consistent. Maybe I've missed something, but it's not
obvious to me that you're any better off as compared to just using a
FILE/stream.

> BTW shouldn't there be an fflush() in qtext_store?

I don't think so, no. Whenever qtext_store() callers don't fflush()
before releasing their exclusive lock, they FreeFile() before doing so
instead. In the case of pgss_shmem_startup() there is no lock held
anyway, because it doesn't matter, for the same reason it doesn't
matter that we're modifying shmem structures in a way that ordinarily
necessitates an exclusive lock. Why fflush() every qtext_store() call
if there is expected to be more than one, as is the case for several
callers?

FreeFile() calls fclose(). fclose() is described by the standard as
flushing its stream. It is my understanding that calling fflush() on a
stream immediately prior to calling fclose() on the same stream is
always superfluous.

> We could also eliminate some palloc/pfree cycles by using the string
> directly in the buffer (at the small cost of needing to include the
> strings' trailing nulls in the file).

It might be useful to do that anyway, for sanity checking purposes. So
+1 from me.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2014-01-22 04:05:28 Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core
Previous Message Amit Kapila 2014-01-22 03:55:38 Re: ALTER SYSTEM SET typos and fix for temporary file name management