Re: shared memory stats: high level design decisions: consistency, dropping

From: Hannu Krosing <hannuk(at)google(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: shared memory stats: high level design decisions: consistency, dropping
Date: 2021-03-20 00:16:31
Message-ID: CAMT0RQQj0EfkRoHLDbZ56C=az0r5VK8Am_S9GeRfLDqkMK1OYg@mail.gmail.com
Lists: pgsql-hackers

> But now we could instead schedule stats to be removed at commit
time. That's not trivial of course, as we'd need to handle cases where
the commit fails after the commit record, but before processing the
dropped stats.

We likely can not remove them at commit time, but only after the
oldest open snapshot has moved past that commit?

Would an approach where we keep stats in a structure logically similar
to the MVCC we use for normal tables be completely unfeasible?

We would only need to keep one version per backend with an open transaction.
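As a toy sketch of what "one version per backend in a transaction"
could look like (purely illustrative names — nothing here is from the
actual patch, and malloc stands in for backend-local memory):

```c
#include <stdlib.h>
#include <assert.h>

typedef struct StatCounters { long numscans; long tuples_returned; } StatCounters;

typedef struct StatEntry {
    StatCounters current;       /* updated by writers at any time */
    StatCounters *snapshot;     /* this backend's frozen copy, or NULL */
} StatEntry;

/* Reader: on first access in a transaction, freeze a private copy;
 * later reads in the same transaction see that copy, not new updates. */
static StatCounters stat_read(StatEntry *e)
{
    if (e->snapshot == NULL) {
        e->snapshot = malloc(sizeof(StatCounters));
        *e->snapshot = e->current;
    }
    return *e->snapshot;
}

/* At transaction end the frozen version is discarded. */
static void stat_end_xact(StatEntry *e)
{
    free(e->snapshot);
    e->snapshot = NULL;
}
```

A writer bumping `current` mid-transaction would not be visible to a
reader that already froze its copy, which is the MVCC-ish property.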

---
Hannu

On Sat, Mar 20, 2021 at 12:51 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> I am working on Kyotaro Horiguchi's shared memory stats patch [1] with
> the goal of getting it into a shape that I'd be happy to commit. That
> thread is quite long and most are probably skipping over new messages in
> it.
>
> There are two high-level design decisions / questions that I think
> warrant a wider audience (I'll keep lower level discussion in the other
> thread).
>
> In case it is not obvious, the goal of the shared memory stats patch is
> to replace the existing statistics collector, to which new stats are
> reported via a UDP socket, and where clients read data from the stats
> collector by reading the entire database's stats from disk.
>
> The replacement is to put the statistics into a shared memory
> segment. Fixed-size stats (e.g. bgwriter, checkpointer, wal activity,
> etc) are stored directly in a struct in shared memory. Stats for objects
> that exist in variable numbers, e.g. tables, are addressed via a dshash
> table that points to stats that are in turn allocated using dsa.h.
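As a rough illustration of that layout (all names here are made-up
stand-ins, not the patch's API — a plain chained hash and calloc stand
in for dshash and dsa.h):

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Fixed-size stats live directly in the shared struct. */
typedef struct FixedStats { long checkpoints; long buffers_written; } FixedStats;

/* Variable-numbered object stats are separately allocated counters. */
typedef struct TableStats { long numscans; long tuples_inserted; } TableStats;

#define NBUCKETS 256
typedef struct HashEntry {
    unsigned oid;
    TableStats *stats;          /* dsa allocation stand-in */
    struct HashEntry *next;
} HashEntry;

typedef struct StatsSegment {
    FixedStats fixed;           /* e.g. bgwriter/checkpointer counters */
    HashEntry *buckets[NBUCKETS]; /* dshash stand-in, keyed by table oid */
} StatsSegment;

/* Look up (optionally creating) the stats for one table by oid. */
static TableStats *lookup_table_stats(StatsSegment *seg, unsigned oid, int create)
{
    HashEntry **slot = &seg->buckets[oid % NBUCKETS];
    for (HashEntry *e = *slot; e != NULL; e = e->next)
        if (e->oid == oid)
            return e->stats;
    if (!create)
        return NULL;
    HashEntry *e = calloc(1, sizeof(HashEntry));
    e->oid = oid;
    e->stats = calloc(1, sizeof(TableStats));
    e->next = *slot;
    *slot = e;
    return e->stats;
}
```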
>
>
> 1) What kind of consistency do we want from the pg_stats_* views?
>
> Right now the first access to stats in a transaction will trigger a read
> of both the global and per-database stats from disk. If the on-disk
> state is too old, we'll ask the stats collector to write out a new file
> a couple times.
>
> For the rest of the transaction that in-memory state is used unless
> pg_stat_clear_snapshot() is called. Which means that multiple reads from
> e.g. pg_stat_user_tables will show the same results as before [2].
>
> That makes stats accesses quite expensive if there are lots of
> objects.
>
> But it also means that separate stats accesses - which happen all the
> time - return something repeatable and kind of consistent.
>
> Now, the stats aren't really consistent in the sense of being accurate
> either: UDP messages can be lost, only some of the stats generated by a
> transaction might have been received so far, other transactions might
> not yet have sent theirs, etc.
>
>
> With the shared memory patch the concept of copying all stats for the
> current database into local memory at the time of the first stats access
> doesn't make sense to me. Horiguchi-san had actually implemented that,
> but I argued that that would be cargo-culting an efficiency hack
> required by the old storage model forward.
>
> The cost of doing this is substantial. On master, with a database that
> contains 1 million empty tables, any stats access takes ~0.4s and
> increases memory usage by 170MB.
>
>
> 1.1)
>
> I hope everybody agrees that we should not require stats to stay the
> way they were at the time of the first stats access in a transaction,
> even if that first access was to a different stats object than the
> currently accessed one?
>
>
> 1.2)
>
> Do we expect repeated accesses to the same stat to stay the same through
> the transaction? The easiest way to implement stats accesses is to
> simply fetch the stats from shared memory ([3]). That would obviously
> result in repeated accesses to the same stat potentially returning
> changing results over time.
>
> I think that's perfectly fine, desirable even, for pg_stat_*.
>
>
> 1.3)
>
> What kind of consistency do we expect between columns of views like
> pg_stat_all_tables?
>
> Several of the stats views aren't based on SRFs or composite-type
> returning functions, but instead fetch each stat separately:
>
> E.g. pg_stat_all_tables:
> SELECT c.oid AS relid,
> n.nspname AS schemaname,
> c.relname,
> pg_stat_get_numscans(c.oid) AS seq_scan,
> pg_stat_get_tuples_returned(c.oid) AS seq_tup_read,
> sum(pg_stat_get_numscans(i.indexrelid))::bigint AS idx_scan,
> sum(pg_stat_get_tuples_fetched(i.indexrelid))::bigint + pg_stat_get_tuples_fetched(c.oid) AS idx_tup_fetch,
> pg_stat_get_tuples_inserted(c.oid) AS n_tup_ins,
> ...
> pg_stat_get_autoanalyze_count(c.oid) AS autoanalyze_count
> FROM pg_class c
> LEFT JOIN pg_index i ON c.oid = i.indrelid
> ...
>
> Which means that if we do not cache stats, additional stats updates
> could have been applied between two stats accesses. E.g. the seq_scan
> from before some pgstat_report_stat() call, but the seq_tup_read from after.
>
> If we instead fetch all of a table's stats in one go, we would get
> consistency between the columns. But obviously that'd require changing
> all the stats views.
>
> Horiguchi-san, in later iterations of the patch, attempted to address
> this issue by adding a one-entry cache below
> pgstat_fetch_stat_tabentry(), pgstat_fetch_stat_dbentry() etc, which is
> what pg_stat_get_numscans(), pg_stat_get_db_tuples_updated() etc use.
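A minimal sketch of such a one-entry cache, with made-up names and a
mock in place of the shared-memory lookup, shows where the surprising
behavior comes from:

```c
#include <assert.h>

typedef struct TabEntry { long numscans; long tuples_inserted; } TabEntry;

/* Mock of the shared-memory side, for illustration only. */
static TabEntry shared_store[2] = {{10, 0}, {5, 0}};
static TabEntry *shared_lookup(unsigned oid) { return &shared_store[oid % 2]; }

static unsigned cached_oid = (unsigned) -1;  /* nothing cached yet */
static TabEntry cached_entry;

/* One-entry cache: repeated fetches for the same object reuse one
 * consistent copy; switching to a different object (and back) silently
 * refreshes it. */
TabEntry *fetch_stat_tabentry(unsigned oid)
{
    if (cached_oid != oid) {
        cached_entry = *shared_lookup(oid);  /* copy on each object switch */
        cached_oid = oid;
    }
    return &cached_entry;
}
```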
>
>
> But I think that leads to very confusing results. Access stats for the
> same relation multiple times in a row? Do not see updates. Switch
> between e.g. a table and its indexes? See updates.
>
>
> I personally think it's fine to have short-term divergences between the
> columns. The stats aren't that accurate anyway, as we don't submit them
> all the time. And if we want consistency between columns, we should
> instead replace the current view definitions with [set of] record
> returning functions - everything else seems to lead to weird tradeoffs.
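The record-returning alternative essentially boils down to copying the
whole counter struct under a single lookup, so all columns of a row are
mutually consistent (illustrative names only, not a proposed API):

```c
#include <string.h>
#include <assert.h>

typedef struct TableCounters {
    long seq_scan;
    long seq_tup_read;
    long n_tup_ins;
} TableCounters;

static TableCounters shared_counters;   /* stand-in for one shared entry */

/* One copy per row instead of one shared-memory fetch per column; an
 * update landing right after the copy can't tear the row apart. */
static TableCounters fetch_all_counters(void)
{
    TableCounters snap;
    memcpy(&snap, &shared_counters, sizeof(snap));
    return snap;
}
```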
>
>
>
> 2) How to remove stats of dropped objects?
>
> In the stats collector world stats for dropped objects (tables, indexes,
> functions, etc) are dropped after the fact, using a pretty expensive
> process:
>
> Each autovacuum worker cycle and each manual VACUUM does
> pgstat_vacuum_stat() to detect since-dropped objects. It does that by
> building hash-tables for all databases, tables and functions, and then
> comparing that against a freshly loaded stats snapshot. All stats
> objects not found in pg_class etc. are dropped.
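In outline (with a linear scan standing in for the hash tables the real
code builds, and plain oid arrays standing in for catalogs and stats):

```c
#include <assert.h>

/* Membership test; pgstat_vacuum_stat() builds hash tables for this. */
static int oid_is_live(const unsigned *live, int nlive, unsigned oid)
{
    for (int i = 0; i < nlive; i++)
        if (live[i] == oid)
            return 1;
    return 0;
}

/* Walk every stats entry and drop the ones whose object no longer
 * exists; compacts in place and returns the number of survivors.
 * Note the cost is O(catalog + stats) on every [auto]vacuum cycle,
 * regardless of how small the vacuumed table was. */
static int drop_dead_stats(unsigned *stats_oids, int nstats,
                           const unsigned *live, int nlive)
{
    int kept = 0;
    for (int i = 0; i < nstats; i++)
        if (oid_is_live(live, nlive, stats_oids[i]))
            stats_oids[kept++] = stats_oids[i];
    return kept;
}
```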
>
> The patch currently copies that approach, although that adds some
> complications, mainly around [3].
>
>
> Accessing all database objects after each VACUUM, even if the table was
> tiny, isn't great, performance wise. In a fresh master database with 1
> million functions, a VACUUM of an empty table takes ~0.5s, with 1
> million tables it's ~1s. Due to partitioning, databases with many
> objects are of course getting more common.
>
>
> There isn't really a better approach in the stats collector world. As
> messages to the stats collector can get lost, we need to be able to
> re-do dropping of dead stats objects.
>
>
> But now we could instead schedule stats to be removed at commit
> time. That's not trivial of course, as we'd need to handle cases where
> the commit fails after the commit record, but before processing the
> dropped stats.
>
> But it seems that integrating the stats that need to be dropped into the
> commit record would make a lot of sense. With benefits beyond the
> [auto-]vacuum efficiency gains, e.g. neatly integrate into streaming
> replication and even opening the door to keep stats across crashes.
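A hypothetical shape for that (names invented here, not from the
patch): queue the OIDs of dropped objects during the transaction, and
only process the queue once the commit record is durable, so a crash in
between can redo the drops from the record (also on a standby):

```c
#include <assert.h>

#define MAX_PENDING 32
static unsigned pending_drops[MAX_PENDING];
static int npending;

/* Called from DDL: remember that this object's stats die at commit. */
static void stat_drop_at_commit(unsigned oid)
{
    pending_drops[npending++] = oid;
}

/* Called after the commit record is flushed (or during replay): hand
 * out the queued OIDs for removal and clear the queue.  Returns the
 * number of drops processed. */
static int process_pending_drops(unsigned *dropped_out)
{
    int n = npending;
    for (int i = 0; i < n; i++)
        dropped_out[i] = pending_drops[i];
    npending = 0;
    return n;
}
```

An abort would instead just discard the queue, since the objects
survive.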
>
>
> My gut feeling here is to try to fix the remaining issues in the
> "collect oids" approach for 14 and to try to change the approach in
> 15. And, if that proves too hard, to see how hard it'd be to
> "accurately" drop. But I'm also not sure - it might be smarter to go
> all in, to avoid introducing a system that we'll just rip out again.
>
>
> Comments?
>
>
> Greetings,
>
> Andres Freund
>
> [1] https://www.postgresql.org/message-id/20180629.173418.190173462.horiguchi.kyotaro%40lab.ntt.co.jp
>
> [2] Except that new tables will show up with lots of 0s
>
> [3] There is a cache to avoid repeated dshash lookups for
> previously-accessed stats, to avoid contention. But that just points
> to the shared memory area with the stats.
>
>
