From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Gripes about walsender command processing |
Date: | 2020-09-13 19:47:51 |
Message-ID: | 880181.1600026471@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
While trying to understand a recent buildfarm failure [1], I got annoyed
by the fact that a walsender exposes its last SQL command in
pg_stat_activity even when that was a long time ago and what it's been
doing since then is replication commands. This leads to *totally*
misleading reports, for instance in [1] we read
2020-09-13 19:10:09.632 CEST [5f5e526d.242415:4] LOG: server process (PID 2368853) was terminated by signal 11: Segmentation fault
2020-09-13 19:10:09.632 CEST [5f5e526d.242415:5] DETAIL: Failed process was running: SELECT pg_catalog.set_config('search_path', '', false);
even though the process's own log messages paint a much better picture of
reality:
2020-09-13 19:10:07.302 CEST [5f5e526f.242555:1] LOG: connection received: host=[local]
2020-09-13 19:10:07.303 CEST [5f5e526f.242555:2] LOG: replication connection authorized: user=bf application_name=sub2
2020-09-13 19:10:07.303 CEST [5f5e526f.242555:3] LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 19:10:07.334 CEST [5f5e526f.242555:4] LOG: received replication command: IDENTIFY_SYSTEM
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:5] LOG: received replication command: START_REPLICATION SLOT "sub2" LOGICAL 0/0 (proto_version '2', publication_names '"pub2"')
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:6] LOG: starting logical decoding for slot "sub2"
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:7] DETAIL: Streaming transactions committing after 0/159EB38, reading WAL from 0/159EB00.
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:8] LOG: logical decoding found consistent point at 0/159EB00
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:9] DETAIL: There are no running transactions.
I think that walsenders ought to report their current replication command
as though it were a SQL command. They oughta set debug_query_string, too.
While trying to fix this, I also observed that exec_replication_command
fails to clean up the temp context it made for parsing the command string,
if that turns out to be a SQL command. This very accidentally fails to
lead to a long-term memory leak, because that context will be a child of
MessageContext so it'll be cleaned out in the next iteration of
PostgresMain's main loop. But it's still unacceptably sloppy coding
IMHO, and it's closely related to a lot of other randomness in the
function; it'd be better to have a separate early-exit path for SQL
commands.
Attached find a proposed fix for these things.
What I'd really like to do is adjust pgstat_report_activity so that
callers are *required* to provide some kind of command string when
transitioning into STATE_RUNNING state, ie something like
Assert(state == STATE_RUNNING ? cmd_str != NULL : cmd_str == NULL);
However, before we could do that, we'd have to clean up the rat's nest
of seemingly-inserted-with-the-aid-of-a-dartboard pgstat_report_activity
calls in replication/logical/worker.c. I couldn't make heads or tails
of what is going on there, so I did not try.
BTW, while I didn't change it here, isn't the second
SnapBuildClearExportedSnapshot call in exec_replication_command just
useless? We already did that before parsing the command.
regards, tom lane
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03
Attachment | Content-Type | Size |
---|---|---|
clean-up-exec_replication_command.patch | text/x-diff | 3.5 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Brar Piening | 2020-09-13 20:25:01 | Minor documentation error regarding streaming replication protocol |
Previous Message | David G. Johnston | 2020-09-13 19:36:26 | Re: Probable documentation errors or improvements |