Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

From: Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)
Date: 2023-01-05 10:57:00
Message-ID: CAK-MWwRxnjLTTTkM_muvBrHMoYYbT5xYFm=5cML9Q+SfRAwkXQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Sat, Dec 31, 2022 at 2:26 PM Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com> wrote:

> Hi,
>
> When performing post-mortem analysis of some short latency spikes on a
> heavily loaded database, I found that the reason for (less than 10 second
> latency spike) wasn't on the EXECUTE stage but on the BIND stage.
> At the same time graphical monitoring shows that during this few second
> period there were some queries waiting in the BIND stage.
>
> Logging setup:
> log_min_duration_statement=200ms
> log_lock_waits=on
> deadlock_timeout=100ms
> So I expected that every lock waiting over 100ms (>deadlock_timeout)
> should be in the log.
> But in the log I see only spikes on slow BIND but not lock waits logged.
> (
> grep BIND /var/log/postgresql/postgresql-2022-12-29.log | grep 'duration'
> | perl -pe 's/^(2022-12-29 \d\d:\d\d:\d).*$/$1/' | sort | uniq -c | less
> ...
> 9 2022-12-29 00:12:5
> 2 2022-12-29 00:13:1
> 3 2022-12-29 00:13:5
> !!! 68 2022-12-29 00:14:0
> 5 2022-12-29 00:14:1
> 3 2022-12-29 00:14:2
> 2 2022-12-29 00:14:3
> ).
> But no lock waits on the BIND stage logged during the problem period (and
> no lock waits in general).
> Similar issues happen a few times per day without any visible pattern (but
> on the same tables usually).
> No CPU or IO load/latency spikes found during problem periods.
> No EXECUTE slowdown found in the log during that time.
>

Followup research of this issue lead me to following results:
Every logged spike of BIND/PARSE response time correlated with
corresponding backend waiting on
wait_event_type = LWLock
wait_event = pg_stat_statements
and all of these spikes happen during increment of
pg_stat_statements_info.dealloc counter.

Some searching about this issue lead me to following blog post about
similar issue:
https://yhuelf.github.io/2021/09/30/pg_stat_statements_bottleneck.html

However, we already have pg_stat_statements.max=10000 so further increase
of this parameter
seems counterproductive (the size of
14/main/pg_stat_tmp/pgss_query_texts.stat is already over 20MB).

Open questions remains:
1)Is it expected behaviour of pg_stat_statements to block every BIND/PARSE
during deallocation of least used entries for the whole period of cleanup?

2)Any recommended workaround for this issue for systems with strict latency
SLA
(block every database query (used extended query protocol) for 200-500ms
50+ times per day at random time - isn't acceptable for our case
unfortunately)?

3)Why only BIND/PARSE locks but not EXECUTE?
(may be some difference in implementation of plan vs exec
pg_stat_statements counters?).

Kind Regards,
Maxim

--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message MichaelDBA 2023-01-05 11:31:46 Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)
Previous Message Maxim Boguk 2023-01-01 17:06:49 Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)