Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

From: Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)
Date: 2023-01-01 11:34:50
Message-ID: CAK-MWwTRVNoDnjeKh-03cpubS9D1hhY-UofwYQChtLM2Y1EwWg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Sat, Dec 31, 2022 at 4:32 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:

> On Sat, Dec 31, 2022 at 02:26:08PM +0200, Maxim Boguk wrote:
> > Hi,
> >
> > When performing post-mortem analysis of some short latency spikes on a
> > heavily loaded database, I found that the reason for (less than 10 second
> > latency spike) wasn't on the EXECUTE stage but on the BIND stage.
> > At the same time graphical monitoring shows that during this few second
> > period there were some queries waiting in the BIND stage.
> >
> > Logging setup:
> > log_min_duration_statement=200ms
> > log_lock_waits=on
> > deadlock_timeout=100ms
> > So I expected that every lock waiting over 100ms (>deadlock_timeout)
> should
> > be in the log.
> > But in the log I see only spikes on slow BIND but not lock waits logged.
>
> What version postgres? What settings have non-default values ?
> What OS/version? What environment/hardware? VM/image/provider/...
> What are the queries that are running BIND ? What parameter types ?
> Are the slow BINDs failing? Are their paramters being logged ?
> What else is running besides postgres ? Are the DB clients local or
> remote ? It shouldn't matter, but what client library?
>

What version of postgres? - 14.6

What settings have non-default values ? - a lot (it's 48 core Amazon EC2
server with 396GB of RAM)
(e.g. it carefully tuned database for particular workload)

What OS/version? - Ubuntu 20.04LTS

What environment/hardware? - 48 core Amazon EC2 server with 396GB of RAM
and local NVME storage
(i3en.12xlarge)

What are the queries that are running BIND ? - nothing special, e.g.
during problem period a lot completely different queries become stuck in
BIND and PARSE stage but no long duration (>100ms) EXECUTE calls found, in
general it feel that whole BIND/PARSE mechanics lock for short period
==== LOG SAMPLE ==========================
2023-01-01 09:07:31.622 UTC 1848286 ****** from [local] [vxid:84/20886619
txid:0] [PARSE] LOG: duration: 235.472 ms parse <unnamed>: SELECT
COUNT(*) FROM "job_stats_master" WHERE (job_stats_master.created_at >
= '2022-12-31 09:07:31.350000') AND (job_stats_master.created_at <
'2023-01-01 09:07:31.350000') AND "job_stats_master"."employer_id" = ****
AND "job_stats_master"."action" = 2 AND "job_stats_master"."job_board_id" =
**** AND "job_stats_master"."ip_matching_id" = *****
2023-01-01 09:07:31.622 UTC 1898699 ****** from [local] [vxid:158/22054921
txid:0] [BIND] LOG: duration: 231.274 ms bind <unnamed>: SELECT id, name
FROM job_types WHERE id IN ($1)
2023-01-01 09:07:31.622 UTC 1898699 ******* from [local] [vxid:158/22054921
txid:0] [BIND] DETAIL: parameters: $1 = '0'
2023-01-01 09:07:31.622 UTC 1794756 ******* from [local] [vxid:281/10515416
txid:0] [BIND] LOG: duration: 231.024 ms bind <unnamed>: SELECT id, name
FROM job_types WHERE id IN ($1)
2023-01-01 09:07:31.622 UTC 1794756 ******* from [local] [vxid:281/10515416
txid:0] [BIND] DETAIL: parameters: $1 = '0'

... 5 pages of BIND/PARSE of different/unrelated to each other queries
logged with over 100ms runtime

2023-01-01 09:07:31.623 UTC 1806315 ******* from [local] [vxid:231/17406673
txid:0] [BIND] LOG: duration: 139.372 ms bind <unnamed>: SELECT
employers.*, third_party_employer_pixels.facebook_pixel_id AS
facebook_pixel_id, third_party_employer_pixels.google_pixel_id AS
google_pixel_id, third_party_employer_pixels.google_actions AS
google_actions, employer_pixel_configurations.solution AS
tracking_solution, employer_pixel_configurations.domain_name AS
domain, settings.use_multiple_bids FROM employers LEFT JOIN
third_party_employer_pixels ON third_party_employer_pixels.employer_id =
employers.id LEFT JOIN employer_pixel_configurations ON
employer_pixel_configurations.id = employers.id LEFT JOIN settings
ON settings.id = employers.setting_id WHERE employers.id =
$1
2023-01-01 09:07:31.623 UTC 1806315 ******* from [local] [vxid:231/17406673
txid:0] [BIND] DETAIL: parameters: $1 = '*****'
2023-01-01 09:07:31.624 UTC 1806321 ******* from [local] [vxid:176/21846997
txid:0] [BIND] LOG: duration: 120.237 ms bind <unnamed>: SELECT
job_boards.*, enterprises.product_type,
feed_settings.use_employer_exported_name as use_employer_exported_name,
integration_job_board_settings.integration_status as integration_status
FROM job_boards LEFT JOIN integration_job_board_settings ON
integration_job_board_settings.id =
job_boards.integration_job_board_setting_id LEFT JOIN enterprises ON
enterprises.id = job_boards.enterprise_id LEFT JOIN feed_settings ON
feed_settings.id = job_boards.feed_setting_id WHERE job_boards.id = $1
2023-01-01 09:07:31.624 UTC 1806321 ******* from [local] [vxid:176/21846997
txid:0] [BIND] DETAIL: parameters: $1 = '****'
===============================================================
What really curious in the log: that every of 100+ stuck in PARSE/BIND
stage queries that had been logged (and thus unstuck) in the same exact
moment... that highly likely means that they all had been stuck in the same
single place.
E.g. something locked the whole PARSE/BIND machinery (but not an EXECUTE)
for 200+ms.

Are the slow BINDs failing?
No, they all executed successfully later after being unstuck.

Are their parameters being logged ?
Yes.

What else is running besides postgres ?
Nothing else , dedicated DB server.

Are the DB clients local or remote ?
Remote all over a fast network.

It shouldn't matter, but what client library?
50% ROR (ruby on rails) / 50% java(jdbc).

Problem that issue happens only few times per 24 hour and usual duration
under 1second
so it very hard to catch problem with perf or gdb or strace.

--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message MichaelDBA 2023-01-01 13:27:34 Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)
Previous Message Justin Pryzby 2022-12-31 14:32:03 Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)