BUG #17704: Monitoring Queries trigger 'Signal 7: Bus Error' After Creating Hash Partitioned Table

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: garrett(at)rave(dot)io
Subject: BUG #17704: Monitoring Queries trigger 'Signal 7: Bus Error' After Creating Hash Partitioned Table
Date: 2022-12-01 21:23:06
Message-ID: 17704-601ac10a659e8855@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 17704
Logged by: Garrett Everding
Email address: garrett(at)rave(dot)io
PostgreSQL version: 14.6
Operating system: Ubuntu 22.04
Description:

Hi,

I'm not sure if this is the right place to post this and if this _isn't_ a
bug and is instead a misconfiguration I can repost somewhere else.

Today I rolled out a new hash partitioned table with 1024 partitions to our
production DB (PostgreSQL 14.6 (Ubuntu 14.6-1.pgdg22.04+1) on
aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04)
11.3.0, 64-bit). Previously we had been using pganalyze2 and pgmetrics with
no issue. Starting with the deployment of the new partitioned table anytime
pgmetrics or pganalyze2 would run we would get a SIGBUS and postgresql would
crash.

```
2022-12-01 17:48:29 UTC [915447]: user=,db=,app=,client= LOG: server
process (PID 1341581) was terminated by signal 7: Bus error
2022-12-01 17:48:29 UTC [915447]: user=,db=,app=,client= DETAIL: Failed
process was running: SET lock_timeout TO '100ms';SET statement_timeout TO
'5s';
select
(extract(epoch from now()) * 1e9)::int8 as epoch_ns,
quote_ident(table_schema)||'.'||quote_ident(table_name) as
tag_table,
md5((array_agg((c.*)::text order by ordinal_position))::text)
from (
SELECT current_database()::information_schema.sql_identifier AS
table_catalog,
nc.nspname::information_schema.sql_identifier AS table_schema,
c.relname::information_schema.sql_identifier AS table_name,
a.attname::information_schema.sql_identifier AS column_name,
a.attnum::information_schema.cardinal_number AS
ordinal_position,
pg_get_expr(ad.adbin,
ad.adrelid)::information_schema.character_data AS column_default,
CASE
WHEN a.attnotnull OR t.typtype = 'd'::"char" AND
t.typnotnull THEN 'NO'::text
ELSE 'YES'::text
END::information_schema.yes_or_no AS is_nullable,
CASE
WHEN t.typtype = 'd'::"char" THEN
CASE
WHEN bt.typelem <> 0::oid AND bt.typlen
2022-12-01 17:48:29 UTC [915447]: user=,db=,app=,client= LOG: terminating
any other active server processes
```
2022-12-01 18:21:43 UTC [915447]: user=,db=,app=,client= LOG: server
process (PID 1347091) was terminated by signal 7: Bus error
2022-12-01 18:21:43 UTC [915447]: user=,db=,app=,client= DETAIL: Failed
process was running:
SELECT
current_database() AS db, schemaname, tablename, reltuples::bigint
AS tups, relpages::bigint AS pages, otta,
ROUND(CASE WHEN otta=0 OR sml.relpages=0 OR sml.relpages=otta THEN
0.0 ELSE sml.relpages/otta::numeric END,1) AS tbloat,
CASE WHEN relpages < otta THEN 0 ELSE relpages::bigint - otta END
AS wastedpages,
CASE WHEN relpages < otta THEN 0 ELSE
bs*(sml.relpages-otta)::bigint END AS wastedbytes,
CASE WHEN relpages < otta THEN '0 bytes'::text ELSE
(bs*(relpages-otta))::bigint::text || ' bytes' END AS wastedsize,
iname, ituples::bigint AS itups, ipages::bigint AS ipages,
iotta,
ROUND(CASE WHEN iotta=0 OR ipages=0 OR ipages=iotta THEN 0.0 ELSE
ipages/iotta::numeric END,1) AS ibloat,
CASE WHEN ipages < iotta THEN 0 ELSE ipages::bigint - iotta END AS
wastedipages,
CASE WHEN ipages < iotta THEN 0 ELSE (bs*(ipages-iotta))::bigint
END AS wastedibytes,
CASE WHEN ipages < iotta THEN '0 bytes' ELSE
(bs*(ipages-iotta))::bigint::text || ' bytes' END AS wastedisize,
CASE WHEN relpages < ott
2022-12-01 18:21:43 UTC [915447]: user=,db=,app=,client= LOG: terminating
any other active server processes
```

On another connection open connection during one of the crashes we saw this
error:
```
insert into <table> select * from <other table>;
WARNING: terminating connection because of crash of another server
process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
SSL SYSCALL error: EOF detected
The connection to the server was lost. Attempting reset: Failed.
```

This is a very active db serving 30k-50k TPS with 7TB of data.

I haven't been able to repro this on our staging environment.

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Andres Freund 2022-12-01 21:59:03 Re: BUG #17704: Monitoring Queries trigger 'Signal 7: Bus Error' After Creating Hash Partitioned Table
Previous Message David G. Johnston 2022-12-01 20:43:14 Re: Bug in jsonb_path_exists (maybe _match) one-element scalar/variable jsonpath handling