Postgres v16.4 crashes on segfault when memory >= 16gb

From: Weslley Braga <wbraga(at)implentio(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Postgres v16.4 crashes on segfault when memory >= 16gb
Date: 2024-08-21 19:53:00
Message-ID: CAKeJUPB1E16fGBrXTO3JKuPv3VW0wywCDHd3Ece2h2uHwtJSkA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi Postgres team,

I've been observing for the past month. I'm running PostgreSQL in
Kubernetes through the CloudNativePG project. When I bump the pods resource
configuration to anything above 16gb I notice that complex queries such as
the one shared below cause the runtime to fail with a segfault see logs for
details. The same query with a configuration < 16gb memory works ok.

Segfault log:

server process (PID 197) was terminated by signal 11: Segmentation
fault","detail":"Failed │
│ process was running: -- Define the variables\nWITH vars AS (\n
SELECT\n 'f6dd1d │
│ 37-d0c6-4295-b0e6-6467437ef464'::uuid AS brand_id,\n
'I-39438' AS invoice_number,\n │
│ 'SO5626760' AS sales_order_number\n),\nrelated_ids AS (\n
SELECT DISTINCT\n │
│ f.id AS fulfillment_id,\n fp.id AS
fulfillment_package_id,\n fpli.id AS │
│ fulfillment_package_line_item_id,\n f.sales_order_number,\n
f.brand_id,\n │
│ li.number AS invoice_number\n FROM\n cdm.fulfillment
AS f\n JOIN\n │
│ cdm.logistics_invoice AS li ON f.logistics_invoice_id = li.id\n
JOIN\n cdm.ful │
│ fillment_package AS fp ON f.id = fp.fulfillment_id\n JOIN\n
cdm.fulfillment_pack │
│ age_line_item AS fpli ON fp.id = fpli.fulfillment_package_id\n
CROSS JOIN\n vars │
│ \n WHERE\n f.sales_order_number =
vars.sales_order_number\n AND f.brand_i │
│ d = vars.brand_id\n AND li.number =
vars.invoice_number\n),\nreconciled_charges AS │
│ (\n SELECT\n ri.sales_order_number,\n
ri.brand_id,\n ri.invoice_n" │
│ ,"backend_type":"postmaster","query_id":"0"}}

Query:

WITH vars AS (
SELECT
'f6dd1d37-d0c6-4295-b0e6-6467437ef464'::uuid AS brand_id,
'I-39438' AS invoice_number,
'SO5626760' AS sales_order_number
),
related_ids AS (
SELECT DISTINCT
f.id AS fulfillment_id,
fp.id AS fulfillment_package_id,
fpli.id AS fulfillment_package_line_item_id,
f.sales_order_number,
f.brand_id,
li.number AS invoice_number
FROM
cdm.fulfillment AS f
JOIN
cdm.logistics_invoice AS li ON f.logistics_invoice_id = li.id
JOIN
cdm.fulfillment_package AS fp ON f.id = fp.fulfillment_id
JOIN
cdm.fulfillment_package_line_item AS fpli ON fp.id =
fpli.fulfillment_package_id
CROSS JOIN
vars
WHERE
f.sales_order_number = vars.sales_order_number
AND f.brand_id = vars.brand_id
AND li.number = vars.invoice_number
),
reconciled_charges AS (
SELECT
ri.sales_order_number,
ri.brand_id,
ri.invoice_number,
rsc.difference_root_cause,
rsc.difference_amount,
COALESCE(sct_real.code, sct_synth.code) AS service_charge_type_code,
CASE
WHEN rsc.real_service_charge_id IS NOT NULL THEN 'REAL'
WHEN rsc.synthetic_service_charge_id IS NOT NULL THEN 'SYNTHETIC'
END AS charge_type
FROM
cdm.reconciled_service_charge AS rsc
LEFT JOIN
cdm.service_charge AS sc ON rsc.real_service_charge_id = sc.id
LEFT JOIN
cdm.service_charge_type AS sct_real ON
sc.service_charge_type_id = sct_real.id
LEFT JOIN
cdm.synthetic_service_charge AS ssc ON
rsc.synthetic_service_charge_id = ssc.id
LEFT JOIN
cdm.service_charge_type AS sct_synth ON ssc.charge_type_id =
sct_synth.id
INNER JOIN
related_ids AS ri ON
(sc.applied_to_type = 'Fulfillment' AND sc.applied_to_id =
ri.fulfillment_id) OR
(sc.applied_to_type = 'FulfillmentPackage' AND
sc.applied_to_id = ri.fulfillment_package_id) OR
(sc.applied_to_type = 'FulfillmentPackageLineItem' AND
sc.applied_to_id = ri.fulfillment_package_line_item_id) OR
(ssc.applied_to_type = 'SYNTHETIC_FULFILLMENT' AND
ssc.applied_to_id = ri.fulfillment_id) OR
(ssc.applied_to_type = 'SYNTHETIC_FULFILLMENT_PACKAGE' AND
ssc.applied_to_id = ri.fulfillment_package_id) OR
(ssc.applied_to_type =
'SYNTHETIC_FULFILLMENT_PACKAGE_LINE_ITEM' AND ssc.applied_to_id =
ri.fulfillment_package_line_item_id)
WHERE
rsc.brand_id = (SELECT brand_id FROM vars) AND
rsc.difference_status <> 'MATCHING'
)
SELECT
sales_order_number,
brand_id,
invoice_number,
difference_root_cause,
SUM(difference_amount) AS total_difference_amount,
SUM(CASE WHEN difference_amount < 0 THEN difference_amount ELSE 0
END) AS total_unfavorable_amount,
SUM(CASE WHEN difference_amount > 0 THEN difference_amount ELSE 0
END) AS total_favorable_amount,
COUNT(CASE WHEN difference_amount < 0 THEN 1 END) AS unfavorable_count,
COUNT(CASE WHEN difference_amount > 0 THEN 1 END) AS favorable_count,
COUNT(*) AS total_error_count,
ARRAY_AGG(DISTINCT service_charge_type_code) AS services,
ARRAY_AGG(DISTINCT charge_type) AS charge_types
FROM
reconciled_charges
GROUP BY
sales_order_number, brand_id, invoice_number, difference_root_cause
ORDER BY
total_error_count DESC;

Here is my yaml manifest for the cnpg cluster:

apiVersion: postgresql.cnpg.io/v1kind: Clustermetadata:
annotations:
config.linkerd.io/skip-inbound-ports: "5432"
panfactum.com/admin-role: admin-implentio-pg-7649
panfactum.com/db: "true"
panfactum.com/db-type: PostgreSQL
panfactum.com/reader-role: reader-implentio-pg-7649
panfactum.com/service: pg-7649-pooler-rw.implentio
panfactum.com/service-port: "5432"
panfactum.com/superuser-role: superuser-implentio-pg-7649
panfactum.com/vault-mount: db/implentio-pg-7649
creationTimestamp: "2024-07-01T15:44:34Z"
generation: 70
labels:
id: pg-pg-7649-0cbb8555d636b38a
panfactum.com/environment: development
panfactum.com/local: "false"
panfactum.com/module: kube_pg_cluster
panfactum.com/prevent-lifetime-eviction: "true"
panfactum.com/region: us-west-2
panfactum.com/root-module: implentio_db
panfactum.com/scheduler: "true"
panfactum.com/stack-commit: d89564842ca3d4275e82c69ff8f01c640b566d0d
panfactum.com/stack-version: edge.24-08-12
panfactum.com/workload: pg-pg-7649
name: pg-7649
namespace: implentio
resourceVersion: "69072821"
uid: ddb60f66-4b90-419f-8f4d-c3e42d716f55spec:
affinity:
enablePodAntiAffinity: true
podAntiAffinityType: required
tolerations:
- effect: NoSchedule
key: spot
operator: Equal
value: "true"
- effect: NoSchedule
key: burstable
operator: Equal
value: "true"
- effect: NoSchedule
key: arm64
operator: Equal
value: "true"
topologyKey: node.kubernetes.io/instance-type
backup:
barmanObjectStore:
data:
compression: bzip2
jobs: 8
destinationPath: s3://implentio-pg-7649-backup-6b62299e849216e0/
s3Credentials:
inheritFromIAMRole: true
wal:
compression: bzip2
maxParallel: 8
retentionPolicy: 7d
target: prefer-standby
bootstrap:
initdb:
database: app
encoding: UTF8
localeCType: C
localeCollate: C
owner: app
postInitApplicationSQL:
- REVOKE ALL ON SCHEMA public FROM PUBLIC;
- CREATE ROLE reader NOINHERIT;
- GRANT pg_read_all_data TO reader;
- GRANT USAGE ON SCHEMA public TO reader;
- GRANT USAGE ON SCHEMA cdm TO reader;
- GRANT USAGE ON SCHEMA client TO reader;
- CREATE ROLE writer NOINHERIT;
- GRANT pg_write_all_data, pg_read_all_data TO writer;
- GRANT ALL PRIVILEGES ON SCHEMA public TO writer;
- GRANT ALL PRIVILEGES ON SCHEMA cdm TO writer;
- GRANT ALL PRIVILEGES ON SCHEMA client TO writer;
- GRANT CONNECT ON DATABASE app TO cnpg_pooler_pgbouncer;
- GRANT ALL PRIVILEGES ON SCHEMA public TO cnpg_pooler_pgbouncer;
- GRANT ALL PRIVILEGES ON SCHEMA cdm TO cnpg_pooler_pgbouncer;
- GRANT ALL PRIVILEGES ON SCHEMA client TO cnpg_pooler_pgbouncer;
- CREATE OR REPLACE FUNCTION user_search(uname TEXT) RETURNS
TABLE (usename
name, passwd text) LANGUAGE sql SECURITY DEFINER AS 'SELECT
usename, passwd
FROM pg_shadow WHERE usename=$1;'
- REVOKE ALL ON FUNCTION user_search(text) FROM public;
- GRANT EXECUTE ON FUNCTION user_search(text) TO cnpg_pooler_pgbouncer;
postInitSQL:
- REVOKE ALL ON SCHEMA public FROM PUBLIC;
- CREATE ROLE cnpg_pooler_pgbouncer WITH LOGIN;
- GRANT ALL PRIVILEGES ON SCHEMA public TO cnpg_pooler_pgbouncer;
- GRANT ALL PRIVILEGES ON SCHEMA cdm TO cnpg_pooler_pgbouncer;
- GRANT ALL PRIVILEGES ON SCHEMA client TO cnpg_pooler_pgbouncer;
- GRANT CONNECT ON DATABASE postgres TO cnpg_pooler_pgbouncer;
- CREATE OR REPLACE FUNCTION user_search(uname TEXT) RETURNS
TABLE (usename
name, passwd text) LANGUAGE sql SECURITY DEFINER AS 'SELECT
usename, passwd
FROM pg_shadow WHERE usename=$1;'
- REVOKE ALL ON FUNCTION user_search(text) FROM public;
- GRANT EXECUTE ON FUNCTION user_search(text) TO cnpg_pooler_pgbouncer;
certificates:
clientCASecret: pg-client-certs-331c
replicationTLSSecret: pg-client-certs-331c
serverCASecret: pg-server-certs-b04c
serverTLSSecret: pg-server-certs-b04c
enablePDB: false
enableSuperuserAccess: true
failoverDelay: 5
imageName: 730335560480.dkr.ecr.us-west-2.amazonaws.com/github/cloudnative-pg/postgresql:16.4
inheritedMetadata:
annotations:
config.linkerd.io/skip-inbound-ports: "5432"
linkerd.io/inject: enabled
resize.topolvm.io/increase: 100%
resize.topolvm.io/storage_limit: 200Gi
resize.topolvm.io/threshold: 35%
labels:
id: pg-pg-7649-0cbb8555d636b38a
panfactum.com/environment: development
panfactum.com/local: "false"
panfactum.com/module: kube_pg_cluster
panfactum.com/prevent-lifetime-eviction: "true"
panfactum.com/region: us-west-2
panfactum.com/root-module: implentio_db
panfactum.com/scheduler: "true"
panfactum.com/stack-commit: d89564842ca3d4275e82c69ff8f01c640b566d0d
panfactum.com/stack-version: edge.24-08-12
panfactum.com/workload: pg-pg-7649
pg-cluster: implentio-pg-7649
instances: 3
logLevel: info
maxSyncReplicas: 0
minSyncReplicas: 0
monitoring:
customQueriesConfigMap:
- key: queries
name: cnpg-default-monitoring
disableDefaultQueries: false
enablePodMonitor: true
postgresGID: 26
postgresUID: 26
postgresql:
parameters:
archive_mode: "on"
archive_timeout: 5min
dynamic_shared_memory_type: posix
effective_cache_size: 6400MB
log_destination: csvlog
log_directory: /controller/log
log_filename: postgres
log_rotation_age: "0"
log_rotation_size: "0"
log_truncate_on_rotation: "false"
logging_collector: "on"
maintenance_work_mem: 1600MB
max_connections: "150"
max_parallel_workers: "32"
max_replication_slots: "32"
max_worker_processes: "32"
shared_buffers: 4000MB
shared_memory_type: mmap
shared_preload_libraries: ""
ssl_max_protocol_version: TLSv1.3
ssl_min_protocol_version: TLSv1.3
wal_keep_size: 1024MB
wal_level: logical
wal_log_hints: "on"
wal_receiver_timeout: 5s
wal_sender_timeout: 5s
work_mem: 27MB
syncReplicaElectionConstraint:
enabled: false
primaryUpdateMethod: switchover
primaryUpdateStrategy: unsupervised
priorityClassName: database
replicationSlots:
highAvailability:
enabled: true
slotPrefix: _cnpg_
synchronizeReplicas:
enabled: true
updateInterval: 30
resources:
limits:
memory: 20800Mi
requests:
cpu: "6"
memory: 16000Mi
schedulerName: panfactum
serviceAccountTemplate:
metadata:
annotations:
eks.amazonaws.com/role-arn:
arn:aws:iam::730335560480:role/pg-7649-20240701154426349300000002
smartShutdownTimeout: 5
startDelay: 600
stopDelay: 70
storage:
pvcTemplate:
resources:
requests:
storage: 20Gi
storageClassName: ebs-standard
resizeInUseVolumes: true
superuserSecret:
name: pg-7649-superuser-01bb512a09da19f2a01a4f1e79a5bb44dff309cf02c7afbb7b84aca8298765d4
switchoverDelay: 70
topologySpreadConstraints:
- labelSelector:
matchLabels:
id: pg-pg-7649-0cbb8555d636b38a
maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedulestatus:
availableArchitectures:
- goArch: amd64
hash: 94527128605ac5100415106fe26c480531d094b3f36626e562a8135f342b89e4
- goArch: arm64
hash: 9b7b08592e917ed3b20bb3ae404ea4c0c958bdee73e5411c452d6c464d77f0b4
certificates:
clientCASecret: pg-client-certs-331c
expirations:
pg-client-certs-331c: 2024-09-02 15:44:29 +0000 UTC
pg-server-certs-b04c: 2024-09-02 15:44:29 +0000 UTC
replicationTLSSecret: pg-client-certs-331c
serverAltDNSNames:
- pg-7649-rw
- pg-7649-rw.implentio
- pg-7649-rw.implentio.svc
- pg-7649-r
- pg-7649-r.implentio
- pg-7649-r.implentio.svc
- pg-7649-ro
- pg-7649-ro.implentio
- pg-7649-ro.implentio.svc
serverCASecret: pg-server-certs-b04c
serverTLSSecret: pg-server-certs-b04c
cloudNativePGCommitHash: 336ddf53
cloudNativePGOperatorHash:
9b7b08592e917ed3b20bb3ae404ea4c0c958bdee73e5411c452d6c464d77f0b4
conditions:
- lastTransitionTime: "2024-08-21T17:28:36Z"
message: Cluster is Ready
reason: ClusterIsReady
status: "True"
type: Ready
- lastTransitionTime: "2024-08-21T17:28:03Z"
message: Continuous archiving is working
reason: ContinuousArchivingSuccess
status: "True"
type: ContinuousArchiving
- lastTransitionTime: "2024-08-18T02:06:07Z"
message: Backup was successful
reason: LastBackupSucceeded
status: "True"
type: LastBackupSucceeded
configMapResourceVersion:
metrics:
cnpg-default-monitoring: "6071197"
currentPrimary: pg-7649-2
currentPrimaryTimestamp: "2024-08-21T17:28:01.989805Z"
firstRecoverabilityPoint: "2024-08-11T00:10:35Z"
firstRecoverabilityPointByMethod:
barmanObjectStore: "2024-08-11T00:10:35Z"
healthyPVC:
- pg-7649-1
- pg-7649-2
- pg-7649-5
image: 730335560480.dkr.ecr.us-west-2.amazonaws.com/github/cloudnative-pg/postgresql:16.4
instanceNames:
- pg-7649-1
- pg-7649-2
- pg-7649-5
instances: 3
instancesReportedState:
pg-7649-1:
isPrimary: false
timeLineID: 192
pg-7649-2:
isPrimary: true
timeLineID: 192
pg-7649-5:
isPrimary: false
timeLineID: 191
instancesStatus:
healthy:
- pg-7649-1
- pg-7649-5
replicating:
- pg-7649-2
lastFailedBackup: "2024-07-28T00:09:04Z"
lastSuccessfulBackup: "2024-08-18T02:06:06Z"
lastSuccessfulBackupByMethod:
barmanObjectStore: "2024-08-18T02:06:06Z"
latestGeneratedNode: 5
managedRolesStatus: {}
phase: Cluster in healthy state
poolerIntegrations:
pgBouncerIntegration: {}
pvcCount: 3
readService: pg-7649-r
readyInstances: 2
secretsResourceVersion:
applicationSecretVersion: "61181252"
clientCaSecretVersion: "66383294"
replicationSecretVersion: "66383294"
serverCaSecretVersion: "66383293"
serverSecretVersion: "66383293"
superuserSecretVersion: "61180981"
switchReplicaClusterStatus: {}
targetPrimary: pg-7649-2
targetPrimaryTimestamp: "2024-08-21T17:27:58.902780Z"
timelineID: 192
topology:
instances:
pg-7649-1: {}
pg-7649-2: {}
pg-7649-5: {}
nodesUsed: 3
successfullyExtracted: true
writeService: pg-7649-rw

I'm glad to share more information to further help identify the issue. I've
also extracted core dumps from the error.

core_dumps.tar
<https://drive.google.com/file/d/1W6sZ5X5oWjuktBSCkuWgpt9Kxe3HIIs2/view?usp=drive_web>

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2024-08-21 21:31:37 Re: Postgres v16.4 crashes on segfault when memory >= 16gb
Previous Message Alvaro Herrera 2024-08-21 17:21:49 Re: BUG #18507: See C include file "ntstatus.h" for a description of the hexadecimal value.