From: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
---|---|
To: | Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Annoying corruption in PostgreSQL. |
Date: | 2023-10-27 18:28:09 |
Message-ID: | fd2fb9bc-6dd2-5adc-1787-849e69a2f6c9@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 10/27/23 14:19, Kirill Reshke wrote:
> Hi hackers!
>
> We run a large amount of PostgreSQL clusters in our production. They
> differ by versions (we have 11-16 pg), load, amount of data, schema,
> etc. From time to time, postgresql corruption happens. It says
> ERROR,XX001,"missing chunk number 0 for toast value 18767319 in
> pg_toast_2619",,,,,,"vacuum full ;"
>
> in logs. the missing chunk number almost every is equal to zero, while
> other values vary. There are no known patterns, which triggers this
> issue. Moreover, if trying to rerun the VACUUM statement against
> relations from a log message, it succeeds all the time. So, we just
> ignore these errors. Maybe it is just some wierd data race?
>
> We don't know how to trigger this problem, or why it occurs. I'm not
> asking you to resolve this issue, but to help with debugging. What can
> we do to deduct failure reasons? Maybe we can add more logging somewhere
> (we can deploy a special patched PostgreSQL version everywhere), to have
> more information about the issue, when it happens next time?
>
For starters, it'd be good to know something about the environment, and
stuff that'd tell us if there's some possible pattern:
1) Which exact PG versions are you observing these errors on?
2) In the error example you shared it's pg_toast_2619, which is the
TOAST table for pg_statistic (probably). Is it always this relation? Or
what relations you noticed this for?
3) What kind of commands are triggering this? In the example it seems to
be vacuum full. Did you see it for other commands too? People generally
don't do VACUUM FULL very often, particularly not in environments with
concurrent activity.
Considering you don't know what's causing this, or what to look for, I
think it might be interesting to use pg_waldump, and investigate what
happened to the page containing the TOAST chunk and to the page
referencing it. Do you have physical backups and ability to do PITR?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Noah Misch | 2023-10-27 18:48:32 | Re: race condition in pg_class |
Previous Message | Tom Lane | 2023-10-27 18:00:38 | Re: Enderbury Island disappeared from timezone database |