Re: pt_toast table seems to be

From: Rui DeSousa <rui(dot)desousa(at)icloud(dot)com>
To: Jorge Daniel <elgaita(at)hotmail(dot)com>
Cc: "pgsql-admin(at)lists(dot)postgresql(dot)org" <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject: Re: pt_toast table seems to be
Date: 2018-01-03 19:23:37
Message-ID: 79C6454F-E7F3-4AB9-A329-BECB41835BD4@icloud.com
Lists: pgsql-admin

Do you mean the Postgres instance is killed or the Docker instance is killed? For the OOM events, what's the memory configuration and how much swap is allocated?
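
If it helps, the Postgres side of the memory configuration can be pulled straight from pg_settings; a quick sketch (add whatever other parameters matter in your setup):

    SELECT name, setting, unit
      FROM pg_settings
     WHERE name IN ('shared_buffers', 'work_mem',
                    'maintenance_work_mem', 'effective_cache_size',
                    'max_connections');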

I would look at your disk subsystem from the perspective of Postgres: when it issues a sync(), does your setup honor it? If your Docker instance is crashing, then it seems like sync() might not be honored, which would lead to the corruption issues you're describing.
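
To rule out the obvious, it's worth confirming that the durability settings haven't been relaxed inside the container; a minimal check from psql (pg_test_fsync, which ships with Postgres, will also tell you whether fsync actually reaches stable storage on that filesystem):

    SHOW fsync;
    SHOW full_page_writes;
    SHOW synchronous_commit;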

> On Jan 3, 2018, at 10:25 AM, Jorge Daniel <elgaita(at)hotmail(dot)com> wrote:
>
> Hi guys
> This is my first post in the community, so my apologies in advance regarding the formalities.
>
>
> In a production DB, after several OOM-killer events on the Postgres instance running in a Docker container (now fixed), we are left with what appears to be a corrupted pg_toast 😐:
>
>
> 2017-11-29 23:46:13.147 PST rhost=10.149.54.5(54750) app=PostgreSQL JDBC Driver:user=veon:db=veon:ERROR: unexpected chunk number 0 (expected 1) for toast value 17143928 in pg_toast_77809
> 2017-11-29 23:46:13.147 PST rhost=10.149.54.5(54750) app=PostgreSQL JDBC Driver:user=veon:db=veon:STATEMENT: SELECT "id", "needs_pumping", "needs_lucene", "surveybatchid", "ip_address", "is_open_proxy", "is_partial", "cookies", "social_media_source", "social_media_review_id", "social_media_pull_id", "social_media_pull_id_long", "source_config_unitid", "source_config_source_id", "updated", "external_id_printable_coder", "unique_reusable_id", "invite_errmsg", "reminder_errmsg", "last_seen_pagename", "last_submitted_pagename", "has_content_fields", "logid", "alertid", "language_name", "e_status", "e_reminder_status", "e_lastupdated", "e_creationdate", "e_sampleddate", "e_responsedate", "e_invitationdate", "reminderdate", "e_delivereddate", "e_remindeddate", "e_accepteddate", "e_initialfinishdate", "e_ta_completed_date", "e_expirationdate", "e_survey_method", "e_survey_source", "e_survey_type", "parse_status", "tagging_attempts", "e_optout", "e_bounce_category", "feed_fileid", "feed_file_recordid", "startdate", "e_lastname", "e_firstname", "e_address", "e_address2", "e_city", "e_state", "e_postalcode", "e_phone", "e_email", "is_mobile", "is_mobile_first", "is_mobile_finished", "is_cookie_confirmation_needed", "exclude_from_problem_tracker", "is_invitation_error", "page_number_seen_last", "page_number_seen_highest", "pages_submitted", "pages_validation_failed", "last_submit_date", "user_agent", "jsonb_fields", "thrift", "episodeid", "e_unitid", "e_committed_survey_specid", "customer_veonid", "xmin" FROM "survey_prior" WHERE survey_veon.surveyid = ANY($1)
>
> The usual fix is:
>
> REINDEX TABLE pg_toast_77809;
>
> After this we can select * from survey_prior with no problem, and we can dump the entire table if we wish.
>
> Then the jobs resume and we're happy until we hit the same error again. This occurs randomly; we suspect an UPDATE on a certain row is creating the corruption.
>
> We can't reproduce the ERROR because we can't do a full scan of the table while the error is still alive (the on-call guys apply the fix immediately to resume production).
>
> So how can we discover the bad rows and expose them, or how do we check the consistency of the pg_toast table?
>
> We need this evidence to ask for a maintenance window and a complete dump/restore that will wipe out this problem.
>
> Thx in advance
>
> Jorge Daniel Fernandez
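
Regarding the question above about exposing the bad rows: a minimal, untested sketch (the "id" column name is taken from your query; adjust to your schema) that walks survey_prior row by row, forces every column to be detoasted, and reports the rows whose TOAST data can't be read:

    DO $$
    DECLARE
      r record;
    BEGIN
      FOR r IN SELECT ctid, id FROM survey_prior LOOP
        BEGIN
          -- casting the whole row to text forces all toasted columns to be read
          PERFORM t::text FROM survey_prior t WHERE t.ctid = r.ctid;
        EXCEPTION WHEN others THEN
          RAISE NOTICE 'damaged row: ctid=%, id=%, error=%', r.ctid, r.id, SQLERRM;
        END;
      END LOOP;
    END $$;

It will be slow on a large table, but it gives you the list of ctids/ids to present as evidence before asking for the dump/restore window.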
