| From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> | 
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> | 
| Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: logical decoding / rewrite map vs. maxAllocatedDescs | 
| Date: | 2018-08-10 21:59:44 | 
| Message-ID: | 470adb65-5101-4659-d213-41bde1eef8f2@2ndquadrant.com | 
| Lists: | pgsql-hackers | 
On 08/10/2018 11:13 PM, Andres Freund wrote:
> On 2018-08-10 22:57:57 +0200, Tomas Vondra wrote:
>>
>>
>> On 08/09/2018 07:47 PM, Alvaro Herrera wrote:
>>> On 2018-Aug-09, Tomas Vondra wrote:
>>>
>>>> I suppose there are reasons why it's done this way, and admittedly the test
>>>> that happens to trigger this is a bit extreme (essentially running pgbench
>>>> concurrently with 'vacuum full pg_class' in a loop). I'm not sure it's
>>>> extreme enough to deem it not an issue, because people using many temporary
>>>> tables often deal with bloat by doing frequent vacuum full on catalogs.
>>>
>>> Actually, it seems to me that ApplyLogicalMappingFile is just leaking
>>> the file descriptor for no good reason.  There's a different
>>> OpenTransientFile call in ReorderBufferRestoreChanges that is not
>>> intended to be closed immediately, but the other one seems a plain bug,
>>> easy enough to fix.
>>>
>>
>> Indeed. Adding a CloseTransientFile to ApplyLogicalMappingFile solves
>> the issue with hitting maxAllocatedDescs. Barring objections I'll commit
>> this shortly.
> 
> Yea, that's clearly a bug. I've not seen a patch, so I can't quite
> formally sign off, but it seems fairly obvious.
> 
> 
>> But while running the tests on this machine, I repeatedly got pgbench
>> failures like this:
>>
>> client 2 aborted in command 0 of script 0; ERROR:  could not read block
>> 3 in file "base/16384/24573": read only 0 of 8192 bytes
>>
>> That kinda reminds me of the issues we're observing on some buildfarm
>> machines; I wonder if it's the same thing.
> 
> Oooh, that's interesting! What's the precise recipe that gets you there?
> 
I don't have an exact reproducer - it's kinda rare and unpredictable,
and I'm not sure how much it depends on the environment etc. But I'm
doing this:
1) one cluster with a publication (wal_level=logical)
2) one cluster with a subscription to (1)
3) a simple table, replicated from (1) to (2)
   -- publisher
   create table t (a serial primary key, b int, c int);
   create publication p for table t;
   -- subscriber
   create table t (a serial primary key, b int, c int);
   create subscription s CONNECTION '...' publication p;
4) pgbench inserting rows into the replicated table
   pgbench -n -c 4 -T 300 -p 5433 -f insert.sql test
5) at the same time, pgbench doing vacuum full on pg_class
   pgbench -n -f vacuum.sql -T 300 -p 5433 test
And once in a while I see failures like this:
   client 0 aborted in command 0 of script 0; ERROR:  could not read
   block 3 in file "base/16384/86242": read only 0 of 8192 bytes
   client 3 aborted in command 0 of script 0; ERROR:  could not read
   block 3 in file "base/16384/86242": read only 0 of 8192 bytes
   client 2 aborted in command 0 of script 0; ERROR:  could not read
   block 3 in file "base/16384/86242": read only 0 of 8192 bytes
or this:
   client 2 aborted in command 0 of script 0; ERROR:  could not read
   block 3 in file "base/16384/89369": read only 0 of 8192 bytes
   client 1 aborted in command 0 of script 0; ERROR:  could not read
   block 3 in file "base/16384/89369": read only 0 of 8192 bytes
I suspect there's some other ingredient, e.g. some manipulation of the
subscription. Or maybe it's not needed at all and I'm just imagining things.
regards
-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
| Attachment | Content-Type | Size | 
|---|---|---|
| vacuum.sql | application/sql | 21 bytes | 
| insert.sql | application/sql | 34 bytes | 