Re: snapshot too old issues, first around wraparound and then more.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <kgrittn(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: snapshot too old issues, first around wraparound and then more.
Date: 2020-04-01 22:30:39
Message-ID: CAH2-Wz=FQ9rbBKkt1nXvz27kmd4A8i1+7dcLTNqpCYibxX83VQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 1, 2020 at 3:00 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I like that idea. I think that I've spotted what may be an independent
> bug, but I have to wait around for a minute or two to reproduce it
> each time. Makes it hard to get to a minimal test case.

I now have simple steps to reproduce a bug when I start Postgres
master with "--old_snapshot_threshold=1" (1 minute).
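(For reference, one way to start a server like this with pg_ctl -- the data directory and log path below are placeholders, so adjust for your setup:)

```shell
# old_snapshot_threshold is interpreted in minutes here; 1 = snapshots
# become eligible for "snapshot too old" errors after ~1 minute
pg_ctl -D "$PGDATA" -l /tmp/logfile start -o "--old_snapshot_threshold=1"
```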

This example shows wrong answers to queries in session 2:

Session 1:

pg(at)regression:5432 [1444078]=# create table snapshot_bug (col int4);
CREATE TABLE
pg(at)regression:5432 [1444078]=# create index on snapshot_bug (col );
CREATE INDEX
pg(at)regression:5432 [1444078]=# insert into snapshot_bug select i from
generate_series(1, 500) i;
INSERT 0 500

Session 2 starts, and views the data in a serializable transaction:

pg(at)regression:5432 [1444124]=# begin isolation level serializable ;
BEGIN
pg(at)regression:5432 [1444124]=*# select col from snapshot_bug where col
>= 0 order by col limit 14;
┌─────┐
│ col │
├─────┤
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
│ 10 │
│ 11 │
│ 12 │
│ 13 │
│ 14 │
└─────┘
(14 rows)

So far so good. Now session 1 continues:

pg(at)regression:5432 [1444078]=# delete from snapshot_bug where col < 15;
DELETE 14

Session 2:

(repeats the same "select col from snapshot_bug where col >= 0 order
by col limit 14" query every 100 ms using psql's \watch 0.1)

Session 1:

pg(at)regression:5432 [1444078]=# vacuum snapshot_bug ;
VACUUM

Before too long, we see the following over in session 2 -- the answer
the query gives changes, even though this is a serializable
transaction:

Wed 01 Apr 2020 03:12:59 PM PDT (every 0.1s)

┌─────┐
│ col │
├─────┤
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
│ 10 │
│ 11 │
│ 12 │
│ 13 │
│ 14 │
└─────┘
(14 rows)

Wed 01 Apr 2020 03:13:00 PM PDT (every 0.1s)

┌─────┐
│ col │
├─────┤
│ 15 │
│ 16 │
│ 17 │
│ 18 │
│ 19 │
│ 20 │
│ 21 │
│ 22 │
│ 23 │
│ 24 │
│ 25 │
│ 26 │
│ 27 │
│ 28 │
└─────┘
(14 rows)

Wed 01 Apr 2020 03:13:00 PM PDT (every 0.1s)

┌─────┐
│ col │
├─────┤
│ 15 │
│ 16 │
│ 17 │
│ 18 │
│ 19 │
│ 20 │
│ 21 │
│ 22 │
│ 23 │
│ 24 │
│ 25 │
│ 26 │
│ 27 │
│ 28 │
└─────┘
(14 rows)

We continue to get this wrong answer for almost another minute (at
least on this occasion). Eventually we get "snapshot too old". Note
that the answer changes when we cross the one-minute threshold.
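For convenience, the steps above can be condensed into a script sketch (untested; it assumes a server already started with --old_snapshot_threshold=1 and a database named "regression" that psql can connect to -- session 2 has to stay interactive so that its serializable snapshot is held open):

```shell
#!/usr/bin/env bash
# Sketch of the repro; connection details are placeholders.
set -euo pipefail

# Session 1: set up the table and index, and load 500 rows.
psql regression <<'SQL'
create table snapshot_bug (col int4);
create index on snapshot_bug (col);
insert into snapshot_bug select i from generate_series(1, 500) i;
SQL

# Session 2 (run in a separate terminal; it must keep its snapshot open):
#   begin isolation level serializable;
#   select col from snapshot_bug where col >= 0 order by col limit 14;
#   \watch 0.1

# Session 1: delete the rows session 2's snapshot should still see,
# then vacuum so the deleted tuples can be pruned early.
psql regression -c 'delete from snapshot_bug where col < 15;'
psql regression -c 'vacuum snapshot_bug;'

# Now watch session 2: within about a minute its repeated query starts
# returning rows 15..28 instead of 1..14, despite SERIALIZABLE, before
# eventually failing with "snapshot too old".
```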

Nothing that Andres explained to me contributed to finding this bug
(it could be a known bug, though I don't think it is). It took me a
surprisingly short amount of time to stumble upon it -- and not
because I have good intuitions about how to break the feature.

--
Peter Geoghegan
