Re: Direct I/O

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Noah Misch <noah(at)leadboat(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Direct I/O
Date: 2023-04-10 02:57:41
Message-ID: 20230410025741.whvq7w5ev4ficjuk@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-04-10 00:17:12 +1200, Thomas Munro wrote:
> I think there are two separate bad phenomena.
>
> 1. A concurrent modification of the user space buffer while writing
> breaks the checksum so you can't read the data back in. I can
> reproduce that with a stand-alone program, attached. The "verifier"
> process occasionally reports EIO while reading, unless you comment out
> the "scribbler" process's active line. The system log/dmesg gets some
> warnings.
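
For anyone who doesn't want to dig out the attachment, the shape of such a
test is roughly the below - a sketch of the general pattern, not Thomas's
actual program; the file name, block size and scribble pattern are made up:

/*
 * Sketch: a writer repeatedly overwrites one 8kB block with O_DIRECT while a
 * forked "scribbler" keeps modifying the same shared buffer and a "verifier"
 * re-reads the block.  On btrfs the read can fail with EIO because the data
 * on disk no longer matches the checksum the filesystem computed for it.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BLKSZ 8192

int
main(void)
{
    /* buffer shared across fork(), so the scribbler sees the writer's memory */
    char *buf = mmap(NULL, BLKSZ, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    int fd = open("dio-scribble.dat", O_CREAT | O_RDWR | O_DIRECT, 0600);

    if (buf == MAP_FAILED || fd < 0)
        return 1;
    memset(buf, 'x', BLKSZ);

    if (fork() == 0)            /* scribbler: mutate the buffer being written */
        for (;;)
            buf[0]++;

    if (fork() == 0)            /* verifier: EIO here means a checksum mismatch */
    {
        char *rbuf;

        if (posix_memalign((void **) &rbuf, 4096, BLKSZ) != 0)
            return 1;
        for (;;)
            if (pread(fd, rbuf, BLKSZ, 0) < 0)
                perror("verifier pread");
    }

    for (;;)                    /* writer: hand the (changing) buffer to the kernel */
        if (pwrite(fd, buf, BLKSZ, 0) < 0)
            perror("writer pwrite");
}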

I think we really need to think about whether we eventually want to do
something to avoid modifying pages while IO is in progress. The only
alternative is for filesystems to make copies of everything in the IO path,
which is far from free (and obviously prevents using DMA for the whole
IO). The copy we do to avoid the same problem when checksums are enabled
shows up quite prominently in write-heavy profiles, so there's a "purely
postgres" reason to avoid these issues too.

> 2. The crake-style failure doesn't involve any reported checksum
> failures or errors, and I'm not sure if another process is even
> involved. I attach a complete syscall trace of a repro session. (I
> tried to get strace to dump 8192 byte strings, but then it doesn't
> repro, so we have only the start of the data transferred for each
> page.) Working back from the error message,
>
> ERROR: invalid page in block 78 of relation base/5/16384,
>
> we have a page at offset 638976, and we can find all system calls that
> touched that offset:
>
> [pid 26031] 23:26:48.521123 pwritev(50,
> [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=8192}], 1, 638976) = 8192
>
> [pid 26040] 23:26:48.568975 pwrite64(5,
> "\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4
> \0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192
>
> [pid 26040] 23:26:48.593157 pread64(6,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 8192, 638976) = 8192
>
> In between the write of non-zeros and the read of zeros, nothing seems
> to happen that could justify that, at least nothing I can grok, but perhaps
> someone else will see something that I'm missing. We pretty much just
> have the parallel worker scanning the table, and writing stuff out as
> it does it. This was obtained with:

Have you tried to write a reproducer for this that doesn't involve postgres?
It'd certainly be interesting to know the precise conditions for this. E.g.,
can this also happen without O_DIRECT, if cache pressure is high enough for
the page to get evicted soon after (potentially simulated with fadvise or
such)?
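
For concreteness, the non-O_DIRECT variant I have in mind would look roughly
like the below - completely untested, and the file name and constants are
arbitrary:

/*
 * Write a block through the page cache, force it out with
 * posix_fadvise(DONTNEED) to simulate cache pressure, then read it back and
 * check that the contents survived.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 8192

int
main(void)
{
    int fd = open("evict-test.dat", O_CREAT | O_RDWR, 0600);
    char wbuf[BLKSZ], rbuf[BLKSZ];

    if (fd < 0)
        return 1;

    for (long i = 0;; i++)
    {
        memset(wbuf, (int) (i & 0xff) + 1, BLKSZ);  /* never all-zeros */
        if (pwrite(fd, wbuf, BLKSZ, 0) != BLKSZ)
            return 1;
        fdatasync(fd);                              /* DONTNEED only drops clean pages */
        posix_fadvise(fd, 0, BLKSZ, POSIX_FADV_DONTNEED);
        if (pread(fd, rbuf, BLKSZ, 0) != BLKSZ)
            return 1;
        if (memcmp(wbuf, rbuf, BLKSZ) != 0)
        {
            fprintf(stderr, "mismatch at iteration %ld\n", i);
            return 1;
        }
    }
}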

We should definitely let the btrfs folks know of this issue... It's even
possible that this bug was introduced recently. What kernel version did you
repro this on, Thomas?

I wonder if we should have a postgres-io-torture program in our tree for some
of these things. We've found issues with our assumptions on several operating
systems and filesystems, without systematically looking, or even stressing IO
all that hard in our tests.

Greetings,

Andres Freund
