Reduce/eliminate the impact of FPW

From: Daniel Wood <hexexpert(at)comcast(dot)net>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Reduce/eliminate the impact of FPW
Date: 2020-08-03 05:53:07
Message-ID: 775560090.48884.1596433987597@connect.xfinity.com
Lists: pgsql-hackers

I thought that the biggest reason for the pgbench RW slowdown during a checkpoint was the flood of dirty page writes increasing the COMMIT latency. It turns out that the documentation, which states that FPW's start "after a checkpoint", really means after a CKPT starts. And this is the real cause of the deep dip in performance. Maybe only I was fooled... :-)
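
A toy model of why the dip begins when the CKPT starts rather than when it ends. It assumes, as I understand the WAL insert path, that a full-page image is attached when the page's LSN is not newer than the redo pointer, and that the redo pointer moves at the start of the checkpoint. The function names and the integer LSNs are illustrative only, not PostgreSQL's API.

```python
def needs_full_page_image(page_lsn, redo_ptr, fpw_enabled=True):
    """First change to a page since the current redo point gets a full image."""
    return fpw_enabled and page_lsn <= redo_ptr


def simulate(num_pages, touches, ckpt_start_at):
    """Touch pages in order; the redo pointer advances when the checkpoint STARTS."""
    page_lsn = [0] * num_pages   # LSN of the last change to each page
    redo_ptr = 0                 # redo point of the most recent checkpoint
    lsn = 0                      # current WAL insert position (toy counter)
    wrote_image = []
    for i, page in enumerate(touches):
        if i == ckpt_start_at:
            redo_ptr = lsn       # checkpoint start: every page is "old" again
        lsn += 1
        wrote_image.append(needs_full_page_image(page_lsn[page], redo_ptr))
        page_lsn[page] = lsn
    return wrote_image


if __name__ == "__main__":
    # Two hot pages touched alternately; the checkpoint starts before touch 4.
    print(simulate(2, [0, 1, 0, 1, 0, 1], ckpt_start_at=4))
    # [True, True, False, False, True, True]
```

In this model every hot page pays for one full image right after the checkpoint starts, which is where the burst of extra WAL volume comes from.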

If we can't eliminate FPW's, can we at least reduce their impact? Instead of writing the before images of pages inline into the WAL, which increases the COMMIT latency, write these same images to a separate physical log file. The key idea is that I don't believe that COMMIT's require these buffers to be immediately flushed to the physical log. We only need to flush them before the corresponding dirty pages are written. This delay allows the physical before image IO's to be decoupled and done in an efficient manner without an impact to COMMIT's.

1. When we generate a physical image, add it to an in-memory buffer of before page images.
2. Put the physical log offset of the before image into the WAL record. This is the current physical log file size plus the offset in the in-memory buffer of pages.
3. Set a bit in the BufHdr indicating this was done.
4. COMMIT's do not need to worry about those buffers.
5. Periodically flush the in-memory buffer and clear the bit in the BufHdr.
6. During any dirty page flushing if we see the bit set, which should be rare, then make sure we get our before image flushed. This would be similar to our LSN based XLogFlush().

Do we need these before images for more than one CKPT? I don't think so. Do PITR's require before images since it is a continuous rollforward from a restore? Just some considerations.
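
The six steps above can be sketched as a small in-memory simulation. This is only a sketch under stated assumptions: `PhysicalLog`, `Buffer`, `image_pending` and the helper functions are hypothetical names invented for illustration, no real IO or fsync is performed, and crash recovery and replication are not modeled.

```python
PAGE_SIZE = 8192


class PhysicalLog:
    """Hypothetical side log of before images, separate from the WAL."""

    def __init__(self):
        self.flushed_size = 0        # bytes already durable in the log file
        self.pending = bytearray()   # step 1: in-memory buffer of images
        self.flush_calls = 0

    def append(self, image):
        # Step 2: offset = current log file size + offset in the memory buffer.
        offset = self.flushed_size + len(self.pending)
        self.pending += image
        return offset

    def flush(self):
        # Step 5: one large write plus fsync would go here.
        if self.pending:
            self.flushed_size += len(self.pending)
            self.pending.clear()
            self.flush_calls += 1

    def flush_upto(self, offset):
        # Step 6: similar in spirit to an LSN-based XLogFlush().
        if offset >= self.flushed_size:
            self.flush()


class Buffer:
    def __init__(self, page_no):
        self.page_no = page_no
        self.data = bytes(PAGE_SIZE)
        self.dirty = False
        self.image_pending = False   # step 3: the bit in the buffer header
        self.image_offset = None


def modify_page(buf, plog, wal, new_data, needs_image):
    offset = None
    if needs_image:
        offset = plog.append(buf.data)      # steps 1 and 2
        buf.image_offset = offset
        buf.image_pending = True            # step 3
    wal.append(("update", buf.page_no, offset))
    buf.data, buf.dirty = new_data, True


def commit(wal):
    # Step 4: only the WAL is flushed; the physical log is left alone.
    wal.append(("commit",))


def periodic_flush(plog, buffers):
    # Step 5: flush the image buffer in the background and clear the bits.
    plog.flush()
    for buf in buffers:
        buf.image_pending = False


def write_dirty_page(buf, plog):
    # Step 6: the before image must be durable before the page is overwritten.
    if buf.image_pending:
        plog.flush_upto(buf.image_offset)
        buf.image_pending = False
    buf.dirty = False
```

In this sketch a COMMIT never touches the physical log, and a dirty page write forces at most one flush, which also covers every image buffered before it.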

Do I need to back this physical log up? I likely(?) need to deal with replication.

Turning off FPW gives about a 20%, maybe more, boost on a pgbench TPC-B RW workload which fits in the buffer cache. Can I get this 20% improvement with a separate physical log of before page images?

Doing IO's off on the side, but decoupled from the WAL stream, doesn't seem to impact COMMIT latency on modern SSD based storage systems. For instance, you can hammer a shared data and WAL SSD filesystem with dirty page writes from the CKPT, at near the MAX IOPS of the SSD, and not impact COMMIT latency. However, this presumes that the CKPT's natural spreading of dirty page writes across the CKPT target doesn't push too many outstanding IO's into the storage write Q on the OS/device.
NOTE: I don't believe the CKPT's throttling is perfect and I think a burst of dirty pages into the cache just before a CKPT might cause the Q to be flooded and this would then also further slow TPS during the CKPT. But a fix to this is off topic from the FPW issue.
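
A back-of-the-envelope model of the throttling concern in the NOTE above. It assumes the CKPT simply spreads its dirty pages evenly over `checkpoint_timeout * checkpoint_completion_target` seconds, which ignores WAL-volume-triggered checkpoints and the real pacing logic. The page counts are made up for illustration and are not measurements.

```python
def required_write_rate(dirty_pages, checkpoint_timeout_s, completion_target):
    """Pages per second needed to finish the checkpoint writes on schedule."""
    return dirty_pages / (checkpoint_timeout_s * completion_target)


if __name__ == "__main__":
    # Illustrative numbers only: 90,000 dirty pages at checkpoint start.
    steady = required_write_rate(90_000, 300, 0.5)
    # Same window, but a burst dirtied 45,000 more pages just before the CKPT.
    burst = required_write_rate(90_000 + 45_000, 300, 0.5)
    print(steady, burst)  # 600.0 900.0
```

The window does not grow with the burst, so the required rate, and with it the number of outstanding IO's in the write Q, goes up by the same factor as the dirty page count.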

Thanks to Andres Freund for both making me aware of the Q depth impact on COMMIT latency and the hint that FPW might also be causing the CKPT slowdown. FYI, I always knew about FPW slowdown in general but I just didn't realize it was THE primary cause of CKPT TPS slowdown on pgbench. NOTE: I realize that spinning media might exhibit different behavior. And I did not say dirty page writing has NO impact on good SSD's. It depends, and this is a subject for a later date, as I have a theory as to why I sometimes see a sawtooth performance for pgbench TPC-B and sometimes a square wave, but I want to prove it first.
