From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: finding changed blocks using WAL scanning |
Date: | 2019-04-24 14:10:20 |
Message-ID: | 20190424141020.fym7orivwhrmphys@development |
Lists: | pgsql-hackers |
On Wed, Apr 24, 2019 at 09:25:12AM -0400, Robert Haas wrote:
>On Mon, Apr 22, 2019 at 9:51 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> For this particular use case, wouldn't you want to read the WAL itself
>> and use that to issue prefetch requests? Because if you use the
>> .modblock files, the data file blocks will end up in memory but the
>> WAL blocks won't, and you'll still be waiting for I/O.
>
>I'm still interested in the answer to this question, but I don't see a
>reply that specifically concerns it. Apologies if I have missed one.
>
I don't think prefetching WAL blocks is all that important. The WAL
segment was probably received fairly recently (either from the primary or
from the archive), so it's reasonable to assume it's still in page cache.
And even if it's not, sequential reads are handled pretty well by
readahead, which is itself a form of prefetching.

But even if WAL prefetching were useful in some cases, I think it's a
mostly orthogonal issue - it certainly does not make prefetching of data
pages unnecessary.
>Stepping back a bit, I think that the basic issue under discussion
>here is how granular you want your .modblock files. At one extreme,
>one can imagine an application that wants to know exactly which blocks
>were accessed at exactly which LSNs. At the other extreme, if you want
>to run a daily incremental backup, you just want to know which blocks
>have been modified between the start of the previous backup and the
>start of the current backup - i.e. sometime in the last ~24 hours.
>These are quite different things. When you only want approximate
>information - is there a chance that this block was changed within
>this LSN range, or not? - you can sort and deduplicate in advance;
>when you want exact information, you cannot do that. Furthermore, if
>you want exact information, you must store an LSN for every record; if
>you want approximate information, you emit a file for each LSN range
>and consider it sufficient to know that the change happened somewhere
>within the range of LSNs encompassed by that file.
>
Those are the extreme design options, yes. But I think there may be a
reasonable middle ground, that would allow using the modblock files for
both use cases.
>It's pretty clear in my mind that what I want to do here is provide
>approximate information, not exact information. Being able to sort
>and deduplicate in advance seems critical to be able to make something
>like this work on high-velocity systems.
Do you have any analysis / data to support that claim? I mean, it's
obvious that sorting and deduplicating the data right away makes
subsequent processing more efficient, but it's not clear to me that not
doing it would make it useless for high-velocity systems.
> If you are generating a
>terabyte of WAL between incremental backups, and you don't do any
>sorting or deduplication prior to the point when you actually try to
>generate the modified block map, you are going to need a whole lot of
>memory (and CPU time, though that's less critical, I think) to process
>all of that data. If you can read modblock files which are already
>sorted and deduplicated, you can generate results incrementally and
>send them to the client incrementally and you never really need more
>than some fixed amount of memory no matter how much data you are
>processing.
>
Sure, but that's not what I proposed elsewhere in this thread. My proposal
was to keep modblock files "raw" for WAL segments that were not recycled yet
(so roughly the last 3 checkpoints), and deduplicate them after that. So the
vast majority of the 1TB of WAL will already have deduplicated data.
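
For reference, the fixed-memory processing described in the quoted text can be sketched as a streaming k-way merge. This is only an illustration, not code from any patch: the in-memory representation of a modblock file as a sorted, deduplicated sequence of (relfilenode, block number) tuples and the function name are assumptions.

```python
import heapq
from itertools import groupby


def merge_modblock_streams(streams):
    """Yield each distinct block reference once, in sorted order.

    Each element of `streams` is assumed (hypothetically) to be an
    iterable of (relfilenode, block_number) tuples that is already
    sorted and deduplicated, as a processed modblock file would be.

    heapq.merge holds only one pending item per input stream, so memory
    use grows with the number of streams, not with the amount of data.
    """
    merged = heapq.merge(*streams)
    # Equal references coming from different files are adjacent after
    # the merge, so groupby collapses them into a single output item.
    for ref, _group in groupby(merged):
        yield ref


if __name__ == "__main__":
    file_a = [(16384, 1), (16384, 7), (16385, 2)]
    file_b = [(16384, 7), (16385, 2), (16385, 9)]
    for ref in merge_modblock_streams([file_a, file_b]):
        print(ref)
```

Because the result is a generator, block references can be sent to the client as they are produced, without materializing the whole map.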
Also, maybe we can do partial deduplication, in a way that would be useful
for prefetching. Say we only deduplicate 1MB windows - that would work at
least for cases that touch the same page frequently (say, by inserting to
the tail of an index, or so).
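
A minimal sketch of what such windowed deduplication might look like, assuming block references arrive in WAL order as (LSN, block id) pairs, with the LSN taken as a plain byte offset. The names and the record layout are made up for illustration only.

```python
WINDOW_BYTES = 1024 * 1024  # 1MB of WAL per deduplication window


def dedup_in_windows(refs, window_bytes=WINDOW_BYTES):
    """Drop repeated references to a block within one WAL window.

    `refs` is an iterable of (lsn, block_id) pairs in WAL order
    (a hypothetical layout). The first reference to a block in each
    window is kept, so the output is still in WAL order and remains
    usable for prefetching.
    """
    current_window = None
    seen = set()
    for lsn, block_id in refs:
        window = lsn // window_bytes
        if window != current_window:
            # Crossed into a new window: forget blocks seen earlier.
            current_window = window
            seen.clear()
        if block_id not in seen:
            seen.add(block_id)
            yield lsn, block_id


if __name__ == "__main__":
    # Repeated inserts into the same index tail page ("idx:42").
    refs = [(0, "idx:42"), (100, "idx:42"), (200, "heap:7"),
            (WINDOW_BYTES, "idx:42"), (WINDOW_BYTES + 50, "idx:42")]
    print(list(dedup_in_windows(refs)))
```

Memory is bounded by the number of distinct blocks touched within a single window, which is the point of limiting the deduplication to small windows.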
>While I'm convinced that this particular feature should provide
>approximate rather than exact information, the degree of approximation
>is up for debate, and maybe it's best to just make that configurable.
>Some applications might work best with small modblock files covering
>only ~16MB of WAL each, or even less, while others might prefer larger
>quanta, say 1GB or even more. For incremental backup, I believe that
>the quanta will depend on the system velocity. On a system that isn't
>very busy, fine-grained modblock files will make incremental backup
>more efficient. If each modblock file covers only 16MB of data, and
>the backup manages to start someplace in the middle of that 16MB, then
>you'll only be including 16MB or less of unnecessary block references
>in the backup so you won't incur much extra work. On the other hand,
>on a busy system, you probably do not want such a small quantum,
>because you will then end up with gazillions of modblock files and that
>will be hard to manage. It could also have performance problems,
>because merging data from a couple of hundred files is fine, but
>merging data from a couple of hundred thousand files is going to be
>inefficient. My experience hacking on and testing tuplesort.c a few
>years ago (with valuable tutelage by Peter Geoghegan) showed me that
>there is a slow drop-off in efficiency as the merge order increases --
>and in this case, at some point you will blow out the size of the OS
>file descriptor table and have to start opening and closing files
>every time you access a different one, and that will be unpleasant.
>Finally, deduplication will tend to be more effective across larger
>numbers of block references, at least on some access patterns.
>
I agree with those observations in general, but I don't think they prove
we have to deduplicate/sort the data.
FWIW no one cares about low-velocity systems. While raw modblock files
would not be an issue on them, it's also mostly uninteresting from the
prefetching perspective. It's the high-velocity systems that have lag.
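
For a rough sense of scale in the granularity discussion quoted above, here is the back-of-the-envelope arithmetic for the 1TB-of-WAL example, assuming exactly one modblock file per quantum of WAL (a simplification; the helper name is made up).

```python
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4


def modblock_file_count(wal_bytes, quantum_bytes):
    """Number of modblock files if each file covers `quantum_bytes` of WAL.

    Assumes one file per quantum, rounding up for a partial last quantum.
    """
    return -(-wal_bytes // quantum_bytes)  # ceiling division


if __name__ == "__main__":
    print(modblock_file_count(TB, 16 * MB))  # 65536
    print(modblock_file_count(TB, GB))       # 1024
    print(modblock_file_count(TB, MB))       # 1048576
```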
>So all of that is to say that if somebody wants modblock files each of
>which covers 1MB of WAL, I think that the same tools I'm proposing to
>build here for incremental backup could support that use case with
>just a configuration change. Moreover, the resulting files would
>still be usable by the incremental backup engine. So that's good: the
>same system can, at least to some extent, be reused for whatever other
>purposes people want to know about modified blocks.
+1 to configuration change, at least during the development phase. It'll
allow comfortable testing and benchmarking.
>On the other hand, the incremental backup engine will likely not cope
>smoothly with having hundreds of thousands or millions of modblock files
>shoved down its gullet, so if there is a dramatic difference in the
>granularity requirements of different consumers, another approach is
>likely indicated. Especially if some consumer wants to see block
>references in the exact order in which they appear in WAL, or wants to
>know the exact LSN of each reference, it's probably best to go for a
>different approach. For example, pg_waldump could grow a new option
>which spits out just the block references and in a format designed to be
>easily machine-parseable; or a hypothetical background worker that does
>prefetching for recovery could just contain its own copy of the
>xlogreader machinery.
>
Again, I don't think we have to keep the raw modblock files forever. Send
them to the archive, remove/deduplicate/sort them after we recycle the WAL
segment, or something like that. That way the incremental backups don't
need to deal with an excessive number of modblock files.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services