Re: logical changeset generation v6.2

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical changeset generation v6.2
Date: 2013-10-24 14:59:21
Message-ID: CA+TgmoZvuZEqtMpEemL9C2mGYB=rYZ2TRnONZiOToRF14V4NTw@mail.gmail.com

On Tue, Oct 22, 2013 at 2:13 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-10-22 13:57:53 -0400, Robert Haas wrote:
>> On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> >> That strikes me as a flaw in the implementation rather than the idea.
>> >> You're presupposing a patch where the necessary information is
>> >> available in WAL yet you don't make use of it at the proper time.
>> >
>> > The problem is that the mapping would be somewhere *ahead* from the
>> > transaction/WAL we're currently decoding. We'd need to read ahead till
>> > we find the correct one.
>>
>> Yes, I think that's what you need to do.
>
> My problem with that is that rewrite can be gigabytes into the future.
>
> When reading forward we could either just continue reading data into the
> reorderbuffer while delaying replay of all future commits till we find
> the currently needed remap. That might have quite the additional
> storage/memory cost, but runtime complexity should be the same as normal
> decoding.
> Or we could individually read ahead for every transaction. But doing so
> for every transaction will get rather expensive (roughly
> O(amount_of_wal^2)).

[ Sorry it's taken me a bit of time to get back to this; other tasks
intervened, and I also just needed some time to let it settle in my
brain. ]

If you read ahead looking for a set of ctid translations from
relfilenode A to relfilenode B, and along the way you happen to
encounter a set of translations from relfilenode C to relfilenode D,
you could stash that set of translations away somewhere, so that if
the next transaction you process needs that set of mappings, it's
already computed. With that approach, you'd never have to pre-read
the same set of WAL files more than once.
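
Concretely, I'm imagining something like this (just a sketch; all of the
names here are made up for illustration, none of this is from your
patch):

/*
 * Minimal sketch of the read-ahead cache idea; every name here is
 * invented for illustration.
 */
#include <stddef.h>
#include <stdint.h>

typedef uint32_t RelFileNodeId;

typedef struct TranslationSet
{
    RelFileNodeId src;          /* relfilenode before the rewrite */
    RelFileNodeId dst;          /* relfilenode after the rewrite  */
    /* the ctid old->new pairs would hang off here */
} TranslationSet;

#define MAX_CACHED 64

static TranslationSet *cache[MAX_CACHED];
static size_t ncached = 0;

/* Called for *every* map encountered while reading ahead, not just
 * the one we were originally looking for. */
static void
stash_translation_set(TranslationSet *set)
{
    if (ncached < MAX_CACHED)
        cache[ncached++] = set;
}

/* Later transactions probe the cache first; only on a miss do we go
 * back to scanning WAL forward. */
static TranslationSet *
lookup_translation_set(RelFileNodeId src, RelFileNodeId dst)
{
    for (size_t i = 0; i < ncached; i++)
        if (cache[i]->src == src && cache[i]->dst == dst)
            return cache[i];
    return NULL;
}

A dumb linear scan is fine for a sketch; the point is just that each map
gets stashed once and is found cheaply thereafter.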

But, as I think about it more, that's not very different from your
idea of stashing the translations someplace other than WAL in the
first place. I mean, if the read-ahead thread generates a series of
files in pg_somethingorother that contain those maps, you could have
just written the maps to that directory in the first place. So on
further review I think we could adopt that approach.

However, I'm leery about the idea of using a relation fork for this.
I'm not sure whether that's what you had in mind, but it gives me the
willies. First, it adds distributed overhead to the system, as
previously discussed; and second, I think the accounting may be kind
of tricky, especially in the face of multiple rewrites. I'd be more
inclined to find a separate place to store the mappings. Note that,
AFAICS, there's no real need for the mapping files to be
block-structured, and I believe they'll be written first (with no
readers) and subsequently only read (with no further writes) and
eventually deleted.
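
Given that write-once, read-later, then-delete lifecycle, something as
simple as a flat stream of fixed-size records would do. A rough sketch,
with invented names rather than a concrete proposal:

/*
 * Hypothetical record layout for a mapping file. Because the file is
 * written completely before anyone reads it, and never written again
 * once readers exist, no page structure or locking is needed.
 */
#include <stdio.h>
#include <stdint.h>

typedef struct MappingRecord
{
    uint32_t old_block;         /* old ctid: block number  */
    uint16_t old_offset;        /* old ctid: line pointer  */
    uint32_t new_block;         /* new ctid: block number  */
    uint16_t new_offset;        /* new ctid: line pointer  */
} MappingRecord;

static int
write_mapping_file(const char *path, const MappingRecord *recs, size_t n)
{
    FILE   *f = fopen(path, "wb");
    size_t  written;

    if (f == NULL)
        return -1;
    written = fwrite(recs, sizeof(MappingRecord), n, f);
    /* A real implementation would also need to fsync the file (and
     * its directory) before the rewrite is allowed to commit. */
    fclose(f);
    return (written == n) ? 0 : -1;
}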

One possible objection to this is that it would preclude decoding on a
standby, which seems like a likely enough thing to want to do. So
maybe it's best to WAL-log the changes to the mapping file so that the
standby can reconstruct it if needed.
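
On the redo side, that could amount to little more than replaying
appends. Another rough sketch (the record layout and names are
invented; this is not the real rmgr interface):

/*
 * If every append to a mapping file is WAL-logged as (path, length,
 * payload), the standby can reconstruct the file by replaying the
 * records in order.
 */
#include <stdio.h>
#include <stdint.h>

typedef struct MappingXlogRec
{
    char     path[64];          /* mapping file being extended */
    uint32_t len;               /* number of payload bytes     */
    /* payload bytes follow the header in the WAL record */
} MappingXlogRec;

static int
redo_mapping_append(const MappingXlogRec *rec, const char *payload)
{
    FILE   *f = fopen(rec->path, "ab"); /* create or extend */
    size_t  n;

    if (f == NULL)
        return -1;
    n = fwrite(payload, 1, rec->len, f);
    fclose(f);
    return (n == rec->len) ? 0 : -1;
}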

> I think that'd be pretty similar to just disallowing VACUUM
> FREEZE/CLUSTER on catalog relations since effectively it'd be too
> expensive to use.

This seems unduly pessimistic to me; unless the catalogs are really
darn big, this is a mostly theoretical problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
