From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-22 04:21:22
Message-ID: CAFiTN-s5Mzr+9b86OtBqCsA74pBD3-FJAPWhrGOjR=qAGrZp0g@mail.gmail.com
Lists: pgsql-hackers
On Wed, Aug 21, 2019 at 9:04 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Wed, Aug 21, 2019 at 3:55 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > I have already attempted that part, and I feel it is not making the code
> > any simpler than what we have today. For packing it's fine, because I
> > can process all the members at once, pack them directly into one memory
> > chunk, and insert that into the buffer with a single call to
> > InsertUndoBytes, which will make the code simpler.
>
> OK...
>
> The bigger issue here is that we don't seem to be making very much
> progress toward improving the overall design. Several significant
> improvements have been suggested:
>
> 1. Thomas suggested quite some time ago that we should make sure that
> the transaction header is the first optional header.
I think this was already done two or three versions ago. So now we
update the transaction header directly by writing at that offset, and we
don't need staging for this.
> If we do that,
> then I'm not clear on why we even need this incremental unpacking
> stuff any more. The only reason for all of this machinery was so that
> we could find the transaction header at an unknown offset inside a
> complex record format; there is, if I understand correctly, no other
> case in which we want to incrementally decode a record.
> But if the
> transaction header is at a fixed offset, then there seems to be no
> need to even have incremental decoding at all. Or it can be very
> simple, with three stages: (1) we don't yet have enough bytes to
> figure out how big the record is; (2) we have enough bytes to figure
> out how big the record is and we have figured that out but we don't
> yet have all of those bytes; and (3) we have the whole record, we can
> decode the whole thing and we're done.
We cannot know the complete size of the record just by reading the fixed
header, because the payload is variable-length and the payload length is
stored in the payload header, which can itself sit at a variable offset.
But maybe we can still follow this idea, which would make unpacking far
simpler. I have a couple of ideas:
a) Store the payload header right after the transaction header, so that we
can easily compute the complete record size (see the sketch after this
list).
b) Once we have decoded the first header using uur_info, we can compute the
exact offset of the payload header and from there learn the record size.
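To make (a) concrete, here is a minimal sketch, assuming both the
transaction header and the payload header are present and directly follow
the fixed header (struct names and fields are invented for illustration,
not the actual patch code). With this layout the reader learns the total
record size as soon as those three headers are available, which collapses
unpacking to the three stages described above:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the real undo record headers. */
typedef struct UndoRecHeaderSketch
{
    uint8_t     urec_type;      /* AM-specific record type */
    uint8_t     urec_info;      /* which optional headers are present */
} UndoRecHeaderSketch;

typedef struct UndoRecXactHeaderSketch
{
    uint32_t    urec_xid;       /* inserting transaction */
    uint64_t    urec_next;      /* next transaction's undo location */
} UndoRecXactHeaderSketch;

typedef struct UndoRecPayloadHeaderSketch
{
    uint16_t    urec_payload_len;   /* AM-specific payload length */
    uint16_t    urec_tuple_len;     /* stored tuple length, if any */
} UndoRecPayloadHeaderSketch;

/* Bytes needed before the total record size becomes known (stage-1 cutoff). */
#define UNDO_SIZE_KNOWN_AT \
    (sizeof(UndoRecHeaderSketch) + sizeof(UndoRecXactHeaderSketch) + \
     sizeof(UndoRecPayloadHeaderSketch))

/* Returns 0 while we are still in stage 1, else the total record size. */
static size_t
undo_record_total_size(const char *buf, size_t avail)
{
    UndoRecPayloadHeaderSketch ph;

    if (avail < UNDO_SIZE_KNOWN_AT)
        return 0;
    memcpy(&ph,
           buf + sizeof(UndoRecHeaderSketch) + sizeof(UndoRecXactHeaderSketch),
           sizeof(ph));
    return UNDO_SIZE_KNOWN_AT + ph.urec_payload_len + ph.urec_tuple_len;
}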
>
> 2. Based on a discussion with Thomas, I suggested the GHOB stuff,
> which gets rid of the idea of transaction headers inside individual
> records altogether; instead, there is one undo blob per transaction
> (or maybe several if we overflow to another undo log) which begins
> with a sentinel byte that identifies it as owned by a transaction, and
> then the transaction header immediately follows that without being
> part of any record, and the records follow that data. As with the
> previous idea, this gets rid of the need for incremental decoding
> because it gets rid of the need to find the transaction header inside
> of a bigger record. As Thomas put it to me off-list, it puts the
> records inside of a larger chunk of space owned by the transaction
> instead of putting the transaction header inside of some particular
> record; that seems more correct than the way we have it now.
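Just so we are on the same page about the layout, here is a rough sketch of
what I understand this to mean (the sentinel value, names, and fields are
purely illustrative):

#include <stdint.h>
#include <string.h>

/*
 * Per-transaction undo chunk: one sentinel byte marking the chunk as
 * transaction-owned, then the transaction header written once outside of
 * any record, then the records themselves.
 *
 *   +----------+---------------------+----------+----------+----
 *   | sentinel | transaction header  | record 1 | record 2 | ...
 *   +----------+---------------------+----------+----------+----
 */
#define UNDO_CHUNK_XACT_SENTINEL 0x01

typedef struct XactUndoChunkHeaderSketch
{
    uint32_t    xid;            /* owning transaction */
    uint64_t    next_chunk;     /* overflow chunk in another undo log, or 0 */
} XactUndoChunkHeaderSketch;

/* Begin a transaction's chunk; returns the offset at which records start. */
static size_t
undo_chunk_begin(char *buf, const XactUndoChunkHeaderSketch *hdr)
{
    buf[0] = UNDO_CHUNK_XACT_SENTINEL;
    memcpy(buf + 1, hdr, sizeof(*hdr));
    return 1 + sizeof(*hdr);
}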
>
> 3. Heikki suggested that the format should be much more generic and
> that more should be left up to the AM. While neither Andres nor I are
> convinced that we should go as far in that direction as Heikki is
> proposing, the idea clearly has some merit, and we should probably be
> moving that way to some degree. For instance, the idea that we should
> store a block number and TID is a bit sketchy when you consider that a
> multi-insert operation really wants to store a TID list. The zheap
> tree has a ridiculous kludge to work around that problem; clearly we
> need something better. We've also mentioned that, in the future, we
> might want to support TIDs that are not 6 bytes, and that even just
> looking at what's currently under development, zedstore wants to treat
> TIDs as 48-bit opaque quantities, not a 4-byte block number and a
> 2-byte item pointer offset. So, there is clearly a need to go through
> the whole way we're doing this and rethink which parts are generic and
> which parts are AM-specific.
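To illustrate the direction (only a sketch with invented names, not a
proposal for the exact encoding), the core format could treat the record's
target as an opaque, length-prefixed blob; zheap could put a block/offset
pair (or a TID list for multi-insert) in it, and zedstore could put a
48-bit TID:

#include <stdint.h>
#include <string.h>

/* Pack an AM-defined target as a length-prefixed opaque blob. */
static size_t
undo_pack_target(char *dst, const void *am_target, uint16_t len)
{
    memcpy(dst, &len, sizeof(len));
    memcpy(dst + sizeof(len), am_target, len);
    return sizeof(len) + len;
}

/* Example: a zedstore-style 48-bit TID stored as six opaque bytes. */
static size_t
pack_48bit_tid(char *dst, uint64_t tid48)
{
    uint8_t     bytes[6];
    int         i;

    for (i = 0; i < 6; i++)
        bytes[i] = (uint8_t) ((tid48 >> (8 * i)) & 0xFF);
    return undo_pack_target(dst, bytes, sizeof(bytes));
}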
>
> 4. A related problem, which has been mentioned or at least alluded to
> by both Heikki and by me, is that we need a better way of handling the
> AM-specific data. Right now, the zheap code packs fixed-size things
> into the payload data and then finds them by knowing the offset where
> any particular data is within that field, but that's an unmaintainable
> mess. The zheap code could be improved by at least defining those
> offsets as constants someplace and adding some comments explaining the
> payload formats of various undo records, but even if we do that, it's
> not going to generalize very well to anything more complicated than a
> few fixed-size bits of data. I suggested using the pqformat stuff to
> try to structure that -- a suggestion to which Heikki has
> unfortunately not responded, because I'd really like to get his
> thoughts on it -- but whether we do that particular thing or not, I
> think we need to do something. In addition to wanting a better way of
> handling packing and unpacking for payload data, there's also a desire
> to have it participate in record compression, for which we don't seem
> to have any kind of plan.
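For what it's worth, my reading of the pqformat suggestion is roughly the
sketch below (the payload fields here are invented for illustration and are
not the actual zheap record contents): the AM serializes its payload
through a StringInfo with the pq_send* functions and reads it back with the
pq_getmsg* functions, instead of relying on hand-computed byte offsets:

#include "postgres.h"
#include "lib/stringinfo.h"
#include "libpq/pqformat.h"

/* Pack an invented zheap-style delete payload. */
static void
zheap_pack_delete_payload(StringInfo buf, TransactionId prev_xid,
                          uint16 prev_infomask, const char *old_tuple,
                          int old_tuple_len)
{
    pq_sendint32(buf, prev_xid);
    pq_sendint16(buf, prev_infomask);
    pq_sendint32(buf, old_tuple_len);
    pq_sendbytes(buf, old_tuple, old_tuple_len);
}

/* Unpack it again from the raw payload bytes of an undo record. */
static void
zheap_unpack_delete_payload(const char *payload, int payload_len)
{
    StringInfoData msg;
    TransactionId prev_xid;
    uint16      prev_infomask;
    int         old_tuple_len;
    const char *old_tuple;

    initStringInfo(&msg);
    appendBinaryStringInfo(&msg, payload, payload_len);

    prev_xid = pq_getmsgint(&msg, 4);
    prev_infomask = pq_getmsgint(&msg, 2);
    old_tuple_len = pq_getmsgint(&msg, 4);
    old_tuple = pq_getmsgbytes(&msg, old_tuple_len);

    /* ... apply the undo using these fields ... */
    (void) prev_xid;
    (void) prev_infomask;
    (void) old_tuple;

    pfree(msg.data);
}

This is only meant to show the shape of the idea; it does not address the
record-compression question.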
>
> 5. Andres suggested multiple ideas for cleaning up and improving this
> code in https://www.postgresql.org/message-id/20190814065745.2faw3hirvfhbrdwe%40alap3.anarazel.de
> - which include the idea currently under discussion, several of the
> same ideas that I mentioned above, and a number of other things, such
> as making packing serialize to a char * rather than some ad-hoc
> intermediate format
I have implemented this part; I will post it along with the other changes.
> and having a metadata array over which we can loop
> rather than having multiple places where there's a separate bit of
> code for every field type. I don't think those suggestions are
> entirely unproblematic; for instance, the metadata array would
> probably work a lot better if we moved the transaction and log-switch
> headers outside of individual records as suggested in (2) above.
> Otherwise, the metadata would have to include not only data-structure
> offsets but some kind of a flag indicating which of several data
> structures ought to contain the relevant information, which would make
> the whole thing a lot messier. And depending on what we do about (4),
> this might become moot or the details might change quite a bit,
> because if we no longer have a long list of "generic" fields, then we
> also won't have a bunch of places in the code that deal with that long
> list of generic fields, which means the metadata array might not be
> necessary, or might be simpler or smaller or designed differently.
> All of which is to make the point that responding to Andres's feedback
> will require a bunch of decisions about which parts of the feedback to
> take (because some of them are mutually exclusive, as he acknowledges
> himself) and what to do about them (because some of them are vague);
> yet, taken together, they seem to amount to the need for significant
> design changes, as do (1)-(4).
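On the metadata-array point specifically, I understand the suggestion to be
roughly along these lines (again only a sketch, with invented names and
info bits): a table describing each optional field's uur_info bit, struct
offset, and size, so that a single loop replaces the per-field packing
code:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct UnpackedUndoRecordSketch
{
    uint32_t    uur_xid;        /* inserting transaction */
    uint32_t    uur_cid;        /* command id */
    uint32_t    uur_fork;       /* relation fork */
    uint64_t    uur_prevundo;   /* previous undo for the same block */
} UnpackedUndoRecordSketch;

typedef struct UndoFieldMeta
{
    uint8_t     info_bit;       /* field is present iff (uur_info & info_bit) */
    size_t      offset;         /* offsetof() the field in the unpacked struct */
    size_t      size;           /* on-disk size of the field */
} UndoFieldMeta;

static const UndoFieldMeta undo_field_meta[] = {
    {0x01, offsetof(UnpackedUndoRecordSketch, uur_xid), sizeof(uint32_t)},
    {0x02, offsetof(UnpackedUndoRecordSketch, uur_cid), sizeof(uint32_t)},
    {0x04, offsetof(UnpackedUndoRecordSketch, uur_fork), sizeof(uint32_t)},
    {0x08, offsetof(UnpackedUndoRecordSketch, uur_prevundo), sizeof(uint64_t)},
};

/* Serialize every present optional field into one flat buffer in one loop. */
static size_t
undo_pack_fields(char *dst, const UnpackedUndoRecordSketch *rec,
                 uint8_t uur_info)
{
    size_t      len = 0;
    size_t      i;

    for (i = 0; i < sizeof(undo_field_meta) / sizeof(undo_field_meta[0]); i++)
    {
        if (uur_info & undo_field_meta[i].info_bit)
        {
            memcpy(dst + len, (const char *) rec + undo_field_meta[i].offset,
                   undo_field_meta[i].size);
            len += undo_field_meta[i].size;
        }
    }
    return len;
}

And yes, if the transaction and log-switch headers move outside of
individual records as in (2), the table only has to describe the per-record
fields, which avoids the flag-per-data-structure problem you mention.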
>
> Now, just to be clear, the code we're talking about here is mostly
> based on an original design by me, and whatever defects were present
> in that original design are nobody's fault but mine. And that list of
> defects includes pretty much everything in the above list. But, what
> we need to figure out at this point is how we're going to get those
> things fixed, and it seems to me that we're going to need a pretty
> substantial redesign, but this discussion is kind of down in the
> weeds. I mean, what are we gaining by arguing about how many stages
> we need for incremental unpacking if the real thing we need to is get
> rid of that concept altogether?
Actually, in my local changes I have already gotten rid of the multiple
stages, because I am packing all the fields into one char * as suggested in
the first part of (5). I had a problem while unpacking, because we don't
know the complete size of the record beforehand, especially because of the
payload data; I have suggested a couple of options above, under (1), for
handling the payload size.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com