Quick Links

Re: Pluggable Storage - Andres's take

From:	Ashwin Agrawal <aagrawal(at)pivotal(dot)io>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Asim R P <apraveen(at)pivotal(dot)io>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject:	Re: Pluggable Storage - Andres's take
Date:	2019-05-16 06:00:38
Message-ID:	CALfoeiu+69FZWDCiH_qw8rBTNBKAuy3ttGkTotmWgx1-L0Kb6w@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, May 15, 2019 at 11:54 AM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2019-04-25 15:43:15 -0700, Andres Freund wrote:
>
> > > 3) nodeTidscan, skipping over too large tids
> > > I think this should just be moved into the AMs, there's no need to
> > > have this in nodeTidscan.c
> >
> > I think here it's *not* actually correct at all to use the relation
> > size. It's currently doing:
> >
> > /*
> > * We silently discard any TIDs that are out of range at the time
> of scan
> > * start. (Since we hold at least AccessShareLock on the table,
> it won't
> > * be possible for someone to truncate away the blocks we intend to
> > * visit.)
> > */
> > nblocks =
> RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
> >
> > which is fine (except for a certain abstraction leakage) for an AM like
> > heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> > Heikki's approach where tid isn't tied to physical representation.
> >
> >
> > The obvious answer would be to just move that check into the
> > table_fetch_row_version implementation (currently just calling
> > heap_fetch()) - but that doesn't seem OK from a performance POV, because
> > we'd then determine the relation size once for each tid, rather than
> > once per tidscan. And it'd also check in cases where we know the tid is
> > supposed to be valid (e.g. fetching trigger tuples and such).
> >
> > The proper fix seems to be to introduce a new scan variant
> > (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> > a scan as a parameter. But it seems we'd have to introduce that as a
> > separate tableam callback, because we'd not want to incur the overhead
> > of creating an additional scan / RelationGetNumberOfBlocks() checks for
> > triggers et al.
>
> Attached is a prototype of a variation of this. I added a
> table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
> callback / wrapper. Currently it just takes a "plain" scan, but we could
> add a separate table_beginscan variant too.
>
> For heap that just means we can just use HeapScanDesc's rs_nblock to
> filter out invalid tids, and we only need to call
> RelationGetNumberOfBlocks() once, rather than every
> table_tuple_tid_valid(0 / table_get_latest_tid() call. Which is a good
> improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
> CURRENT OF) - which previously computed the relation size once per
> tuple.
>
> Needs a bit of polishing, but I think this is the right direction?
>

Highlevel this looks good to me. Will look into full details tomorrow. This
alligns with the high level thought I made but implemented in much better
way, to consult with the AM to perform the optimization or not. So, now
using the new callback table_tuple_tid_valid() AM either can implement some
way to perform the validation for TID to optimize the scan, or if has no
way to check based on scan descriptor then can decide to always pass true
and let table_fetch_row_version() handle the things.

In response to

Re: Pluggable Storage - Andres's take at 2019-05-15 18:54:47 from Andres Freund

Responses

Re: Pluggable Storage - Andres's take at 2019-05-17 19:54:19 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Michael Paquier	2019-05-16 06:23:40	Re: New EXPLAIN option: ALL
Previous Message	Ashwin Agrawal	2019-05-16 05:32:34	Re: Adding a test for speculative insert abort case