Quick Links

Re: Asynchronous Append on postgres_fdw nodes.

From:	Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To:	etsuro(dot)fujita(at)gmail(dot)com
Cc:	pryzby(at)telsasoft(dot)com, a(dot)lepikhov(at)postgrespro(dot)ru, movead(dot)li(at)highgo(dot)ca, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Asynchronous Append on postgres_fdw nodes.
Date:	2020-12-14 07:01:15
Message-ID:	20201214.160115.2122208463905961084.horikyota.ntt@gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

At Sat, 12 Dec 2020 18:25:57 +0900, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote in
> On Fri, Nov 20, 2020 at 3:51 PM Kyotaro Horiguchi
> <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > At Tue, 17 Nov 2020 18:56:02 +0900, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote in
> > > * In Robert's patch [1] (and Horiguchi-san's, which was created based
> > > on Robert's), ExecAppend() was modified to retrieve tuples from
> > > async-aware children *before* the tuples will be needed, but I don't
> >
> > The "retrieve" means the move of a tuple from fdw to executor
> > (ExecAppend or ExecAsync) layer?
>
> Yes, that's what I mean.
>
> > > think that's really a good idea, because the query might complete
> > > before returning the tuples. So I modified that function so that a
> >
> > I'm not sure how it matters. Anyway the fdw holds up to tens of tuples
> > before the executor actually make requests for them. The reason for
> > the early fetching is letting fdw send the next request as early as
> > possible. (However, I didn't measure the effect of the
> > nodeAppend-level prefetching.)
>
> I agree that that would lead to an improved efficiency in some cases,
> but I still think that that would be useless in some other cases like
> SELECT * FROM sharded_table LIMIT 1. Also, I think the situation
> would get worse if we support Append on top of joins or aggregates
> over ForeignScans, which would be more expensive to perform than these
> ForeignScans.

I'm not sure which gain we weigh on, but if doing "LIMIT 1" on Append
for many times is more common than fetching all or "LIMIT <many
multiples of fetch_size>", that discussion would be convincing... Is
it really the case?

Since core knows of async execution, I think if we disable async
exection, it should be decided by planner, which knows how many tuples
are expected to be returned. On the other hand the most apparent
criteria for whether to enable async or not would be fetch_size, which
is fdw's secret. Thus we could rename ForeignPathAsyncCapable() to
something like ForeignPathRunAsync(), true from which means "the FDW
is telling that it can run async and is thinking that the given number
of tuples will be fetched at once.".

> If we do prefetching, I think it would be better that it’s the
> responsibility of the FDW to do prefetching, and I think that that
> could be done by letting the FDW to start another data fetch,
> independently of the core, in the ForeignAsyncNotify callback routine,

FDW does prefetching (if it means sending request to remote) in my
patch, so I agree to that. It suspect that you were intended to say
the opposite. The core (ExecAppendAsyncGetNext()) controls
prefetching in your patch.

> which I revived from Robert's original patch. I think that that would
> be more efficient, because the FDW would no longer need to wait until
> all buffered tuples are returned to the core. In the WIP patch, I

I don't understand. My patch sends a prefetch-query as soon as all the
tuples of the last remote-request is stored into FDW storage. The
reason for removing ExecAsyncNotify() was it is just redundant as far
as concerning Append asynchrony. But I particulary oppose to revive
the function.

> only allowed the callback routine to put the corresponding ForeignScan
> node into a state where it’s either ready for a new request or needing
> a callback for another data fetch, but I think we could probably relax
> the restriction so that the ForeignScan node can be put into another
> state where it’s ready for a new request while needing a callback for
> the prefetch.

I don't understand this, too. ExecAsyncNotify() doesn't touch any of
the bitmaps, as_needrequest, callback_pending nor as_asyncpending in
your patch. Am I looking into something wrong? I'm looking
async-wip-2020-11-17.patch.

(By the way, it is one of those that make the code hard to read to me
that the "callback" means "calling an API function". I think none of
them (ExecAsyncBegin, ExecAsyncRequest, ExecAsyncNotify) are a
"callback".)

> > > tuple is retrieved from an async-aware child *when* it is needed, like
> > > Thomas' patch. I used FDW callback functions proposed by Robert, but
> > > introduced another FDW callback function ForeignAsyncBegin() for each
> > > async-aware child to start an asynchronous data fetch at the first
> > > call to ExecAppend() after ExecInitAppend() or ExecReScanAppend().
> >
> > Even though the terminology is not officially determined, in the past
> > discussions "async-aware" meant "can handle async-capable subnodes"
> > and "async-capable" is used as "can run asynchronously".
>
> Thanks for the explanation!
>
> > Likewise you
> > seem to have changed the meaning of as_needrequest from "subnodes that
> > needs to request for the next tuple" to "subnodes that already have
> > got query-send request and waiting for the result to come".
>
> No. I think I might slightly change the original definition of
> as_needrequest, though.

Mmm, sorry. I may have been perplexed by the comment below, which is
also added to ExecAsyncNotify().

ExecAppendAsyncRequest:
> Assert(bms_is_member(i, node->as_needrequest));
>
> /* Perform the actual callback. */
> ExecAsyncRequest(areq);
> if (ExecAppendAsyncResponse(areq))
> {
> Assert(!TupIsNull(areq->result));
> *result = areq->result;
> return true;
> }

> > I would
> > argue to use the words and variables (names) in such meanings.
>
> I think the word "aware" has a broader meaning, so the naming as
> proposed would be OK IMO. But actually, I don't have any strong
> opinion about that, so I'll change it as explained.

Thanks.

> > > * For EvalPlanQual, I modified the patch so that async-aware children
> > > are treated as if they were synchronous when executing EvalPlanQual.
> >
> > Doesn't async execution accelerate the epq-fetching? Or does
> > async-execution goes into trouble in the EPQ path?
>
> The reason why I disabled async execution when executing EPQ is to
> avoid sending asynchronous queries to the remote sides, which would be
> useless, because scan tuples for an EPQ recheck are obtained in a
> dedicated way.

If EPQ is performed onto Append, I think it should gain from
asynchronous execution since it is going to fetch *a* tuple from
several partitions or children. I believe EPQ doesn't contain Append
in major cases, though. (Or I didn't come up with the steps for the
case to happen...)

> > > * In Robert's patch, all async-aware children below Append nodes in
> > > the query waiting for events to occur were managed by a single EState,
> > > but I modified the patch so that such children are managed by each
> > > Append node, like Horiguchi-san's patch and Thomas'.
> >
> > Managing in Estate give advantage for push-up style executor but
> > managing in node_state is simpler.
>
> What do you mean by "push-up style executor"?

The reverse of the volcano-style executor, which enters from the
topmost node and down to the bottom. In the "push-up stule executor",
the bottom-most nodes fires by a certain trigger then every
intermediate nodes throws up the result to the parent until reaching
the topmost node.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Re: Asynchronous Append on postgres_fdw nodes. at 2020-12-12 09:25:57 from Etsuro Fujita

Responses

Re: Asynchronous Append on postgres_fdw nodes. at 2020-12-19 08:55:22 from Etsuro Fujita

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Michael Paquier	2020-12-14 07:45:08	Re: REINDEX backend filtering
Previous Message	Joel Mariadasan (jomariad)	2020-12-14 06:58:04	RE: pg_ctl.exe file deleted automatically