Re: [External] Re: FDW, too long to run explain

From: Vijaykumar Jain <vjain(at)opentable(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: [External] Re: FDW, too long to run explain
Date: 2019-02-17 18:51:42
Message-ID: CAE7uO5igJ-Vw3tn8eoQ11yPrpy6T_1J_a+=EbvMQDhQpqu4nyQ@mail.gmail.com
Lists: pgsql-general

Taking your questions as 1, 2, and 3, please find my answers below.

1) The "explain" commands on the foreign servers show up as "idle in
transaction". Because they were running very long (on the order of minutes),
the pgbouncer settings (transaction-level pooling) kill them once the
idle-in-transaction time limit of 5 minutes is exceeded; otherwise too many
connections pile up.
2) Yes, I get those from pg_stat_activity. It truncates the full statement,
but it shows up as * EXPLAIN select col1, col2 .... * 00:00:44 | idle in
transaction (this is just one of the screenshots I have), on the foreign
side.
3) Yes, I think we kind of understood that part (fetch and memory), but I am
not sure whether it is also used as a hint in plan generation. I am sorry, I
did not turn auto_explain on on the foreign servers, as that required a
restart of the server.
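For context, enabling auto_explain on the foreign side would look roughly
like this (a sketch of postgresql.conf settings with illustrative values;
shared_preload_libraries is why the restart was needed):

```
# postgresql.conf on the foreign server (illustrative values)
shared_preload_libraries = 'auto_explain'   # changing this requires a restart
auto_explain.log_min_duration = '1s'        # log plans of statements slower than 1s
auto_explain.log_analyze = on               # include actual run times in logged plans
```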

(This is the real content of the screenshot, yes, 13 minutes, with the
column names and view name masked.)
20678 | 00:13:38.990025 | EXPLAIN SELECT cols from view | idle in
transaction

The EXPLAIN ANALYZE of the same query, run directly on the foreign server, completes in milliseconds.

I am sorry I am vague about the queries in this email. I cannot reproduce
the issue, as we do not have multiple 500 GB shards in my QA environment,
and I cannot take a dump of prod to test in our test environment because of
GDPR :)
But as I said in the mail, we were speculating that since the LIMIT was not
pushed down, the plans may have been bad. We tricked the foreign server by
using a sample column to fake limit pushdown, and now have improved response
times. We made vacuum/analyze very aggressive to ensure stats are never
stale after large updates or deletes.
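As a sketch of what that trick looks like (the column and table names here
are made up for illustration): since postgres_fdw does not push LIMIT down,
we added a pushable WHERE predicate so the foreign side restricts rows
itself, and set aggressive per-table autovacuum thresholds:

```sql
-- Instead of: SELECT * FROM sh01.view1 LIMIT 10000;
-- use a pushable predicate on a (hypothetical) sample column so the
-- foreign server filters rows before sending them back:
SELECT * FROM sh01.view1 WHERE sample_id < 10000;

-- Aggressive per-table autovacuum/analyze settings (illustrative thresholds):
ALTER TABLE big_table SET (
    autovacuum_vacuum_scale_factor  = 0.01,
    autovacuum_analyze_scale_factor = 0.005
);
```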

Unless someone else can reproduce this, I guess I'll close this mail. (I'll
try to reproduce it myself again, but for now I have too little data to
share to convince anyone that it happened.)

Regards,
Vijay

On Sun, Feb 17, 2019 at 11:11 PM Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Mon, Feb 4, 2019 at 2:15 AM Vijaykumar Jain <vjain(at)opentable(dot)com>
> wrote:
>
>>
>> now we have some long running queries via FDW that take minutes and get
>> killed explain runs as idle in transaction on remote servers.
>>
>
> Are you saying the EXPLAIN itself gets killed, or execution of the plan
> generated based on the EXPLAIN (issued under use_remote_estimate = true)
> gets killed? Who is doing the killing, the local side or the foreign
> side? Can you include verbatim log entries for this?
>

The explain commands on the foreign servers show up as "idle in
transaction". Because they were running very long (on the order of minutes),
the pgbouncer settings kill them once the idle-in-transaction time limit of
5 minutes is exceeded; otherwise too many connections pile up.
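The relevant pgbouncer knob would be something like this (a sketch of an
ini-style pgbouncer.ini; the 300-second value matches the 5-minute limit
described above):

```ini
; pgbouncer.ini (illustrative)
pool_mode = transaction
idle_transaction_timeout = 300   ; seconds; drops connections idle in transaction
```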

>
>> now the explain for
>> select * from sh01.view1 keeps running for minutes sometimes,
>>
>> then fetch too keeps running for minutes, although the total rows are <
>> 10000 maybe.
>> idle in transaction | FETCH 10000 FROM c1
>>
>
> What is this? Is it from some monitoring tool, or pg_stat_activity, or
> what? And is it on the local side or the foreign side?
>
Yes, pg_stat_activity. It truncates the full statement, but it shows up as
* EXPLAIN select col1, col2 .... * 00:00:44 | idle in transaction (this is
just one of the screenshots I have), on the foreign side.

>
>
>> other questions:
>> also, what is the cost of fetch_size?
>>
>
> It will always fetch rows from the foreign server in this sized chunks. A
> larger fetch_size will have less network latency and computational overhead
> if many rows are going to be consumed, but also consume more memory on the
> local server as all rows are stored in memory per each chunk. Also, in the
> case of a LIMIT, it reads a large number of rows even if most of them may
> be unneeded. Conceptually, the LIMIT could be used to modify the FETCH
> downward to match the LIMIT, but that is not implemented. In the case of a
> view over UNION ALL, I don't think the individual subqueries even know what
> the global LIMIT is.
>
Yep, I guess that is where I think the plan may have gone bad.
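For what it's worth, fetch_size can be tuned per foreign server or per
foreign table in postgres_fdw (the server and table names here are
hypothetical):

```sql
ALTER SERVER sh01 OPTIONS (ADD fetch_size '1000');
ALTER FOREIGN TABLE sh01.view1 OPTIONS (ADD fetch_size '500');
```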

>
>
>>
>> I mean given we have a query
>>
>> select * from foobar limit 10000; via FDW
>> limit 10000 does not get pushed.
>> so it seems all rows come to the FDW node and then the limit is applied?
>>
>
> It should not read all rows. It should read as many multiples of
> fetch_size as needed, which should just be 1 multiple in this case.
>
Yep, I think we kind of understood that part, but I am not sure whether it
is also used to generate the plan. I am sorry, I did not turn auto_explain
on on the foreign servers, as that required a restart of the server.

>
> Cheers,
>
> Jeff
>
>>
