Re: Enhance file_fdw to report processed and skipped tuples in COPY progress

From: Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enhance file_fdw to report processed and skipped tuples in COPY progress
Date: 2024-10-11 06:36:45
Message-ID: 20241011153645.a348de1576a3f57092c68355@sraoss.co.jp
Lists: pgsql-hackers

On Fri, 11 Oct 2024 10:53:10 +0900
Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:

>
>
> On 2024/10/04 2:12, Masahiko Sawada wrote:
> > Hi,
> >
> > On Thu, Oct 3, 2024 at 2:23 AM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> >>
> >> Hi,
> >>
> >> Currently, file_fdw updates several columns in the pg_stat_progress_copy view,
> >> like relid and bytes_processed, but it doesn't track tuples_processed or
> >> tuples_skipped. Monitoring these would be particularly useful when handling
> >> large data sets via file_fdw, as it helps track the progress of scan.
> >>
> >> The attached patch updates file_fdw to add support for reporting
> >> the number of tuples processed and skipped (due to on_error = 'ignore')
> >> in the pg_stat_progress_copy view. What are your thoughts?
> >
> > While the patch works fine and looks good to me, it seems to me that
> > file_fdw's use of COPY progress reporting doesn't work properly in the
> > first place. For example, unlike the COPY command, a query can have
> > multiple scans on one or more file_fdw foreign tables when joining
> > tables. I found a discussion about that[1]: there
> > was a proposal of disabling COPY progress for file_fdw but the votes
> > are split. I think it would be better to consider if we really want to
> > support COPY progress for file_fdw before supporting more progress
> > information.
>
> Yes, you're right. First, we need to address how to handle multiple
> commands that trigger progress reporting when executed concurrently.
>
> The current progress reporting mechanism assumes only one command
> triggering progress is running at a time, as each backend has just
> one memory area for progress reporting. If multiple commands run simultaneously,
> the progress data would be incorrect. As you mentioned, this could happen
> when querying multiple file_fdw foreign tables, where multiple COPY commands
> could execute concurrently.
>
> However, this issue already exists without the proposed patch.
> Since file_fdw already reports progress partially, querying multiple
> file_fdw tables can lead to inaccurate or confusing progress reports.
> You can even observe this when analyzing a file_fdw table and also
> when copying to the table with a trigger that executes progress-reporting
> commands.
>
> So, I don’t think this issue should block the proposed patch.
> In fact, progress reporting is already flawed in these scenarios,
> regardless of whether the patch is applied.
>
> On the other hand, in many cases where a single file_fdw table is scanned,
> COPY progress reporting works correctly for file_fdw and is useful.
> Therefore, I believe it's still worth improving file_fdw’s progress reporting.

I think additionally reporting the tuples_processed and tuples_skipped columns
in file_fdw is reasonable, since it already reports bytes_processed and bytes_total.

By the way, the file_fdw documentation does not explicitly say that
file_fdw uses COPY internally, although it contains several phrasings like "as COPY".
To keep users from being surprised, how about explaining explicitly that
file_fdw uses COPY and updates pg_stat_progress_copy?

> To prevent misleading reports when multiple commands are run concurrently,
> just an idea: we could consider displaying NULL columns in the progress report
> if this situation is detected, as a separate patch.

Or could we add an additional field to pg_stat_progress_copy to
show how many commands are currently running COPY? This is also just an idea.
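
As a rough illustration of both ideas, here is a toy Python model. It is not
PostgreSQL code: the class ProgressSlot, the commands_running field, and the
relids 100 and 200 are all made up for this sketch. It only assumes what was
described above, namely that a backend has a single progress area that every
progress-reporting command writes into.

```python
class ProgressSlot:
    """Toy stand-in for a backend's single progress area (not PostgreSQL code)."""

    def __init__(self):
        self.relid = None
        self.tuples_processed = None
        self.commands_running = 0  # hypothetical extra field

    def start(self, relid):
        # A newly started command overwrites whatever the slot held before.
        self.relid = relid
        self.tuples_processed = 0
        self.commands_running += 1

    def update(self, tuples_processed):
        # Each command writes its own counter; the slot cannot tell them apart.
        self.tuples_processed = tuples_processed

    def end(self):
        self.commands_running -= 1

    def view(self):
        """Row that a pg_stat_progress_copy-like view might show in this model."""
        if self.commands_running == 0:
            return None  # no row at all
        if self.commands_running > 1:
            # Overlap detected: show NULLs instead of mixed-up values.
            return {"relid": None, "tuples_processed": None,
                    "commands_running": self.commands_running}
        return {"relid": self.relid, "tuples_processed": self.tuples_processed,
                "commands_running": 1}


slot = ProgressSlot()
slot.start(relid=100)   # scan A on one foreign table
slot.update(5)
single = slot.view()

slot.start(relid=200)   # scan B starts in the same backend
slot.update(6)          # scan A reports its sixth tuple
raw = (slot.relid, slot.tuples_processed)  # B's relid paired with A's count
overlapped = slot.view()

print(single)
print(raw)
print(overlapped)
```

With a single scan the row is meaningful. Once the second scan starts, the raw
slot pairs one scan's relid with the other scan's count, and the view reports
NULLs together with the number of running commands. The sketch only covers
detection; what the slot should show after one of the overlapping commands
ends is left open.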

Regards,
Yugo Nagata

--
Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>
