Re: Incremental View Maintenance, take 2

From: Kirill Reshke <reshkekirill(at)gmail(dot)com>
To: Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Incremental View Maintenance, take 2
Date: 2024-08-19 21:14:08
Message-ID: CALdSSPiBBBCHxVtg+X6OdBkJPGYOvLf1hST4MgBgRKZh0Xddyw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 30 Jul 2024 at 10:24, Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp> wrote:
>
> Hi,
>
> On Tue, 30 Jul 2024 03:32:19 +0500
> Kirill Reshke <reshkekirill(at)gmail(dot)com> wrote:
>
> > On Sat, 27 Jul 2024 at 13:26, Kirill Reshke <reshkekirill(at)gmail(dot)com> wrote:
> > >
> > > Hi!
> > > Cloudberry DB (Greenplum fork) uses IMMV feature for AQUMV (auto query
> > > use matview) feature, so i got interested in how it is implemented.
>
> Thank you so much for a lot of comments!
> I will respond to the comments soon.
>
> > >
> > > On Thu, 11 Jul 2024 at 09:24, Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp> wrote:
> > > >
> > > > I updated the patch to bump up the version numbers in psql and pg_dump codes
> > > > from 17 to 18.
> > >
> > > Few suggestions:
> > >
> > > 1) `Add-relisivm-column-to-pg_class-system-catalog` commit message
> > > should be fixed, there is "isimmv" in the last line.
> > > 2) I dont get why `Add-Incremental-View-Maintenance-support.patch`
> > > goes after 0005 & 0004. Shoulndt we first implement feature server
> > > side, only when client (psql & pg_dump) side?
> > > 3) Can we provide regression tests for each function separately? Test
> > > for main feature in main patch, test for DISTINCT support in
> > > v34-0007-Add-DISTINCT-support-for-IVM.patch etc? This way the patchset
> > > will be easier to review, and can be committed separelety.
> > > 4) v34-0006-Add-Incremental-View-Maintenance-support.patch no longer
> > > applies due to 4b74ebf726d444ba820830cad986a1f92f724649. After
> > > resolving issues manually, it does not compile, because
> > > 4b74ebf726d444ba820830cad986a1f92f724649 also removes
> > > save_userid/save_sec_context fields from ExecCreateTableAs.
> > >
> > > > if (RelationIsIVM(matviewRel) && stmt->skipData)
> > > Now this function accepts skipData param.
> > >
> > > 5) For DISTINCT support patch uses hidden __ivm* columns. Is this
> > > design discussed anywhere? I wonder if this is a necessity (only
> > > solution) or if there are alternatives.
> > > 6)
> > > What are the caveats of supporting some simple cases for aggregation
> > > funcs like in example?
> > > ```
> > > regress=# CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_2 AS SELECT
> > > sum(j) + sum(i) from mv_base_a;
> > > ERROR: expression containing an aggregate in it is not supported on
> > > incrementally maintainable materialized view
> > > ```
> > > I can see some difficulties with division CREATE IMMV .... AS SELECT
> > > 1/sum(i) from mv_base_a; (sum(i) == 0 case), but adding &
> > > multiplication should be ok, aren't they?
> > >
> > >
> > > Overall, patchset looks mature, however it is far from being
> > > committable due to lack of testing/feedback/discussion. There is only
> > > one way to fix this... Test and discuss it!
> > >
> > >
> > > [1] https://github.com/cloudberrydb/cloudberrydb
> >
> > Hi! Small update: I tried to run a regression test and all
> > IMMV-related tests failed on my vm. Maybe I'm doing something wrong, I
> > will try to investigate.
> >
> > Another suggestion: support for \d and \d+ commands in psql. With v34
> > patchset applied, psql does not show anything IMMV-related in \d mode.
> >
> > ```
> > reshke=# \d m1
> > Materialized view "public.m1"
> > Column | Type | Collation | Nullable | Default
> > --------+---------+-----------+----------+---------
> > i | integer | | |
> > Distributed by: (i)
> >
> >
> > reshke=# \d+ m1
> > Materialized view "public.m1"
> > Column | Type | Collation | Nullable | Default | Storage |
> > Compression | Stats target | Description
> > --------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
> > i | integer | | | | plain |
> > | |
> > View definition:
> > SELECT t1.i
> > FROM t1;
> > Distributed by: (i)
> > Access method: heap
> >
> > ```
> >
> > Output should be 'Incrementally materialized view "public.m1"' IMO.
>
>
> --
> Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>

So, I spent another 2 weeks on this patch. I have read the whole
'Incremental View Maintenance' thread (from 2018), this thread, some
related threads. Have studied some papers on this topic. I got a
better understanding of the theory this work is backed up with.
However, I still can add my 2c.

== Major suggestions.

1) At first glance, working with this IVM/IMMV infrastructure feels
really unintuitive about what servers actually do for query execution.
I do think It will be much better for user experience to add more
EXPLAIN about IVM work done inside IVM triggers. This way it is much
clearer which part is working slow, so which index should be created,
etc.

2) The kernel code for IVM lacks possibility to be extended for
further IVM optimizations. The one example is foreign key optimization
described here[1]. I'm not saying we should implement this within this
patchset, but we surely should pave the way for this. I don't have any
good suggestions for how to do this though.

3) I don't really think SQL design is good. CREATE [INCREMENTAL] M.V.
is too ad-hoc. I would prefer CREATE M.V. with (maintain_incr=true).
(reloption name is just an example).
This way we can change regular M.V. to IVM and vice versa via ALTER
M.V. SET *reloptions* - a type of syntax that is already present in
PostgreSQL core.

== Other thoughts

In OLAP databases (see [2]), IVM opens the door for 'view
exploitation' feature. That is, use IVM (which is always up-to-date)
for query execution. But current IVM implementation is not compatible
with Cloudberry Append-optimized Table Access Method. The problem is
the 'table_tuple_fetch_row_version' call, which is used by
ivm_visible_in_prestate to check tuple visibility within a snapshot. I
am trying to solve this somehow. My current idea is the following:
multiple base table modification via single statement along with tuple
deletion from base tables are features. We can error-out these cases
(at M.V. creation time) all for some TAMs, and support only insert &
truncate. However, I don't know how to check if TAM supports
'tuple_fetch_row_version' other than calling it and receiving
ERROR[3].

== Minor nitpicks and typos.

reshke=# insert into tt select * from generate_series(1, 1000090);
^CCancel request sent
ERROR: canceling statement due to user request
CONTEXT: SQL statement "INSERT INTO public.mv1 (i, j) SELECT i, j
FROM (SELECT diff.*, pg_catalog.generate_series(1,
diff."__ivm_count__") AS __ivm_generate_series__ FROM new_delta AS
diff) AS v"
Time: 18883.883 ms (00:18.884)

This is very surprising, isn't it? We can set HINT here, to indicate
where this query comes from.

2)
deleted/deleted -> updated/deleted
+ /*
+ * XXX: When using DELETE or UPDATE, we must use exclusive lock for now
+ * because apply_old_delta(_with_count) uses ctid to identify the tuple
+ * to be deleted/deleted, but doesn't work in concurrent situations.

3) Typo in rewrite_query_for_postupdate_state:
/* Retore the original RTE */

4) in apply_delta function has exactly one usage, so the 'use_count'
param is redundant, because we already pass the 'query' param, and
'use_count' is calculated from the 'query'.

5) in calc_delta:

> ListCell *lc = list_nth_cell(query->rtable, rte_index - 1);
> RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);
Should we add Assert(list_lenght(lc) == 1) here? Can there be multiple
items in this list?

6) In get_prestate_rte:
> appendStringInfo(&str,
> "SELECT t.* FROM %s t"
> " WHERE pg_catalog.ivm_visible_in_prestate(t.tableoid, t.ctid ,%d::pg_catalog.oid)",
> relname, matviewid);

Identitation issue. This will not be fixed via pg_ident run, because
this is str contant, so better so fix it by-hand.

7) in apply_new_delta_with_count:

> appendStringInfo(&querybuf,
> "WITH updt AS (" /* update a tuple if this exists in the view */
> "UPDATE %s AS mv SET %s = mv.%s OPERATOR(pg_catalog.+) diff.%s "

SET % OPERATOR(pg_catalog.=) mv.%s ?

same for append_set_clause_for_count, append_set_clause_for_sum,
append_set_clause_for_minmax

> /* avg = (mv.sum - t.sum)::aggtype / (mv.count - t.count) */
> appendStringInfo(buf_old,
> ", %s = %s OPERATOR(pg_catalog./) %s",

should be
/* avg OPERATOR(pg_catalog.=) (mv.sum - t.sum)::aggtype / (mv.count -
t.count) */
appendStringInfo(buf_old,
", %s = %s OPERATOR(pg_catalog./) %s",

[1] https://assets.amazon.science/a2/57/a00ebcfc446a9d0bf827bb51c15a/foreign-keys-open-the-door-for-faster-incremental-view-maintenance.pdf
[2] https://github.com/cloudberrydb/cloudberrydb
[3] https://github.com/cloudberrydb/cloudberrydb/blob/b9aec75154d5bbecce7ce3a33e8bb2272ff61511/src/backend/access/appendonly/appendonlyam_handler.c#L828

--
Best regards,
Kirill Reshke

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2024-08-19 21:46:10 Re: Thread-safe nl_langinfo() and localeconv()
Previous Message Jelte Fennema-Nio 2024-08-19 20:53:46 Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs