Re: counting algorithm for incremental matview maintenance

From: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: counting algorithm for incremental matview maintenance
Date: 2013-05-17 20:33:06
Message-ID: CAP-rdTb5m=4P-ibNM20hjsCccNobnRFdf+i1GC-MjamFT=Hfqw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

2013/5/17 Kevin Grittner <kgrittn(at)ymail(dot)com>:

> Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com> wrote:
>
>> Note that the basic count algorithm assumes real-serializable
>> transactions for correctness. Example:

[..]

> Good point.
>
> It might be hard to detect when this type of race condition exists,
> since it could be hidden inside a function. Any thoughts on that?

I think that matviews with defining queries that use functions for
which the system can not prove whether they are safe for the given
incremental maintenance technique (e.g., count algorithm) and
situation (e.g., mandatory isolation level), should not be allowed to
use incremental maintenance of that type.

Although I have only limited practical matview experience (with SQL
Server (2005?) only, there it is called “indexed views”): I assume
that in most other DBMSs, the queries that can be used for defining
incrementally maintained matviews are very limited (they are in SQL
Server).

>> Each of the following would fix this problem:
>>
>> (2) The matview’s defining query and the corresponding base tables
>> must adhere to certain (very commonly satisfied) conditions (for
>> example, if there is an FK between A.a and B.b, the example cannot be
>> made to fail; The FK must not be deferred in the case of
>> right-after-each-statement matview maintenance).
>
> Agreed. That would cover a high percentage of cases regardless of
> transaction isolation level. Knowing when you needed such a
> constraint to make incremental maintenance safe would be hard,
> though.

Reasoning about it becomes easier when you have a clean definition of
which queries you accept, and which ones you don’t (see below). Then
you can look at your definition and say “I allow X because it is
always correct I don’t allow Y because I didn’t prove it to be always
correct (yet).”

>> (3) The count algorithm must be implemented in a way that understands
>> MVCC internals: Reading the base tables must be done using a technique
>> that reads all rows (i.e., also the ones not visible to the current
>> transaction), and the xmin/xmax/etc of the updated rows in the matview
>> must be set in a smart way, so as to yield visibility results that
>> correspond to the visibility of the “originating” rows in the base
>> tables. Calculation of the matview deltas (especially the part where
>> the base tables are inspected for joining rows) must in this case
>> probably be done in a serialized manner.
>
> I will look at this option some more, but on the face of it, I'm
> not quite following how it would work; and it sounds invasive and
> fragile. If you know of any paper on this approach you could point
> me to, or if you could sketch it out in a little more detail, I
> would appreciate that.

Sorry, these are my own musings: I have never found any paper about it.

After a bit more thought, I think it wouldn’t be so easy for the count
algorithm: It would be necessary to express (using physical MVCC rows)
something like “you can only see this (logical) row if both
transaction T1 and T2 are visible to you.” Disjunction would be easy:
create two physical rows with the same data, one that is visible if T1
is visible, and one that is visible if T2 is visible. Unfortunately,
we need conjunction here :-(.

(Upgrading the expressiveness of the MVCC-related information in the
rows would help, but as that is more invasive than anyone wants to
think about, I restrain those thoughts to the confines my own head
;-).)

OTOH, I think that this technique could be used in the case of simple
“aggregation” matviews (without joins, or *maybe* when all joins are
constrained by FKs). (I assume an implementation such as the one that
I sketched in a previous post[1] a few months back.)

[1] <URL:http://www.postgresql.org/message-id/CAP-rdTY+VZ8T_8JBvsksDRdz-_wr5ef80M-_UKgu5bHw+rkjpg@mail.gmail.com>

> There will, at least for the next several releases, be matviews
> defined with queries for which we cannot support incremental
> maintenance -- REFRESH will continue to be the only way to deal
> with these until we enhance incremental maintenance to support them.
> For example, in 9.4 I don't intend to support incremental
> maintenance of matviews defined with recursive queries. To me it
> makes sense to prohibit incremental maintenance of a matview for
> which there is no strategy to deal with concurrency problems.

+1

I think you need a strict definition, specific to your incremental
maintenance algorithm, of which queries you accept and which ones you
don’t. Such a definition could be based on SQL syntax, or on any of
the intermediate forms that are produced by the
parser/rewriter/analyzer/planner/etc. I think that basing your
definition on a later processing stage (e.g., after subqueries are
flattened, after views are expanded, or after “redundant” predicates
are removed) would increase the amount of queries that are allowed,
but might make it somewhat more difficult for a user to predict
whether a query allows incremental mantenance or not. (It might also
make your definition depend on “internal stuff” that might change from
one release to the next.)

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jaime Casanova 2013-05-17 20:46:05 Re: [PATCH]Tablesample Submission
Previous Message Claudio Freire 2013-05-17 19:53:26 Re: counting algorithm for incremental matview maintenance