From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Kevin Grittner <kgrittn(at)ymail(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Steve Singer <steve(at)ssinger(dot)info>, Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: record identical operator |
Date: | 2013-09-23 16:27:24 |
Message-ID: | CA+TgmobX=O9LOsh9CWFQ5Gh0bHQNO4ZHJZ4Jj+hWuAdo3Csrvw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Sep 20, 2013 at 11:23 AM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> Perhaps I'm assuming things are farther along than they are.. I was
> assumed that 'incremental mat view' updates were actuallly 'partial'-
> iow, that it keeps track of the rows in the mat view which are
> impacted by the set of rows in the base relations and then runs specific
> queries to pull out and operate on the base-rel set-of-rows to update
> the specific mat-view-rows. If it's running the whole query again then
> it's certainly more likely to get the same results that the user did,
> but it's not a guarantee and that's only a happy coincidence while we're
> still doing the whole query approach (which I hope we'd eventually
> progress beyond..).
It seems odd to me that you have such strong opinions about what is
and is not acceptable here given that you don't seem to familiar with
the current state of this work. I will attempt to summarize my
understanding. In 9.3, we can refresh a materialized view by taking
an access exclusive lock on the relation, rerunning the query, and
replacing the contents wholesale. In master, there is a new option to
perform the refresh concurrently, which is documented here:
http://www.postgresql.org/docs/devel/static/sql-refreshmaterializedview.html
It reruns the query in its entirety and then figures out what inserts,
updates, and deletes are needed to make the matview match the output
of the query (see refresh_by_match_merge). This is an improvement
over what is available in 9.3, because even though you still have to
rerun the full query, you don't have to lock out access to the table
in order to apply the changes. However, currently, it sometimes fails
to perform updates that are needed to make the contents of the view
match the query output, because it relies on a notion of equality
other than exact equality. Kevin is proposing to fix this problem via
this patch.
Now, the next project Kevin's going to work on, and that he was
working on when he discovered this problem, is incremental
maintenance: that is, allowing us to update the view *without* needing
to rerun the entire query. This record comparison operator will be
just as important in that context. The *only* strategy refreshing a
materialized view that *doesn't* need this operator is the only we
have in 9.3: through out all the old data and replace it with the
result of re-executing the query. If you want anything smarter than
that, you MUST compare old and new rows for equality so that you can
update only those rows that have been changed. And if you compare
them *any strategy other than the one Kevin is proposing*, namely
binary equality, then you may sometimes decide that a row has not been
changed when it really has, and then you won't update the row, and
then incremental maintenance will be enabled to produce *wrong
answers*. So to me this has come to seem pretty much open and shut.
We all know that materialized views are not going to always match the
data returned by the underlying query. Perhaps the canonical example
is SELECT random(). As you pointed out upthread, any query that is
nondeterministic is a potential problem for materialized views. When
you write a query that can return different output based on the order
in which input rows are scanned, or based on any sort of external
state such as a random-number generator, each refresh of a
materialized view based on that query may produce different answers.
There's not a lot we can do about that, except tell people to avoid
using such queries in materialized views that they expect to be
stable. However, what we're talking about here is a completely
separate problem. Even if the query is 100% deterministic, without
this patch, the materialized view can get out of sync with the query
results.
Granted, most of the ways in which it can get out of sync are fairly
subtle: failing to preserve case in a data type where comparisons are
text-insensitive; gaining or loosing an all-zeroes null bitmap on an
array type; adding or removing trailing zeroes after the decimal point
in a numeric. If the materialized view sometimes said "1" when the
query was returning "0", we'd presumably all say "that's a bug, let's
fix it". But what we're actually talking about is that the query
returns "0.00" and the view still says zero. So we're doing a lot of
soul-searching about whether that's unequal enough to justify updating
the row. Personally, though, there's not a lot of doubt in my mind.
If I create a table and I put 0 into a column of that table and then
create a materialized view and that 0 ends up in the materialized
view, and then I update the 0 to 0.00 and refresh the view, I expect
that change to propagate through to the materialized view. It works
that way if I select from a regular, non-materialized view; and it
also works that way if I select from the table itself. The idea that
materialized views should somehow be exempt from reflecting changes to
the underlying data in certain corner cases seems odd and indefensible
to me, and I can't understand why anyone's arguing that we should do
that.
If we're going to avoid that, we need this operator. We can argue
about how it should be named and whether it should be documented, and
we can have all those arguments and still fix the problem. But if we
decide we're not going to add this operator, then that seems to be
basically saying that we don't want to allow materialized views to
accurately reflect the results of the underlying queries. And I think
that would be an extremely poor decision.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Mitsumasa KONDO | 2013-09-23 16:28:51 | Re: gaussian distribution pgbench |
Previous Message | Pavel Stehule | 2013-09-23 16:18:28 | Re: Improving avg performance for numeric |