Quick Links

Re: Why we lost Uber as a user

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stephen Frost <sfrost(at)snowman(dot)net>, Alfred Perlstein <alfred(at)freebsd(dot)org>, Geoff Winkless <pgsqladmin(at)geoff(dot)dj>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Why we lost Uber as a user
Date:	2016-08-03 19:05:12
Message-ID:	CA+Tgmoaee_23Ms5hasOT0tzBX12mcA5Ont-8hQ2Cvr4ggbzerg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Aug 3, 2016 at 2:23 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I don't think they are saying that logical replication is more
>> reliable than physical replication, nor do I believe that to be true.
>> I think they are saying that if logical corruption happens, you can
>> fix it by typing SQL statements to UPDATE, INSERT, or DELETE the
>> affected rows, whereas if physical corruption happens, there's no
>> equally clear path to recovery.
>
> Well, that's not an entirely unreasonable point, but I dispute the
> implication that it makes recovery from corruption an easy thing to do.
> How are you going to know what SQL statements to issue? If the master
> database is changing 24x7, how are you going to keep up with that?

I think in many cases people fix their data using business logic. For
example, suppose your database goes down and you have to run
pg_resetxlog to get it back up. You dump-and-restore, as one does,
and find that you can't rebuild one of your unique indexes because
there are now two records with that same PK. Well, what you do is you
look at them and judge which one has the correct data, often the one
that looks more complete or the one with the newer timestamp. Or,
maybe you need to merge them somehow. In my experience helping users
through problems of this type, once you explain the problem to the
user and tell them they have to square it on their end, the support
call ends. The user may not always be entirely thrilled about having
to, say, validate a problematic record against external sources of
truth, but they usually know how to do it. Database bugs aren't the
only way that databases become inaccurate. If the database that they
use to keep track of land ownership in the jurisdiction where I live
says that two different people own the same piece of property,
somewhere there is a paper deed in a filing cabinet. Fishing that out
to understand what happened may not be fun, but a DBA can explain that
problem to other people in the organization and those people can get
it fixed. It's a problem, but it's fixable.

On the other hand, if a heap tuple contains invalid infomask bits that
cause an error every time you read the page (this actually happened to
an EnterpriseDB customer!), the DBA can't tell other people how to fix
it and can't fix it personally either. Instead, the DBA calls me.
While I try to figure out what happened and solve the problem, every
sequential scan on that table fails, so the customer is basically
down. In contrast, in the logical corruption scenario, one record
might be wrong, but basically everything is still working. So it's a
difference between a problem that the DBA can work with coworkers to
fix while the system is up, and a problem that the DBA can't fix and
the system is meanwhile down. That's a big difference.

> I think the realistic answer if you suffer replication-induced corruption
> is usually going to be "re-clone that slave", and logical rep doesn't
> really offer much gain in that.

If you're using multi-master replication, the notion of what's a slave
gets a bit fuzzy, but, apart from that, yes, this is often the
solution. However, even here, logical replication can be better.
Given the right tools, I can fix up the slave incrementally, comparing
it to the master row by row and updating anything that's wrong. If I
have to rebuild a physical master, I'm offline. The difference
doesn't matter if the slave is so badly corrupted that it's unusable,
but it's very common for corruption to involve only a handful of
records, and many users not unreasonably prefer a database with a
couple of corrupted records to one which is totally down. "Hey, the
payroll record for that Tom Lane guy is messed up, don't cut his
paycheck until we get that straightened out." "OK, no problem."

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: Why we lost Uber as a user at 2016-08-03 18:23:36 from Tom Lane

Responses

Re: Why we lost Uber as a user at 2016-08-04 09:00:25 from Torsten Zuehlsdorff

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Álvaro Hernández Tortosa	2016-08-03 19:13:18	Re: Implementing full UTF-8 support (aka supporting 0x00)
Previous Message	Joshua D. Drake	2016-08-03 18:56:04	Re: Why we lost Uber as a user