Re: The plan for FDW-based sharding

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: The plan for FDW-based sharding
Date: 2016-02-27 07:29:29
Message-ID: 56D15059.7080403@postgrespro.ru
Lists: pgsql-hackers

On 02/27/2016 06:57 AM, Robert Haas wrote:
> On Sat, Feb 27, 2016 at 1:49 AM, Konstantin Knizhnik
> <k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
>> pg_tsdtm is based on another approach: it is using system time as CSN and
>> doesn't require arbiter. In theory there is no limit for scalability. But
>> differences in system time and necessity to use more rounds of communication
>> have negative impact on performance.
> How do you prevent clock skew from causing serialization anomalies?

If a node receives a message from the "future", it just needs to wait until this future arrives.
In practice we simply "adjust" the system time in this case, moving it forward (the system time itself is not actually changed; we just maintain a correction value which is added to the system time).
This approach is discussed in the following article:
http://research.microsoft.com/en-us/people/samehe/clocksi.srds2013.pdf
I hope the algorithm is explained there much better than I can do here.

A few notes:
1. I cannot prove that our pg_tsdtm implements the approach described in this article with absolute correctness.
2. I did not try to formally prove that our implementation cannot cause serialization anomalies.
3. We just ran various synchronization tests (including the simplest debit-credit test, which breaks old versions of Postgres-XL) over several days, and we did not get any inconsistencies.
4. We have tested pg_tsdtm on a single node, on a blade cluster, and on geographically distributed nodes (more than a thousand kilometers apart: one server was in Vladivostok, another in Kaliningrad). Ping between these two servers takes about 100 msec.
Performance of our benchmark dropped about 100 times, but there were no inconsistencies.

Also, I want to point out once again that pg_tsdtm itself was not the primary point of the proposed patch.
There are well-known limitations of pg_tsdtm which we will try to address in the future.
What we want is to include the XTM API in PostgreSQL, so that we can continue our experiments with different transaction managers and implement multimaster on top of it (our first practical goal) without affecting the PostgreSQL core.

If the XTM patch is included in 9.6, then we can propose our multimaster as a PostgreSQL extension and everybody will be able to use it.
Otherwise we will have to maintain our own fork of Postgres, which significantly complicates both using and maintaining it.

>> So there is no ideal solution which can work well for all clusters. This is
>> why it is not possible to develop just one GTM, propose it as a patch for
>> review and then (hopefully) commit it in Postgres core. IMHO it will never
>> happen. And I do not think that it is actually needed. What we need is a way
>> to be able to create our own transaction managers as Postgres extensions not
>> affecting its core.
> This seems rather defeatist. If the code is good and reliable, why
> should it not be committed to core?

Two reasons:
1. There is no ideal implementation of a DTM which fits all possible needs and is efficient for all clusters.
2. Even if such an implementation existed, the right way to integrate it would still be for Postgres to use some kind of TM API.
I hope that everybody will agree that doing it this way:

#ifdef PGXC
/* In Postgres-XC, stop timestamp has to follow the timeline of GTM */
xlrec.xact_time = xactStopTimestamp + GTMdeltaTimestamp;
#else
xlrec.xact_time = xactStopTimestamp;
#endif

or in this way:

xlrec.xact_time = xactUseGTM ? xactStopTimestamp + GTMdeltaTimestamp : xactStopTimestamp;

is a very, very bad idea.
In OO programming we would have an abstract TM interface and several implementations of this interface, for example
MVCC_TM, 2PL_TM, Distributed_TM...
This is exactly what can be done with our XTM API.
Since Postgres is implemented in C, not C++, we have to emulate interfaces using structures with function pointers.
And please note that there is no need at all to include a DTM implementation in core, since it is not needed by everybody.
It can easily be distributed as an extension.
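The struct-of-function-pointers pattern mentioned above can be sketched as follows. This is an illustrative toy, not the real XTM API: the type and function names (`TransactionManager`, `begin_transaction`, and so on) are invented here. The idea is simply that the core always calls through one pointer, and an extension swaps in its own implementation at load time.

```c
#include <stdint.h>
#include <stddef.h>

/* Abstract "interface": a table of function pointers plus private state. */
typedef struct TransactionManager
{
    int64_t (*begin_transaction)(void *state);
    void    (*commit_transaction)(void *state, int64_t xid);
    void    (*abort_transaction)(void *state, int64_t xid);
    void    *state;                 /* implementation-private data */
} TransactionManager;

/* Default (local) implementation of the interface. */
static int64_t
local_begin(void *state)
{
    static int64_t next_xid = 0;    /* toy XID counter */
    (void) state;
    return ++next_xid;
}

static void
local_commit(void *state, int64_t xid)
{
    (void) state; (void) xid;       /* would write a commit record */
}

static void
local_abort(void *state, int64_t xid)
{
    (void) state; (void) xid;       /* would write an abort record */
}

static TransactionManager LocalTM = {
    local_begin, local_commit, local_abort, NULL
};

/*
 * The core calls CurrentTM->begin_transaction(...) everywhere; a DTM
 * extension replaces this pointer with its own implementation, and no
 * #ifdef or runtime flag is needed at the call sites.
 */
static TransactionManager *CurrentTM = &LocalTM;
```

Compare this with the `#ifdef PGXC` fragment above: with the interface, the call site stays identical no matter which transaction manager is active.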

I hope that quite soon we can propose a multimaster extension which will provide functionality similar to MySQL Galera. But even right now we have integrated pg_dtm and pg_tsdtm with pg_shard and postgres_fdw, allowing us to provide distributed consistency for them.

>
>> All arguments against XTM can be applied to any other extension API in
>> Postgres, for example FDW.
>> Is it general enough? There are many useful operations which currently are
>> not handled by this API. For example performing aggregation and grouping at
>> foreign server side. But still it is very useful and flexible mechanism,
>> allowing to implement many wonderful things.
> That is true. And everybody is entitled to an opinion on each new
> proposed hook, as to whether that hook is general or not. We have
> both accepted and rejected proposed hooks in the past.
>

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
