Re: Ghost data from failed FDW transactions?

From: Jacob Biesinger <jake(dot)biesinger(at)gmail(dot)com>
To: Greg Sabino Mullane <htamfids(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Ghost data from failed FDW transactions?
Date: 2024-08-28 16:18:33
Message-ID: CAHYXj6etiV1pTBvBiQFZ=jqxvNHczZio9cSSb7gMy+RrWwZTTQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, Aug 28, 2024 at 5:39 AM Greg Sabino Mullane <htamfids(at)gmail(dot)com>
wrote:

> On Tue, Aug 27, 2024 at 9:03 PM Jacob Biesinger <jake(dot)biesinger(at)gmail(dot)com>
> wrote:
>
>> I'm scratching my head at a few rows in the root DB, where it seems the
>> corresponding tenant transaction rolled back, but the root DB transaction
>> committed
>>
> ...
>
>> Before I jump into particulars, does this sound like expected behavior?
>>
>
> No, it sounds like something is going wrong. Your setup as described
> should work to keep both sides in sync.
>

Well, that's a plus. At least we're (probably) not using it wrong :)

There aren't many details in the docs around failure modes... is there
anything there that could cause this issue?

For example, if there were some temporary network outage that prevented
communication between the two DBs, would the FDW fail gracefully? Or if
there were some disk failure or something after the FDW signals to the
remote (root) DB to commit but then the local (tenant) DB failed to commit?
We've had a few outages over the years where we hit the `max_connections`
setting on the cluster (which would affect both the root + tenant DBs), but
connections are held for the duration of both local + remote txns, so
doesn't seem like that would affect this. We don't use pgBouncer, either on
the client -> DB or as an in-between on the DB -> DB FDW side.

>
> Through the magic of postgres_fdw, row triggers, and distributed
>> transactions,
>>
>
> Can you expand on "distributed transactions" here?
>

I just mean "using the FDW as described". It is magic that we can get
proper transactional + ACID semantics in a distributed system. The FDW
really helps our use-case.

But to go deeper, we use the javascript knex adapter and some
application-level transaction management that automatically retries a
transaction N times when it encounters serialization errors. On this
particular endpoint, the emitted SQL for the full transaction looks
something like:

BEGIN;
INSERT INTO "devices" ("orgId", "patientId", "deviceId", "data")
VALUES (
'org1',
'patient1',
'device1',
'{"id": "device1", "patientId": "patient1", "serialNumber": "12345",
"status": "active" }'
);
INSERT INTO "devices" ("orgId", "patientId", "deviceId", "data")
VALUES (
'org1',
'patient1',
'device2',
'{"id": "device2", "patientId": "patient1", "serialNumber": "67890",
"status": "active" }'
);
SELECT * FROM "rootDb"."assets";

-- execute some logic client-side, nothing touching the DB

UPDATE "rootDb"."assets" WHERE ...;
COMMIT;

With I guess the maybe-relevant bits here being that we do some additional
reading + writing to the remote / root DB (other tables) subsequently as
part of the same txn. The JS driving this also has the unfortunate
shortcoming that the two `INSERT` statements run "in parallel", meaning
there's a race to execute them (serially) through their shared
txn/connection. The ordering shouldn't matter, but this also means
that error handling (e.g., when there is a conflict with the `EXCLUDE
"serialNumber"` constraint) may not stop the second `INSERT` statement from
being attempted (and rejected by postgres). But I think that's all
client-side details that shouldn't affect the FDW txn semantics, right?

--
Jake Biesinger

On Wed, Aug 28, 2024 at 5:39 AM Greg Sabino Mullane <htamfids(at)gmail(dot)com>
wrote:

> On Tue, Aug 27, 2024 at 9:03 PM Jacob Biesinger <jake(dot)biesinger(at)gmail(dot)com>
> wrote:
>
>> I'm scratching my head at a few rows in the root DB, where it seems the
>> corresponding tenant transaction rolled back, but the root DB transaction
>> committed
>>
> ...
>
>> Before I jump into particulars, does this sound like expected behavior?
>>
>
> No, it sounds like something is going wrong. Your setup as described
> should work to keep both sides in sync.
>
> Through the magic of postgres_fdw, row triggers, and distributed
>> transactions,
>>
>
> Can you expand on "distributed transactions" here?
>
> Cheers,
> Greg
>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2024-08-28 20:47:47 Re: Significant Execution Time Difference Between PG13.14 and PG16.4 for Query on information_schema Tables.
Previous Message David G. Johnston 2024-08-28 14:58:04 Re: PgbackRest : Stanza creation fails on DB Server and Repo Server