Faster, more low-level methods of having hot standby / secondary read-only servers?

From: Gunther Schadow <raj(at)gusw(dot)net>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Faster, more low-level methods of having hot standby / secondary read-only servers?
Date: 2022-09-13 09:42:40
Message-ID: dcd6a845-6ff3-0911-38ac-94d32db8c2e6@gusw.net
Lists: pgsql-performance

I was on this DBA.StackExchange question
https://dba.stackexchange.com/questions/316715/in-postgresql-repmgr-master-slave-environment-is-it-necessary-to-have-same-h-w-c/316756,
which reminded me of my experience and disappointment with the
hot-standby design.

Say you have millions of users, mostly querying, of whom perhaps 1000
also create change transactions. It would be really cool to have not
just one but 10 standbys load-balancing all the queries through
PgPool-II (or whenever that gets actually integrated into mainline
PostgreSQL -- is anyone thinking about this? After all, PgPool-II is
essentially 75% PostgreSQL code, is it not?)

The problem I found was that WAL replay requires so many resources
that whatever work the master has done to turn the transaction(s)
into WAL content seems insignificant compared to the work left just
applying that WAL to the data files. I know this because I tried
running the standby for a very insert-heavy workload on lesser
hardware, hoping it would be enough to just "keep up", but it was
not. I needed to use the same hardware configuration (I use AWS, so
it's easy to try out different ones).
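
For what it's worth, this kind of falling-behind is easy to observe.
A minimal sketch, run on the standby (pg_last_xact_replay_timestamp()
is a stock PostgreSQL function):

    # on the standby: wall-clock age of the last replayed transaction
    psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;"

If that interval grows without bound under load, the standby is not
keeping up.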

What I then did to catch up: despite the files having grown quite
big, it was pretty fast to just rsync the data files themselves, and
I was quickly back in sync and could continue processing the WAL
updates.
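
For the record, the catch-up was nothing fancier than something like
this (hostnames and paths are placeholders; the standby is stopped
during the copy, while the master keeps running, so the copy is fuzzy
and consistency comes from the WAL replay afterwards):

    # on the standby: stop, pull the master's data files, restart
    pg_ctl -D /var/db/postgres/data stop -m fast
    rsync -a --delete --exclude 'postmaster.pid' \
        master:/var/db/postgres/data/ /var/db/postgres/data/
    pg_ctl -D /var/db/postgres/data start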

I think I had asked here (a year or two ago) to confirm, and the
conclusion was that, sadly, you do not gain that much free server
power by using one master and several secondaries, because those
secondaries are kept so busy handling the incoming WAL from the
master that they have very little spare resource left to handle a
bunch of querying activity.

But this gave me an idea. Two ideas actually.

1. Use a shared device, also known as a "cluster" filesystem, the
   point being that the standby operates on the same physical drive.
   That may cause contention through increased head-seek activity,
   but this isn't really an issue these days with SSDs. Ultimately it
   pushes the problem down to the hardware, where something similar
   holds: adding a standby increases some load on the master and
   still ties up a significant share of the standby's resources just
   with keeping up. But this lower-level approach seems far less CPU
   intensive than applying WAL.
2. Why not use some rsync-based replication of the actual data files
   rather than the WAL? Perhaps such a standby would not be
   immediately ready to take over when the master goes down, but an
   rsync-based delta applied to the standby, plus the last stretch of
   WAL, should be able to bring the standby to a reliable state just
   like the pure WAL-shipping method (see the sketch after this
   list).
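
A minimal sketch of idea 2, assuming WAL is already being archived
somewhere the standby can reach (all hosts, paths, and the archive
location are made up; whether a fuzzy rsync plus WAL replay always
reaches a consistent state without a proper backup label is exactly
the open question):

    # 1. pull an rsync delta of the data files from the master
    pg_ctl -D /var/db/postgres/data stop -m fast
    rsync -a --delete \
        --exclude 'pg_wal/*' --exclude 'postmaster.pid' \
        master:/var/db/postgres/data/ /var/db/postgres/data/

    # 2. replay the last stretch of WAL from the archive as a standby
    touch /var/db/postgres/data/standby.signal
    echo "restore_command = 'cp /walarchive/%f %p'" \
        >> /var/db/postgres/data/postgresql.conf
    pg_ctl -D /var/db/postgres/data start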

This would seem particularly useful if most of the query activity is
not so critical that it has to be up-to-the-second with the master.
You can always have a little lag with WAL-based replication anyway,
so your results from a standby might just be a few minutes behind
what you would get by querying the master. It follows that
transactions requiring full consistency have to be applied on the
master anyway, and some of the querying happens there too.

Let's think of an airline fare-finder use case with a booking
feature. Most of the activity will be users finding their best
itinerary, comparing fares, conveniences, maybe even available seats.
Then they start putting a reservation together, but we know there
will be attrition: say, 50% of the reservation work will be
abandoned. That means we would let the users build their reservation,
and only when they are getting ready to pay would we begin moving the
transaction to the master, then re-check all the preconditions (e.g.,
is seat 18C still available?) and lock the resources, ready to issue
the reservation when the payment is confirmed, all in one
transaction.
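
On the master, that final step is the usual re-check-and-lock
pattern. A minimal sketch against a made-up seats/reservations schema
(all table and column names here are hypothetical):

    psql -h master -d bookings <<'SQL'
    BEGIN;
    -- lock the seat row and re-check what the standby told the user
    SELECT status FROM seats
      WHERE flight = 'XY123' AND seat = '18C'
      FOR UPDATE;
    -- the application verifies status = 'available'; if so:
    UPDATE seats SET status = 'reserved'
      WHERE flight = 'XY123' AND seat = '18C' AND status = 'available';
    INSERT INTO reservations (flight, seat, passenger)
      VALUES ('XY123', '18C', 'J. Doe');
    COMMIT;
    SQL

If the UPDATE touches zero rows, the precondition failed in the
meantime and the application rolls back.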

If the standbys might be 30 seconds behind the master, they could be
3 minutes behind too, or 30 minutes. The less they are allowed to
fall behind, the more resources they have to spend on tracking the
master. This can be tuned for real-world use cases. But clearly it
would be beneficial to have a means of lightweight replication that
is good enough and doesn't take all the massive resources that
WAL-replay-based replication requires.
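
Some of this trade-off can already be expressed with stock streaming
replication; a sketch, with an illustrative value only (this setting
exists in standard PostgreSQL):

    # on the standby: let long-running queries win over prompt replay
    psql -c "ALTER SYSTEM SET max_standby_streaming_delay = '30min';"
    psql -c "SELECT pg_reload_conf();"

This does not reduce the replay work itself, it only lets the standby
postpone replay in favor of queries -- a weaker form of the same
idea.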

Am I way off-base?

Can I come up with a poor-man's implementation of that? For example,
say I have 3 standby servers (the more standbys I have, the more
relevant this gets overall, as every standby duplicates the heavy
work of applying the WAL to the data files). I would allow each
standby to fall behind up to n minutes, say 10. In the 8th minute I
would take it down briefly, rsync the data files from the master, and
start it back up. This process might take just 2 minutes out of the
10 (I think these are somewhat realistic numbers). And my 3 standbys
rotate doing that, so 2 of the 3 are always up while 1 of them might
be briefly down. A sketch of that rotation follows.
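
A minimal sketch of the rotation, meant to be run staggered (e.g.
from cron at different minute offsets) on standby1..standby3 so that
at most one of them is down at a time; hostnames and paths are
placeholders:

    #!/bin/sh
    # refresh-standby.sh -- briefly take this standby down,
    # rsync the data files from the master, start it back up
    DATADIR=/var/db/postgres/data

    pg_ctl -D "$DATADIR" stop -m fast
    rsync -a --delete \
        --exclude 'pg_wal/*' --exclude 'postmaster.pid' \
        master:"$DATADIR"/ "$DATADIR"/
    pg_ctl -D "$DATADIR" start
    # WAL replay then covers whatever accrued during the ~2 minutes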

This could be improved even further with some OS and SAN support, all
the way down to RAID mirrors. But practically, I could mount the same
file system (e.g., BSD UFS) on the master with -o rw and on the
standby with -o ro. Even without UFS having any "cluster" support, I
can do that. I will get a few inconsistencies, but PostgreSQL is very
tolerant of small inconsistencies and can fix them anyway. (I would
not dare try anything like this with an Oracle system.)

Another tool I could think of using is BSD UFS snapshot support. I
could make a snapshot that is consistent from the file system
perspective. PostgreSQL's writers could interface with that: issue a
sync, trigger a UFS snapshot, then a file system sync. Now any
standby that reads its data files from this snapshot would have not
only file-system-level consistency but even database-level
consistency. So with this kind of replication, using a file system
mounted from multiple PostgreSQL servers, replication should work
well while consuming a minimal amount of server resources and not
leading to too much actual disk-IO contention (seeks).
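
A minimal sketch of that snapshot step, assuming FreeBSD UFS and its
mksnap_ffs(8) utility (paths are placeholders; I am reading "issue a
sync" as a CHECKPOINT here, since that is what flushes PostgreSQL's
dirty buffers to the data files):

    # flush PostgreSQL's dirty buffers, then the OS buffers
    psql -c "CHECKPOINT;"
    sync
    # take an atomic UFS snapshot of the data filesystem
    mksnap_ffs /var/db/postgres/.snap/pgsnap
    # a standby can now mount and read the snapshot read-only

Whether the checkpoint alone buys true database-level consistency, or
whether the standby still has to crash-recover from the WAL inside
the snapshot, would need verification.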

And even the disk contention could possibly be resolved by letting a
RAID mirror sync to the same snapshot point and then splitting it
off, doing the read activity of the standby (which is querying like
crazy) only from that mirror, while batching changes to the RAID
master so that they can be applied with very low overhead.

Anyone thinking about these things?

regards,
-Gunther
