From: "Maeldron T(dot)" <maeldron(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: PostgreSQL super HA (High Availability) conception for 9.5+
Date: 2015-11-18 10:13:43
Message-ID: CAKatfSndnj9THRo6iaqXU2H1Ej3n_RzQ6G-o1OYexUEhUkm5HQ@mail.gmail.com
Lists: pgsql-hackers
Hello,
Foreword:
Unfortunately, I have no time to read the mailing lists or attend PostgreSQL
and NoSQL events. Some of the ideas came from MongoDB and Cassandra. The
inspiration was pg_rewind.
There is little new here; it's a wish-list put together with an eye on what
could be possible in the foreseeable future. It's likely that people have
already worked on a similar or better concept. But let me try.
Reasons:
Downtime is bad. PostgreSQL failover requires manual intervention (client
configuration, or host or DNS editing). Third-party tools (in my experience)
don't offer the same stability and quality as PostgreSQL itself. Also, this
concept wouldn't work without pg_rewind.
Less software means fewer bugs.
Goals:
Providing close to 100% HA with minimal manual intervention. Minimizing
possible human errors during failover. Letting startup founders sleep well
at night. Automatic client configuration. Avoiding split brain.
Extras:
Automatic streaming chain configuration.
No-goals:
Multi-master replication. Sharding. Proxying. Load balancing.
Why these:
It’s better to have a working technology now than a futuristic solution in
the future. For many applications, stability and HA are more important than
sharding or multi-master.
The concept:
You can set up a single-master PostgreSQL cluster with two or more nodes
that can fail over several times without manual re-configuration. Restarting
the client isn't needed if it's smart enough to reconnect. Third-party
software isn't needed. Proxying isn't needed.
Cases:
Running the cluster:
The cluster is running. There is one master. Every other node is a
hot-standby slave.
The client driver accepts several hostname(:port) values in the connection
parameters. They must belong to the same cluster. (The cluster's name might
be provided too.)
The rest of the options (username, database name) are the same and are needed
only once. It's not necessary to list every host. (Even listing one host
is enough, but not recommended.)
The client connects to one of the given hosts. If the node is running and
it’s a slave, it tells the client which host the master is. The client
connects to the master, even if the master was not listed in the connection
parameters.
It’s should be possible that the client stays connected to the slave for
read-only queries if the application wants to do that.
If the node the client tried connect to isn’t working, the client tries
another node and so.
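A minimal client-side sketch of this behaviour (Python with psycopg2).
pg_is_in_recovery() is an existing function; cluster_master_host(), with
which a slave would report the current master's address, is invented here
and stands for whatever mechanism the proposal would add:

    # Hypothetical driver logic: try the listed hosts in order; if the node
    # we reach is a hot-standby slave, ask it where the master is and
    # reconnect there. cluster_master_host() does not exist today.
    import psycopg2

    HOSTS = ["db1.example.com", "db2.example.com", "db3.example.com"]

    def connect_to_master(dbname, user, password):
        for host in HOSTS:
            try:
                conn = psycopg2.connect(host=host, dbname=dbname,
                                        user=user, password=password)
            except psycopg2.OperationalError:
                continue                       # node is down, try the next one
            cur = conn.cursor()
            cur.execute("SELECT pg_is_in_recovery()")
            if not cur.fetchone()[0]:
                return conn                    # this node is the master
            cur.execute("SELECT cluster_master_host()")  # hypothetical
            master_host = cur.fetchone()[0]
            conn.close()
            return psycopg2.connect(host=master_host, dbname=dbname,
                                    user=user, password=password)
        raise RuntimeError("no reachable node in the cluster")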
Manually promoting a new master:
The administrator promotes any of the slaves. The slave tells the master to
gracefully stop. The master stops executing queries. It waits until the
slave (the new master) receives all the replication log. The new master is
promoted. The old master becomes a slave. (It might use pg_rewind).
The old master asks the connected clients to reconnect to the new master.
Then it drops the existing connections. It still accepts new connections,
though, and tells them who the master is.
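To illustrate only the ordering of these steps, a rough sketch; every call
below is hypothetical and stands for machinery this proposal would need:

    # Hypothetical outline of a graceful switchover; none of these calls
    # exist, they only mirror the steps described above.
    def graceful_switchover(old_master, new_master, clients):
        old_master.pause_new_queries()               # stop executing queries
        old_master.wait_until_caught_up(new_master)  # ship the remaining WAL
        new_master.promote()                         # the new master takes over
        old_master.rewind_and_follow(new_master)     # pg_rewind, then run as a slave
        for client in clients:
            client.redirect_to(new_master)           # ask clients to reconnect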
Manual step-down of the master:
The administrator kindly asks the master to stop being the master. The
cluster elects a new master. Then it’s the same as promoting a new master.
Manual shutdown of the master:
It’s same as step-down but the master won’t run as a slave until it’s
started up again.
Automatic failover:
The master stops responding for a given period. The majority of the cluster
elects a new master. Then the process is the same as manual promotion.
When the old master starts up, the cluster tells it that it is not a master
anymore. It does pg_rewind and acts as a slave.
Automatic failover can happen again without human intervention. The clients
are reconnected to the new master each time.
Automatic failover without majority:
It’s possible to tell in the config which server may act as a master when
there is no majority to vote.
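A rough sketch of the decision a single node could make in both of these
cases, assuming the cluster membership machinery can tell it which peers are
currently reachable; master_priority and primary_master are the configuration
settings described further below:

    # Hypothetical election logic, not a real PostgreSQL API. Each node is a
    # dict such as {"name": "db2", "master_priority": 10, "primary_master": False}.
    def elect_new_master(all_nodes, reachable, local):
        have_majority = len(reachable) > len(all_nodes) // 2
        if have_majority:
            # The reachable node with the highest master_priority wins.
            return max(reachable, key=lambda n: n["master_priority"])
        # No majority: only a node explicitly marked primary_master may
        # promote itself, which avoids a split brain.
        if local.get("primary_master"):
            return local
        return None    # stay a slave and keep waiting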
Replication chain:
There are two cases. 1: All the slaves connect to the master. 2: One slave
connects to the master and the rest of the nodes replicate from this slave.
Configuration:
Every node should have a “recovery.conf” that is not renamed on promotion.
(A sketch of such a file follows the parameter list below.)
cluster_name: an identifier for the cluster. Why not.
hosts: list of the hosts. It is recommended, but not required, to include
every host in every file. It could work like the driver does, discovering
the rest of the cluster.
master_priority: integer. How likely this node becomes the new master on
failover (except manual promotion). A working cluster should not elect a
new master just because it has higher priority than the current one.
Election happens only for the reasons described above.
slave_priority: integer. If any running node has this value set above 0, a
streaming slave is also elected, and the rest of the slaves replicate from
that elected slave. Otherwise, they all replicate from the master.
primary_master: boolean. The node may run as master without being elected by
the majority. (This is not needed on manual promotion or shutdown. See
bookkeeping.)
safe: boolean. If this is set to true and any kind of graceful failover
happens, the promotion has to wait until this node also receives the whole
replication stream, even if it's not the new master (unless the node isn't
running). Every node can have this set to true for maximum safety.
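A sketch of what such a non-renamed recovery.conf could look like for one
node of a three-node cluster; all of these parameters belong to the
wish-list rather than to any existing PostgreSQL release, and the values are
only examples:

    # Hypothetical recovery.conf under this proposal.
    cluster_name    = 'orders_cluster'
    hosts           = 'db1.example.com:5432, db2.example.com:5432, db3.example.com:5432'
    master_priority = 10     # more likely to win an election than a node with 5
    slave_priority  = 0      # 0 = replicate directly from the master
    primary_master  = off    # may not promote itself without a majority
    safe            = on     # graceful failover waits for this node's WAL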
Bookkeeping:
It would be good to know whether a node crashed or was shut down properly.
This would make a difference in master election, streaming_slave election
and the “safe” option. A two-node cluster would depend heavily on the
bookkeeping.
Bookkeeping would also help when a crashed/disconnected master that has
primary_master=true comes back but doesn’t see the rest of the cluster.
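As a trivial sketch of the idea (file name and format invented): each node
persists whether it went down cleanly, so the election and the “safe” option
can treat a crashed node differently from a properly stopped one:

    # Hypothetical bookkeeping: persist a clean-shutdown marker.
    import json
    import os

    STATE_FILE = "cluster_state.json"   # invented location and format

    def mark_running():
        # Written on startup; if it is still there on the next start,
        # the node crashed rather than being shut down properly.
        with open(STATE_FILE, "w") as f:
            json.dump({"clean_shutdown": False}, f)

    def mark_clean_shutdown():
        with open(STATE_FILE, "w") as f:
            json.dump({"clean_shutdown": True}, f)

    def crashed_last_time():
        if not os.path.exists(STATE_FILE):
            return False    # first start, nothing to report
        with open(STATE_FILE) as f:
            return not json.load(f).get("clean_shutdown", False)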
Questions:
Is there any chance that something like this gets implemented?
Thank you for reading.
M.