URGENT
======

* Implement parsing of the replication_gcs GUC for Spread and Ensemble.

* Check for places where replication_enabled should be checked more extensively.

Complaint about select() not being interrupted by signals:
http://archives.postgresql.org/pgsql-hackers/2008-12/msg00448.php

Restartable signals 'n all that:
http://archives.postgresql.org/pgsql-hackers/2007-07/msg00003.php


3.2.1 Internal Message Passing
==============================

* Maybe send IMSGT_READY after some other commands, not only after IMSGT_CHANGESET. Remember that local transactions also have to send an IMSGT_READY, so that their proc->coid gets reset.

* Make sure the coordinator copes with killed backends (local as well as remote ones).

* Check if we can use pselect to avoid race conditions with IMessage handling within the coordinator's main loop.

* Check error conditions such as out of memory and out of disk space. Those could prevent a single node from applying a remote transaction. What to do in such cases? A similar one is "limit of queued remote transactions reached".


3.2.2 Communication with the Postmaster
=======================================

* Get rid of the SIGHUP signal (was IMSGT_SYSTEM_READY) for the coordinator and instead only start the coordinator as soon as the postmaster is ready to fork helper backends. That should simplify things and make them more similar to the current Postgres code, e.g. the autovacuum launcher.

* Handle restarts of the coordinator due to a crashed backend. The postmaster already sends a signal to terminate an existing coordinator process and tries to restart one. But the coordinator should then start recovery and only allow other backends after that. Keep in mind that this recovery process is costly, and we should somehow prevent nodes which fail repeatedly from endlessly consuming resources of the complete cluster.

* The backends need to report errors from remote *and* local transactions to the coordinator. Worker backends erroring out while waiting for changesets are critical. Erroring out due to a serialization failure is fine; we can simply ignore the changeset once it arrives later on. But other errors are probably pretty bad at that stage. Upon crashes, the postmaster restarts all backends and the coordinator anyway, so the backend process itself can take care of informing the coordinator via imessages.

* Think about a newly requested helper backend crashing before it registers with the coordinator. That would prevent requesting any further helper backends.


3.2.3 Group Communication System Issues
=======================================

* Drop the static receive buffers of the GCS interfaces in favor of a dynamic one. It's much easier to handle. (A minimal sketch follows at the end of this section.)

* Hot swapping of the underlying GCS of a replicated database is currently not supported. It would involve waiting for all nodes of the group to have joined the new group, then swapping. If we enforce the GCS group name to equal the database name, such a swap is needed for renaming a replicated database, which might be a good reason against that rule.

* Better error reporting to the client in case of GCS errors. There are three phases: connecting, initialization and joining the group. All of those can potentially fail. Currently, an ALTER DATABASE waits until it gets a DB_STATE_CHANGE, and it waits forever if something fails.
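
For the dynamic receive buffer mentioned above, a minimal sketch of the grow-on-demand approach. The struct and function names are made up for illustration and plain malloc/realloc is used; the real code would presumably allocate from a coordinator memory context. It only shows the doubling strategy, not the actual GCS interface:

    /* Sketch only: a grow-on-demand receive buffer for the GCS interfaces.
     * Assumes the struct starts out zeroed; names are illustrative.
     */
    #include <stdlib.h>
    #include <string.h>

    typedef struct gcs_recv_buffer
    {
        char   *data;
        size_t  size;       /* bytes currently stored */
        size_t  capacity;   /* bytes allocated */
    } gcs_recv_buffer;

    /* Make room for at least 'needed' additional bytes, doubling the
     * allocation as required. Returns 0 on success, -1 on OOM. */
    static int
    gcs_buffer_reserve(gcs_recv_buffer *buf, size_t needed)
    {
        size_t  required = buf->size + needed;
        size_t  newcap = buf->capacity ? buf->capacity : 1024;
        char   *newdata;

        if (required <= buf->capacity)
            return 0;

        while (newcap < required)
            newcap *= 2;

        newdata = realloc(buf->data, newcap);
        if (newdata == NULL)
            return -1;

        buf->data = newdata;
        buf->capacity = newcap;
        return 0;
    }

    /* Append a chunk received from the GCS to the buffer. */
    static int
    gcs_buffer_append(gcs_recv_buffer *buf, const void *chunk, size_t len)
    {
        if (gcs_buffer_reserve(buf, len) < 0)
            return -1;
        memcpy(buf->data + buf->size, chunk, len);
        buf->size += len;
        return 0;
    }
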
3.3.1 Group Communication Services
==================================

* Prevent EGCS from sending an initial view which does not include the local node.

* Complete support for Spread.

* Support for Appia.


3.3.2 Global Object Identifiers
===============================

* Use a naming service translating local OIDs to global ids, so that we don't have to send the full schema and table name every time.


3.3.3 Global Transaction Identifiers
====================================

* Drop COIDs in favor of GIDs.


3.4 Collection of Transactional Data Changes
============================================

* Make sure we correctly serialize transactions which modify tuples that are referenced by a foreign key. An insert or update of a tuple with a reference to somewhere must make sure the referenced tuple didn't change. (The other way around should be covered automatically by the changeset, because it also catches changes by the ON UPDATE or ON DELETE hooks of the affected foreign key.) Write tests for that behaviour.

* Think about removing these additional members of the EState: es_allLocksGranted, es_tupleChangeApplied and es_loopCounter. Those can certainly be simplified.

* Take care of a correct READ COMMITTED mode, which requires changes of a committed transaction to be visible immediately to all other concurrently running transactions. This might be very similar to a fully synchronous, lock-based replication mode. It certainly introduces higher commit latency.

* Add the schema name to the changeset and seq_increment messages to fully support namespaces.

* Support for savepoints requires communicating additional sub-transaction states.


3.6 Application of Change Sets
==============================

* Possibly use heap_{insert,update,delete} directly, instead of going through ExecInsert, ExecUpdate and ExecDelete? That could save us some conditionals, but we would probably need to re-add other stuff.

* Possibly limit ExecOpenIndices() to open only the primary key index for CMD_DELETE?

* Check if ExecInsertIndexTuples() could break with UNIQUE constraint violations due to an out-of-sync replica.

* Make sure the statement_timeout does not affect helper backends.

* Prevent possible deadlocks which might occur due to re-ordered (optimistic) application of change sets from remote transactions. Just make sure the next transaction according to the decided ordering always has a spare helper backend available to get executed on and is not blocked by other remote transactions which must wait for it (and would thus cause a deadlock).


3.8.2 Data Definition Changes
=============================

* Check which messages the coordinator must ignore because they could originate from backends which were running concurrently with a STOP REPLICATION command. Such backends could possibly send changesets and other replication requests.

* Add proper handling of CREATE / ALTER / DROP TABLE and make sure those don't interfere with normal, parallel changeset application.


3.9 Initialization and Recovery
===============================

* Helper processes connected to template databases should exit immediately after having performed their job, so that CREATE DATABASE from such a template database works again.


3.9.1 Initialization and Recovery: Data Transfer
================================================


3.9.2 Initialization and Recovery: Schema Adaptation
====================================================

* Implement schema adaptation.

* Make sure triggers and constraints either only contain functions which are available on every node _or_ execute the triggers and check the constraints only on the machines having them (remote execution?). (A sketch of a local inventory of trigger functions follows this section.)
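
For the trigger/constraint item above, one building block could be an inventory of the trigger functions the local schema actually references, which a joining node could then compare against what the rest of the group provides. A rough sketch, assuming it runs inside a helper backend's transaction; the function name is made up and the shipping of the list via the GCS is left out, only the SPI calls and the pg_trigger/pg_proc catalogs are stock Postgres:

    /* Sketch only: collect the signatures of all functions referenced by
     * triggers in the local database. Internal FK triggers could be
     * filtered out further; omitted here for brevity.
     */
    #include "postgres.h"
    #include "executor/spi.h"
    #include "nodes/pg_list.h"

    static List *
    collect_trigger_function_signatures(void)
    {
        MemoryContext caller_cxt = CurrentMemoryContext;
        List   *result = NIL;
        uint64  i;

        if (SPI_connect() != SPI_OK_CONNECT)
            elog(ERROR, "SPI_connect failed");

        if (SPI_execute("SELECT DISTINCT p.oid::regprocedure::text"
                        "  FROM pg_catalog.pg_trigger t"
                        "  JOIN pg_catalog.pg_proc p ON p.oid = t.tgfoid",
                        true, 0) != SPI_OK_SELECT)
            elog(ERROR, "could not query trigger functions");

        for (i = 0; i < SPI_processed; i++)
        {
            char *sig = SPI_getvalue(SPI_tuptable->vals[i],
                                     SPI_tuptable->tupdesc, 1);

            if (sig != NULL)
            {
                /* copy into the caller's context so the list survives
                 * SPI_finish() */
                MemoryContext oldcxt = MemoryContextSwitchTo(caller_cxt);

                result = lappend(result, pstrdup(sig));
                MemoryContextSwitchTo(oldcxt);
            }
        }

        SPI_finish();
        return result;
    }
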
3.9.5 Initialization and Recovery: Full Cluster Shutdown and Restart
====================================================================

* After a full crash (no majority running, thus cluster-wide operation stopped), we need to be able to recover from the distributed, permanent storage into a consistent state. This requires nodes communicating their recently committed transactions, which didn't make it to the other nodes before the crash.


Cleanup
=======

* Merge repl_database_info::state into group::nodes->state, and main_state into main_group::nodes->state. Add a simpler routine to retrieve the local node.

* Clean up the "node_id_self_ref" mess. The GCS should not be able to send the coordinator a viewchange which does not include the local node itself. In that sense, maybe "nodes" doesn't need to include the local node?

* Reduce the amount of elog(DEBUG...) output to a useful level. Currently mainly DEBUG3 is used, sometimes DEBUG5. Maybe also rethink the precompiler flags which enable or disable this verbose debugging.

* At the moment, exec_simple_query is exported to the replication code, whereas in stock Postgres that function is static.

* The same applies to ExecInsert, which is no longer static but is also used in the recovery code. However, that should be merged into ExecProcessCollection() to reduce code duplication anyway.

* Consistently name the backends 'worker' and 'helper' backends?

* Never call cset_process() from worker backends! Fix the comment above that function.

* The recovery subscriber currently issues a CREATE DATABASE from within a transaction block. That's unclean.

* The database encoding is transferred as a number, not as a string. Not sure if that matters. (A name-based transfer is sketched below.)
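
Regarding the last item, transferring the canonical encoding name instead of the numeric id would avoid relying on the numeric values being identical across versions and builds. A minimal sketch; pg_encoding_to_char() and pg_char_to_encoding() are stock Postgres, while the pack/unpack helpers and the use of a StringInfo for the recovery message are assumptions:

    /* Sketch only: transfer the database encoding by name rather than by
     * its numeric id. The pack/unpack helpers are hypothetical.
     */
    #include "postgres.h"
    #include "lib/stringinfo.h"
    #include "mb/pg_wchar.h"

    /* sender side: put the canonical encoding name into the message */
    static void
    pack_database_encoding(StringInfo msg, int encoding)
    {
        appendStringInfoString(msg, pg_encoding_to_char(encoding));
        appendStringInfoChar(msg, '\0');
    }

    /* receiver side: map the name back, erroring out on unknown names */
    static int
    unpack_database_encoding(const char *name)
    {
        int encoding = pg_char_to_encoding(name);

        if (encoding < 0)
            elog(ERROR, "unrecognized encoding name: %s", name);
        return encoding;
    }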