WL#6860: Binlogging XA-prepared transaction

Affects: Server-5.7   —   Status: Complete

This worklog adds support for XA-transactions to replication. An XA-transaction 
allows the client to participate in the two-phase commit protocol. The state of 
the XA-transaction being prepared is persisted in the database; i.e., a prepared 
XA-transaction will survive both client reconnects and server restarts.

Currently, a prepared XA-transaction is lost after a client reconnects or
the server restarts. Through this worklog, an XA-transaction will be binlogged
in two rounds:
R1) In the first round, at XA-prepare, the transaction is logged
as prepared but not yet committed. Its state is persisted in the engine (InnoDB).
R2) When the transaction finally gets committed (XA-commit) or rolled back
(XA-rollback), the second binlogging round completes the logging by writing
XA-commit/rollback into the binlog.

Therefore an XA-transaction consumes two GTIDs: one is assigned to the prepare
part and the other to the commit (or rollback) part, each in its own round.
There may be other transactions (XA or not XA) binlogged in between the
XA-prepare part and the XA-commit/rollback part.

REFERENCES:
- BUG#12161

FR0 A prepared XA transaction (optionally) survives a server restart or
    client disconnection, which ends the former policy of rolling it back
    at disconnection.
    This must work with both --skip-log-bin and --log-bin.

FR1 Binary logging of XA transaction is done in two phases:
    a. New replication event type XA-prepare at time of XA PREPARE
    b. Logging XA COMMIT,ROLLBACK later separately.
    Notice that an XA-prepare event is not necessarily followed immediately
    by its XA COMMIT or ROLLBACK, so the binary logging of any
    two XA transactions may interleave.
    The prepare part of XA and the commit part of XA may appear in different
    binlog files.

FR2 Slave applier must be able to handle multiple interleaved
    XA transactions, in sequential as well as in any parallel mode.

FR3 New logging style of XA transaction must be compatible with all
    existing replication features including binlog format, GTID (on|off),
    MTS (any kind of scheduler incl the "legacy" sequential), mixed engine
    transactions with direct or cached DML:s on non-transactional tables,
    and slave side relay-log repository types.


There's no non-functional requirement.
=== Summary ===

Before this worklog, when the server ran with replication turned on (binary
log ON), a prepared XA transaction had to be rolled back at disconnection.
The reason was that even though the prepared transaction could be discovered
at server recovery, it was impossible to log the transaction content, at least
in the formats of @@binlog_format.
The limitation is lifted by binlogging the prepared transaction before
disconnection or a server crash takes place.
And once it is certain that the transaction content was logged, the
transaction's final XA COMMIT or XA ROLLBACK can be logged too.

  --connection first

   XA START 'trx';
   /* dml operations */
   XA END   'trx';
   XA PREPARE 'trx';  /* => trx got prepared and binlogged */

  --connection second
   SHOW BINLOG EVENTS; /* => must display XA START .. XA PREPARE */

 
  # At this point either the connection disconnects

  --disconnect first

  # OR
  #
  # the server "disconnects" by crashing or shutting down.
  # Any external connection (after the server restart in the 2nd case)
  # must be able to find 'trx' in the list
  # of prepared transactions, and commit or roll it back:

  --connection any_connection

   XA RECOVER;         # must display 'trx'
   XA COMMIT 'trx';    # must commit trx, be logged as a Query-log-event and
                       # return  OK

=== Binary logging and Slave applier extension ===

Binary logging is extended to write a prepared XA transaction and its Commit or
Rollback decision as separate groups of events into the binary log.
That allows XA transactions to interleave in the binary log, yet without any
harm to data consistency after eventual replaying on the slave.
The slave applier is taught to deal with an XA-prepared group of events and its
termination (Commit or Rollback) event.

=== XA transaction caching and recovery extension ===

The XA recovery implementation is extended so that closing a connection
leaves a prepared XA in the transaction cache as well as specially marked in
InnoDB. Such a prepared XA can be discovered and terminated as the user
wishes, and can also be handled by the slave applier.

=== Handlerton interface extension ===

A new handlerton interface is added that allows attaching an SE "internal"
transaction to, and detaching it from, the server level transaction
handle. It is motivated by the needs of the slave applier, which switches between
interleaved XA-transactions, first to prepare them and then to commit (or roll back).

=== Innodb changes ===

Besides the new handlerton method initialization,
the connection close logic in InnoDB is augmented, and
changes are made to keep a disconnected transaction's state sane so that it
survives the server restart.


=== User Interface ===

There are no new features added. The user is informed about recovered
prepared transaction at the server startup time by existing
facilities.

Low-level tasks are discussed in the order of the HLS list.

=== Binary logging extension ===

New style of binary logging of XA transaction is done in two rounds, see Ra and Rb:

Ra. New replication event type XA-prepare at time of XA PREPARE

When an XA-transaction starts executing its first DML operation, it registers
the binlog hton and initiates the header binlog event, which is
a Query-log-event of the XA-start query. The event naturally contains
an encoded XID.
The binlog_prepare() hton method is extended to execute MYSQL_BIN_LOG::commit()
when XA-prepare is handled through trans_xa_prepare().
MYSQL_BIN_LOG::commit() is extended to execute a special
XA-prepare branch.
The transaction content is sandwiched in between the logged XA-START
Query-log-event and a new XA_PREPARE_LOG_EVENT event.
The extension to MYSQL_BIN_LOG::commit() ensures that the
XA-prepared transaction is not committed to the engine.
The GTID for the XA-prepared transaction is generated according to
the rules of regular transaction commit in the binlog.

Rb. Logging XA COMMIT,XA ROLLBACK later separately

The XA COMMIT|XA ROLLBACK query-log-event is generated through
binlog_xa_{commit,rollback}, which is another extension to the binary logging routine.
By that point the XA-transaction cache must have already been flushed.
The XA-transaction terminal query is logged as a stand-alone query
also containing the encoded XID and a GTID value different from the
prepared part's GTID.

Here is an example of typical interleaved logging of two XA-transactions:

   SET GTID_NEXT=gid_1;
   XA START xid_1;
   call update_data();
   XA PREPARE xid_1;
   SET GTID_NEXT=gid_2;
   XA START xid_2;
   call update_data();
   XA PREPARE xid_2;
   
   SET GTID_NEXT=gid_3;
   XA COMMIT xid_1;
   SET GTID_NEXT=gid_4;
   XA COMMIT xid_2;
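
The ordering property of the example above can be sketched as a small C++
check. This is only an illustrative model, not server code: GroupSketch,
GroupKind, interleaving_is_consistent() and example_log() are hypothetical
names, and the binlog is reduced to a list of (GTID, kind, XID) records. The
check verifies that every group carries its own GTID and that each XA COMMIT
or XA ROLLBACK is preceded by the XA PREPARE of the same XID.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of one binlogged event group of an XA transaction.
enum class GroupKind { XaPrepare, XaCommit, XaRollback };

struct GroupSketch {
  std::string gtid;  // GTID assigned to this group
  GroupKind kind;    // prepared part or terminal part
  std::string xid;   // XID the group refers to
};

// Returns true when every group has a distinct GTID and every terminal
// group follows a still-pending prepare of the same XID. Prepared
// transactions that have no terminal group yet are allowed.
bool interleaving_is_consistent(const std::vector<GroupSketch> &log) {
  std::set<std::string> used_gtids;
  std::set<std::string> pending_xids;
  for (const GroupSketch &g : log) {
    if (!used_gtids.insert(g.gtid).second) return false;  // GTID reused
    if (g.kind == GroupKind::XaPrepare) {
      if (!pending_xids.insert(g.xid).second) return false;  // prepared twice
    } else if (pending_xids.erase(g.xid) == 0) {
      return false;  // terminal part without a prepared part
    }
  }
  return true;
}

// The interleaved example from the text, expressed in the model.
std::vector<GroupSketch> example_log() {
  return {{"gid_1", GroupKind::XaPrepare, "xid_1"},
          {"gid_2", GroupKind::XaPrepare, "xid_2"},
          {"gid_3", GroupKind::XaCommit, "xid_1"},
          {"gid_4", GroupKind::XaCommit, "xid_2"}};
}
```

Calling interleaving_is_consistent(example_log()) accepts the sequence shown
above, while a log that starts with XA COMMIT for an unprepared XID is rejected.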

XA COMMIT|XA ROLLBACK is singled out into a Query-log-event that is logged
only when its XA-prepared transaction part was logged, as indicated by
the flag

   thd->transaction.xid_state.is_binlogged

which is raised when XA-prepare indeed has something to log.

The flag is checked in either branch (the regular or the recovery) of
XA-COMMIT|ROLLBACK logging.
At the server restart the flag is set to TRUE because the transaction
must have updated data (that is why it was left in the engine) and logged
its XA-prepared part before the (disconnection) shutdown/crash.
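
The role of the flag can be illustrated with a minimal sketch. It assumes a
string-based stand-in for the binary log; XidStateSketch, BinlogSketch and
the three functions are hypothetical names and do not correspond to the
server's classes. The terminal statement is written only when the prepared
part was written, and a transaction found at restart is treated as already
binlogged.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for XID_STATE; only the flag matters here.
struct XidStateSketch {
  std::string xid;
  bool is_binlogged;  // raised when the prepared part reached the binlog
};

// Hypothetical in-memory "binlog": one string per logged group.
using BinlogSketch = std::vector<std::string>;

// Round 1: log the prepared part only if there is something to log.
void log_xa_prepare_sketch(BinlogSketch &log, XidStateSketch &state,
                           bool has_changes) {
  if (!has_changes) return;
  log.push_back("XA START " + state.xid + " ... XA PREPARE " + state.xid);
  state.is_binlogged = true;
}

// Round 2: log XA COMMIT or XA ROLLBACK only when round 1 was logged.
// Returns true if a terminal group was written.
bool log_xa_terminal_sketch(BinlogSketch &log, const XidStateSketch &state,
                            bool commit) {
  if (!state.is_binlogged) return false;
  log.push_back((commit ? "XA COMMIT " : "XA ROLLBACK ") + state.xid);
  return true;
}

// Recovery branch: a prepared transaction found at restart is assumed to
// have logged its prepared part, so the flag starts out raised.
XidStateSketch recovered_xid_state_sketch(const std::string &xid) {
  return XidStateSketch{xid, true};
}
```

An XA transaction that prepared without changes therefore leaves no trace in
the sketch's log, neither for its prepared part nor for its termination.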


=== Slave applier extension ===

To cope with XA's interleaved logging the slave applier has to be able
to switch (see Handlerton extension section for details) from one
transaction to another. In the MTS case the DB scheduler can't assign
XA-commit to another worker, and that is addressed by making XA-commit
depend on a magic max # of accessed db:s tag, which forces
synchronization with all other workers prior to executing XA-commit.

The clock scheduler does not need any changes. The XA-prepare commit
timestamp is guaranteed to be less than any XA-commit's possible
commit parent, which makes XA-commit schedulable to any worker.

The Query-log-event of XA-start is made a legal group starter
(starts_group() of log_event.h). The new XA-prepare event plays the role of
the group terminal event. That is reflected in trx_boundary_parser.

The new XA-prepare log event is made to inherit from Xid-log-event for a few
reasons. One of them is to reuse Xid's functionality as a replication
event group terminal event.
The actual inheritance is arranged as follows:

      binary_log::
      Binary_log_event
               ^         "main"::Log_event
               |                 ^
               |                 |
      binary_log::      "main"::
      XA_prepare_event  Xid_apply_log_event
                \       /
                 \     /
                  \   /
             XA_prepare_log_event

Here the new Xid_apply_log_event is a common parent shared with Xid_log_event,
whose dependencies are changed to

        Binary_log_event
               ^
               |
               |
           Xid_event  Xid_apply_log_event
                \       /
                 \     /
                  \   /
               Xid_log_event


A specific of XA-start is that it replaces the currently associated engine
transaction with a new one that the engine must initiate internally.
The old association is preserved inside a new member of Transaction_ctx.
At the end of XA-prepare the pre-XA-start association is restored.


=== Handlerton interface extension ===

A new handlerton interface

+  /**
+     The engine's native transaction associated with THD is replaced
+     with that of the 2nd argument.
+     The old value is returned through a buffer if a non-null pointer
+     is provided as the 3rd argument.
+     The method is used by the XA start and XA prepare handlers to
+     handle an XA transaction that is logged as two parts by the slave applier.
+
+     This interface concerns engines that are aware of XA transactions.
+  */
+  void (*replace_native_transaction_in_thd)(THD *thd, void *new_trx_arg,
+                                    void **ptr_trx_arg);

gives the slave applier the ability to process an XA transaction's
phases in an interleaved manner. The applier executes the XA-prepare and
disconnects from the current XA. Later, this or another applier will be
scheduled with the XA-COMMIT|XA-ROLLBACK, for the handling of which there
has always been an "external" facility in place.
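
The swap semantics of the interface can be sketched in isolation as follows.
This is a simplified model under stated assumptions: ThdSketch holds a single
engine transaction slot, ApplierXaSketch stands in for the new
Transaction_ctx member that preserves the old association, and the fresh
transaction is passed in explicitly, whereas in the design above the engine
initiates it internally. All names ending in "sketch" or "Sketch" are
hypothetical.

```cpp
#include <cassert>

// Hypothetical stand-in for THD: one slot for the engine's native trx.
struct ThdSketch {
  void *engine_trx;
};

// Mirrors the documented contract: install new_trx_arg, and hand the old
// value back through ptr_trx_arg when a non-null buffer is provided.
void replace_native_transaction_in_thd_sketch(ThdSketch *thd,
                                              void *new_trx_arg,
                                              void **ptr_trx_arg) {
  if (ptr_trx_arg != nullptr) *ptr_trx_arg = thd->engine_trx;
  thd->engine_trx = new_trx_arg;
}

// Hypothetical applier-side bookkeeping of the pre-XA-start association.
struct ApplierXaSketch {
  ThdSketch *thd;
  void *saved_trx;
};

// XA START: remember the current association and switch to the XA's trx.
void on_xa_start_sketch(ApplierXaSketch &applier, void *xa_trx) {
  replace_native_transaction_in_thd_sketch(applier.thd, xa_trx,
                                           &applier.saved_trx);
}

// End of XA PREPARE: restore the saved association and return the
// now-detached prepared transaction.
void *on_xa_prepare_end_sketch(ApplierXaSketch &applier) {
  void *prepared_trx = nullptr;
  replace_native_transaction_in_thd_sketch(applier.thd, applier.saved_trx,
                                           &prepared_trx);
  applier.saved_trx = nullptr;
  return prepared_trx;
}
```

After on_xa_prepare_end_sketch() the THD stand-in points at its original
transaction again, so the applier is free to start another XA transaction
while the prepared one waits for its XA COMMIT or XA ROLLBACK.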

=== XA transaction caching, recovery extension ===

When the transaction is already prepared and binlogged and the client
disconnects, the server won't destroy the Transaction_ctx object (and
therefore its XID_STATE) associated with the XA transaction; find an
"envisioned" block in sql_class.cc
  -#ifdef ENABLE_WHEN_BINLOG_WILL_BE_ABLE_TO_PREPARE
The Transaction_ctx object remains in the transaction_cache (see
xa.{h,cc}) and gets marked with a special new is_binlogged flag.
The raised flag affects binary logger execution when it processes
the following XA-COMMIT|XA-ROLLBACK.
For that purpose a new function is introduced:

+/**
+  Transaction is marked in the cache as to be recovered.
+  The method allows to sustain prepared transaction disconnection.
+
+  @param transaction
+                 Pointer to Transaction object that is replaced.
+
+  @return  operation result
+    @retval  false   success or a cache already contains XID_STATE
+                     for this XID value
+    @retval  true    failure
+*/
+
+bool transaction_cache_unrecover(Transaction_ctx *transaction);

It is implemented to reuse the logic of transaction_cache_insert_recovery().

Notice that, similarly to the server layer's XA unrecovering, the engine
does not destroy its transaction view (see ha_innodb.cc,
connection_close handlerton method changes).
The preserved Transaction_ctx object remains cross-linked with
the prepared InnoDB transaction.
The association is restored at the server restart, in which case the server
level Transaction_ctx object is reconstructed and inserted into the
global Transaction_cache with the new is_binlogged flag raised.
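
The cache behaviour at disconnection can be sketched with a std::map keyed
by XID. This is an assumption-laden illustration, not the xa.cc
implementation: TrxCtxSketch, TrxCacheSketch and the functions below are
hypothetical, and the "detached" member merely models the fact that no
connection owns the transaction any more.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical reduction of a cached transaction context.
struct TrxCtxSketch {
  bool prepared;      // XA PREPARE has completed
  bool is_binlogged;  // the prepared part was written to the binlog
  bool detached;      // no client connection owns the transaction
};

using TrxCacheSketch = std::map<std::string, TrxCtxSketch>;

// Connection close: a prepared transaction stays in the cache, marked as
// detached; anything not yet prepared is dropped (i.e. rolled back).
void on_connection_close_sketch(TrxCacheSketch &cache,
                                const std::string &xid) {
  TrxCacheSketch::iterator it = cache.find(xid);
  if (it == cache.end()) return;
  if (it->second.prepared) {
    it->second.detached = true;
  } else {
    cache.erase(it);
  }
}

// XA RECOVER: list the XIDs of all prepared transactions in the cache.
std::vector<std::string> xa_recover_sketch(const TrxCacheSketch &cache) {
  std::vector<std::string> xids;
  for (const auto &entry : cache) {
    if (entry.second.prepared) xids.push_back(entry.first);
  }
  return xids;
}

// XA COMMIT / XA ROLLBACK by XID from any connection. Returns false for an
// unknown XID; otherwise removes the entry and reports through must_binlog
// whether the terminal statement has to be written to the binlog.
bool terminate_by_xid_sketch(TrxCacheSketch &cache, const std::string &xid,
                             bool *must_binlog) {
  TrxCacheSketch::iterator it = cache.find(xid);
  if (it == cache.end() || !it->second.prepared) return false;
  *must_binlog = it->second.is_binlogged;
  cache.erase(it);
  return true;
}
```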


=== Innodb changes ===

1. The new hton method is implemented to clear out or restore the THD to trx
   association.
   See innodb_replace_trx_in_thd().

2. trx_disconnect_from_mysql() is added to do what the name suggests, which
   is essentially a lighter cleanup than that of trx_free_in_mysql().
   The main use case is innobase_close_connection(), which starts using
   the lighter method for a TRX_STATE_PREPARED transaction's connection.

3. A new flag is introduced in the trx_t struct to give us
   a way to distinguish between a "cold" transaction
   recovered at server recovery time and a "warm" one that is accessed via
   the recovery interface in the same server runtime in which it was created.

   A use case for this task is 
   --connection current
     XA START 'trans';
     ...
     XA PREPARE 'trans';
   --disconnect current

   --connect  next
   XA COMMIT 'trans';

   In particular
       lock_trx_release_locks()
   should not 
       trx_sys->n_prepared_recovered_trx--
   when seeing such "warm" recoverable trx.

   In the current patch it's achieved via adding a new member to

struct trx_t{
+	ulint		is_disconnected;/*!< 0=normal transaction,
+                                        1=prepared and disconnected so could
+                                        be recovered via xid interface */


  to set it to 1 in innodb_replace_trx_in_thd() (the slave applier) and
  innobase_close_connection() (master side).
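
The counter rule from item 3 can be sketched as below. The structures are
hypothetical reductions, not the InnoDB definitions: TrxSysSketch keeps only
the counter, and TrxSketch keeps is_disconnected plus an is_recovered member
that is assumed here to mark a transaction reached through the recovery
interface. Only a "cold" transaction, which was counted at startup, may
decrement the counter.

```cpp
#include <cassert>

// Hypothetical reduction of trx_sys: only the counter of prepared
// transactions that were recovered at server startup.
struct TrxSysSketch {
  unsigned long n_prepared_recovered_trx;
};

// Hypothetical reduction of trx_t.
struct TrxSketch {
  bool is_recovered;              // assumed: reached via recovery interface
  unsigned long is_disconnected;  // 1 = prepared and disconnected ("warm")
};

// Models the decision made while releasing a prepared transaction's locks:
// a "warm" transaction was never counted at startup, so it must not
// decrement the counter; a "cold" one must.
void release_prepared_trx_sketch(TrxSysSketch &sys, const TrxSketch &trx) {
  if (trx.is_recovered && trx.is_disconnected == 0) {
    assert(sys.n_prepared_recovered_trx > 0);
    --sys.n_prepared_recovered_trx;
  }
}
```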

4. Notice that trx_deregister_from_2pc(trx) at the end of
   innobase_{commit,rollback}_by_xid
   might be fixing a bug.

5. trx->will_lock = 0
   has to be added to the disconnected prepared trx cleanup to please the trx cache.