semi sync seems broken because of Group commit in Percona Server5.5

Bug #1254571 reported by zhai weixiang
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona Server moved to https://jira.percona.com/projects/PS
Invalid
Undecided
Unassigned
5.1
Invalid
Undecided
Unassigned
5.5
Triaged
Medium
Unassigned
5.6
Invalid
Undecided
Unassigned

Bug Description

In function MYSQL_BIN_LOG::trx_group_commit_leader, ONLY the leader thread calls after_flush (this function will store the binlog position using my_pthread_setspecific_ptr)

quoted code in log.cc:

5903 if (RUN_HOOK(binlog_storage, after_flush,
5904 (leader->thd, log_file_name, log_file.pos_in_file, synced)))
5905 {

After all transactions committed by the leader thread, other threads were waked up and after_commit was called

quoted code :

29 int Trans_delegate::after_commit(THD *thd, bool all)
230 {
231 Trans_param param;
232 bool is_real_trans= (all || thd->transaction.all.ha_list == 0);
233
234 param.flags = is_real_trans ? TRANS_IS_REAL_TRANS : 0;
235
236 Trans_binlog_info *log_info=
237 my_pthread_getspecific_ptr(Trans_binlog_info*, RPL_TRANS_BINLOG_INFO);
238
239 param.log_file= log_info ? log_info->log_file : 0;
240 param.log_pos= log_info ? log_info->log_pos : 0;
241
242 int ret= 0;
243 FOREACH_OBSERVER(ret, after_commit, thd, (&param));

 Here log_info is NULL if the current thread is not the leader of group commit, So these threads will never waiting for the ACK from slave.

Tags: contribution
Revision history for this message
zhai weixiang (zhaiwx1987) wrote :

a simple patch based on the logic of MySQL 5.6. Can be patched on Percona Server 5.5.18 we are using now.

tags: added: contribution
Revision history for this message
zhai weixiang (zhaiwx1987) wrote :

I found anther bug, after_flush should be called before signal_update, otherwise the binlog dump thread may be waked up first and missed to mark the packet header (to tell the slave that the master needs an ACK)

quoted code:(trx_group_commit_leader, log.cc)

5901 else
5902 {
5903 signal_update();
5904 }
5905
5906 if (RUN_HOOK(binlog_storage, after_flush,
5907 (leader->thd, log_file_name, log_file.pos_in_file, synced)))
5908 {
5909 sql_print_error("Failed to run 'after_flush' hooks");
5910 for (current= queue; current != NULL; current= current->next)
5911 {
5912 if (!current->error)
5913 {

So the attached file didn't fix this bug completely. I'll rewrite the patch based on the latest Percona Server 5.5.34

Revision history for this message
zhai weixiang (zhaiwx1987) wrote :

patch based on Percona Server 5.5.34

Revision history for this message
zhai weixiang (zhaiwx1987) wrote :

Sorry , comment of #2 is incorrect.

It's ok to call after_flush after signal_update because the LOCK_log was hold by the leader thread.

(It's a problem for me because I have split the LOCK_log for binlog write/dump).

Revision history for this message
Valerii Kravchuk (valerii-kravchuk) wrote :

I do miss some MTR test case in the patch, but hardly it makes any sense to have "New" status for this report. Let developers check and (hopefully) use this contribution.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

Zhai -

Have you checked MariaDB 5.5 where our group commit implementation is taken from? Are they affected? Have they fixed this?

Revision history for this message
zhai weixiang (zhaiwx1987) wrote :

I haven't check the mariadb carefully,just quickly read the related code . I guess mariadb was affected too...Hope someone from maria can help me verifying it.

bellow is a simple test case.

source include/have_semisync_plugin.inc;
source include/have_debug.inc;
source include/have_innodb.inc;
source include/master-slave.inc;

#Install semisync plugin on slave
connection slave;
--replace_regex /\.dll/.so/
eval INSTALL PLUGIN rpl_semi_sync_slave SONAME '$SEMISYNC_SLAVE_PLUGIN';
select @@global.rpl_semi_sync_slave_enabled;
show variables like '%semi%';
SET GLOBAL rpl_semi_sync_slave_enabled= 1;
STOP SLAVE;
START SLAVE;

#Install semisync plugin on master
connection master;
--replace_regex /\.dll/.so/
eval INSTALL PLUGIN rpl_semi_sync_master SONAME '$SEMISYNC_MASTER_PLUGIN';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 100000000;

--sleep 2

CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=innodb;
connect(con1,localhost,root,,);
connect(con2,localhost,root,,);

connection master;
SET GLOBAL rpl_semi_sync_master_wait_no_slave = 1;

connection slave;
#Stop slave, then all transaction on master will wait for restart of slave;
STOP SLAVE;

connection con1;
#leader of group commit
SET DEBUG_SYNC= "commit_after_get_LOCK_log SIGNAL con1_waiting WAIT_FOR
con2_queued";
send INSERT INTO t1 VALUES (1);

connection con2;
#follower of group commit
SET DEBUG_SYNC= "now WAIT_FOR con1_waiting";
SET DEBUG_SYNC= "commit_group_commit_queue SIGNAL con2_queued";
send INSERT INTO t1 VALUES (2);

connection master;
# we expact con1 and con2 should be both waiting for ACK
--sleep 1
SHOW PROCESSLIST;

connection slave;
START SLAVE;
--sleep 1

connection master;
SHOW PROCESSLIST;

#cleanup
DROP TABLE t1;
UNINSTALL PLUGIN rpl_semi_sync_master;

connection slave;
UNINSTALL PLUGIN rpl_semi_sync_slave;

--source include/rpl_end.inc

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-3074

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.