[Redis Cluster] Fail detection takes a long time in a big cluster #2336

Open
Hailei opened this issue Jan 29, 2015 · 14 comments

Hailei commented Jan 29, 2015

Scenario:
Redis Cluster 3.0 RC1
160 instances: 80 masters and 80 slaves (one slave per master)
10 machines
node_timeout: 5s

The following shell script is used to measure the fail detection time:

    # Note: HOSTS (the 10 machine addresses), PORTS (the 16 ports per machine),
    # and fail_host/fail_port (the node whose failure is simulated) are assumed
    # to be defined earlier in the script.
    sleep 10
    start_time=`date +%s`
    while true
    do
      # Pick a random observer node, skipping the failed node itself.
      r_host=${HOSTS[$(($RANDOM%10))]}
      r_port=${PORTS[$(($RANDOM%16))]}
      if [ "$fail_host" == "$r_host" ] && [ "$fail_port" == "$r_port" ]; then
        continue
      fi
      # Count the lines where the failed node is still flagged "fail?" (PFAIL);
      # once the flag disappears, the node has been promoted to FAIL.
      fail_flag=`redis-cli -p $r_port -h $r_host cluster nodes | grep "$fail_host:$fail_port" | grep "fail?" | wc -l`
      if [ $? -ne 0 ] || [ "$fail_flag" -ne 0 ]; then
        sleep 1
      else
        break
      fi
    done
    end_time=`date +%s`
    # The +10 accounts for the initial sleep.
    echo "fail detect cost:$((end_time - start_time + 10))"

After many test runs, I found the detection time averages above 30s.

I think this is too long when node_timeout is 5s. Digging into the code, the gossip section picks three nodes at random each time. I suggest considering failing nodes first, e.g. picking the first gossip entry from the set of failing nodes; a rough sketch follows.
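
To make the suggestion concrete, here is a rough shell sketch of the proposed selection (illustrative only; the real selection happens in clusterSendPing() in cluster.c, and PFAIL_NODES/ALL_NODES are hypothetical arrays of node addresses):

    # Build a gossip section of three nodes: take the first entry from the
    # set of nodes currently flagged fail?/fail, if any, then fill the rest
    # with uniform random picks (duplicates are not filtered in this sketch).
    pick_gossip_nodes() {
      local out=()
      if [ ${#PFAIL_NODES[@]} -gt 0 ]; then
        out+=("${PFAIL_NODES[$(($RANDOM % ${#PFAIL_NODES[@]}))]}")
      fi
      while [ ${#out[@]} -lt 3 ]; do
        out+=("${ALL_NODES[$(($RANDOM % ${#ALL_NODES[@]}))]}")
      done
      echo "${out[@]}"
    }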


antirez commented Jan 29, 2015

Hello, this is the same as issue #2285; I'm working on a fix right now. The fix will hopefully be published today.


Hailei commented Jan 29, 2015

thanks!


antirez commented Jan 29, 2015

You are welcome, I'll ping you here in the hope you can independently verify the fix once it's committed :-) Thank you for reporting.


Hailei commented Jan 29, 2015

I will do so gladly


antirez commented Jan 29, 2015

Hello again, fix provided in the 3.0.0 branch. Please could you check if this improved things for you? Thanks!


Hailei commented Jan 30, 2015

@antirez You mean branch 3.0, right? I can't find a branch named 3.0.0.


antirez commented Jan 30, 2015

Yes, sorry, 3.0, or testing, which are identical currently.


Hailei commented Jan 30, 2015

I have tested on branch 3.0 in the same environment; node_timeout is still 5s.
Regarding fail detection: a failed master takes just 5s, but a failed slave takes 112s, and sometimes it never gets promoted from PFAIL to FAIL.
Based on my previous testing, slaves always take longer.
Due to time constraints I only ran three more tests; hopefully this helps you.


antirez commented Jan 30, 2015

Thanks @Hailei, unfortunately I'm currently not able to replicate the issue. This is how I tested (though to be honest, I tested against unstable, which now has a few more optimizations in this regard):

  1. I started 180 nodes, 90 masters and 90 slaves, using the utils/create-cluster script, with the following configuration in config.sh:

    PORT=30000
    TIMEOUT=5000
    NODES=180
    REPLICAS=1

Note: the timeout is in milliseconds, so it's 5 seconds.

  2. To create the cluster, I used: create-cluster start, and create-cluster create (the full command sequence is sketched after this list).
  3. I selected a slave port, 30176, and an observer port, 30001, and used the script at utils/cluster_fail_time.tcl. I just edited the script and changed the two ports.
  4. Executing the script, it measures how much time it takes to go from FAIL? to FAIL. This is what I get as an average of 10 runs: AVG(10): 3309.6. Again these are milliseconds, so after about 3.3 seconds the system switches from FAIL? to FAIL in the case of a slave.
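
A minimal consolidated sketch of the sequence above, assuming a Redis source checkout (running the Tcl script with tclsh is an assumption about how cluster_fail_time.tcl is invoked):

    cd utils/create-cluster
    # config.sh already contains: PORT=30000 TIMEOUT=5000 NODES=180 REPLICAS=1
    ./create-cluster start
    ./create-cluster create
    cd ../..
    # after editing the slave (30176) and observer (30001) ports inside it:
    tclsh utils/cluster_fail_time.tcl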

Please could you test again with the latest unstable branch? If it still does not work, the following information may be useful:

  1. Make sure node-timeout is the same in all the nodes, inspecting it with:

    ./redis-trib.rb call 127.0.0.1:30001 config get cluster-node-timeout

  2. If the node timeout is ok, then while FAIL? fails to get promoted to FAIL, check how many failure reports the nodes are receiving, with the following command:

    redis-cli -p 30001 cluster count-failure-reports 56ec6aea6f347f421a26d22a333c0fe7fcbc52ad
    (integer) 0

This is, from the point of view of the node at port 30001, how many failure reports are currently being received within the window of node_timeout*2. For the PFAIL->FAIL state transition to happen, we need reports from the majority of the masters. (A simple polling loop is sketched below.)
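
For instance, a quick polling loop like the following (using the example node ID above; the threshold of 46 assumes the 90-master setup, where the majority is more than half) shows whether the count ever approaches the majority:

    # Poll the observer's failure-report count once per second.
    # With 90 masters, PFAIL->FAIL promotion needs a majority: >= 46 reports.
    NODE_ID=56ec6aea6f347f421a26d22a333c0fe7fcbc52ad
    while true; do
      n=`redis-cli -p 30001 cluster count-failure-reports $NODE_ID`
      echo "failure reports: $n (majority: 46)"
      sleep 1
    done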

On top of that, any detail about how you run the cluster and how the failure is simulated could help. Thanks!


antirez commented Jan 30, 2015

p.s. also please make sure the binary is updated to unstable before re-running the cluster nodes. This is a very common source of error when trying to verify a fix.


Hailei commented Jan 30, 2015

@antirez I have re-run the cluster on branch 3.0.
Results:
master: 11~13s
slave: 20~70s

I think the binary in the previous test was not compiled correctly.


Hailei commented Jan 30, 2015

Details: the same count-failure-reports query, repeated over about a minute (around 21:13), against the node at port 7385:

    $ redis-cli -p 7385 cluster count-failure-reports 47a7c3bc109deeeac5f16a4e634a989d89c994c7
    (integer) 19
    (integer) 27
    (integer) 27
    (integer) 28
    (integer) 29
    (integer) 27
    (integer) 28
    (integer) 26
    (integer) 23

(successive return values of the repeated command; with 80 masters, promotion needs a majority of at least 41 reports, which these counts never reach)

Fail detection cost: 103s


antirez commented Jan 30, 2015

Thanks @Hailei ... I'll try this more. Today I'll release the new RC as a first step, but it's clear we need more work to understand the corner case you are seeing here. Thank you, news ASAP.


Hailei commented Feb 2, 2015

I have tested on RC3; node_timeout is still 5s.
A failed master takes 9 seconds to complete fail detection, including being marked PFAIL and the PFAIL->FAIL promotion.
A failed slave takes 11 seconds.
Comparing the RC with the previously tested branch, I found the commit "Cluster: some bias towards FAIL/PFAIL nodes in gossip sections."
This commit brings significant speed improvements.
