[Redis Cluster]Fail detection take a long time in a big cluster #2336
Comments
Hello, this is the same as issue #2285. I'm working on a fix right now. The fix will hopefully be published today.
thanks!
You are welcome, I'll ping you here in the hope you can independently verify the fix once it's committed :-) Thank you for reporting.
I will do so gladly
Hello again, fix provided in the |
@antirez branch 3.0 right? I can't find branch 3.0.0 |
Yes, sorry, 3.0, or testing, which are identical currently. |
I have tested on branch 3.0 in the same environment. node_timeout is still 5s.
Thanks @Hailei, unfortunately I'm currently not able to replicate the issue. This is how I tested (however, to be honest, I tested it against unstable, which now has a few more optimizations in this regard):
Note: the timeout is in milliseconds, so it's 5 seconds.
Please could you test again with the latest
This is, from the point of view of the node at port 30001, how many failure reports are currently being received within the window of node_timeout*2. For the PFAIL->FAIL state transition to happen, we need failure reports from the majority of the masters. On top of that, any detail about how you run the cluster and how the failure is simulated could help. Thanks!
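To make the PFAIL->FAIL condition above concrete, here is a minimal Python sketch of the check, not the actual Redis C code. The names `FailureReport` and `can_mark_fail` are hypothetical, and counting the local node's own vote when it is a master is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class FailureReport:
    reporter: str     # node ID of the master that reported the target as PFAIL
    received_ms: int  # local time (ms) at which the report arrived


def can_mark_fail(reports, now_ms, node_timeout_ms, master_count, self_is_master=True):
    """Return True if enough fresh reports exist to promote PFAIL to FAIL."""
    # Reports older than node_timeout*2 are outside the validity window.
    window_ms = node_timeout_ms * 2
    fresh = {r.reporter for r in reports if now_ms - r.received_ms <= window_ms}
    # Assumption: the local node counts its own opinion if it is a master.
    votes = len(fresh) + (1 if self_is_master else 0)
    # A majority of the masters is required.
    needed = master_count // 2 + 1
    return votes >= needed


if __name__ == "__main__":
    reports = [FailureReport(f"m{i}", received_ms=9_000) for i in range(40)]
    print(can_mark_fail(reports, now_ms=10_000, node_timeout_ms=5_000, master_count=80))
```

With 80 masters the majority is 41, so in this sketch 40 fresh reports plus the local master's own vote are just enough. If reports arrive more slowly than they expire, the threshold is never reached within the window.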
p.s. also please make sure the binary is updated to unstable before re-running the cluster nodes. This is a very common source of error when trying to verify a fix.
@antirez I have re-run the cluster on branch 3.0. I think the binary in my previous test was not compiled correctly.
detail
Failure detection took 103s.
Thanks @Hailei ... I'll try this more. Today I'll release the new RC as a first step, but it's clear we need more work to understand the corner case you are seeing here. Thank you, news ASAP.
I have tested on RC3, node_timeout is still 5s
Scenario:
Redis Cluster 3.0 RC1
160 instances: 80 masters and 80 slaves, one slave per master
10 machines
node_timeout: 15s
The following shell script was used to measure the failure detection time:
After many test runs, I found that the detection time is above 30s on average.
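The measurement script itself is not shown in this thread. As a rough sketch of how such a measurement could be taken (this is not the reporter's script), the Python below polls one observer node with `redis-cli cluster nodes` and waits until the target carries the `fail` flag rather than `fail?`. The helper names `node_flags` and `seconds_until_fail` and the polling interval are assumptions.

```python
import subprocess
import time


def node_flags(cluster_nodes_output, target_addr):
    """Return the set of flags reported for target_addr, or an empty set."""
    for line in cluster_nodes_output.splitlines():
        fields = line.split()
        # Field 1 is the node address, field 2 the comma-separated flags.
        # Newer Redis versions append "@cport" to the address, so strip it.
        if len(fields) >= 3 and fields[1].split("@")[0] == target_addr:
            return set(fields[2].split(","))
    return set()


def seconds_until_fail(observer_host, observer_port, target_addr, poll_interval=0.1):
    """Poll one observer node until it flags target_addr as FAIL (not PFAIL)."""
    start = time.monotonic()
    while True:
        out = subprocess.run(
            ["redis-cli", "-h", observer_host, "-p", str(observer_port),
             "cluster", "nodes"],
            capture_output=True, text=True, check=True,
        ).stdout
        # "fail?" means PFAIL (local suspicion); "fail" means agreed FAIL.
        if "fail" in node_flags(out, target_addr):
            return time.monotonic() - start
        time.sleep(poll_interval)
```

Start the timer at the moment the target is stopped, for example by calling `seconds_until_fail` right after killing the process. The result then includes node_timeout itself plus the time needed to collect reports from a majority of masters.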
I think this is too long if node_timeout is 5s. Looking into the code, the gossip section picks three random nodes each time. It should consider failing nodes first, e.g. pick the first entry from the set of failing nodes.
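The following Python sketch illustrates both halves of this argument under simplifying assumptions: gossip entries are drawn uniformly without repetition from the other nodes, and packets are independent. It is not the Redis implementation. `mention_probability` and `pick_gossip_entries` are hypothetical names, and the second function shows one possible form of the proposed "failing nodes first" selection.

```python
import random


def mention_probability(total_nodes, entries_per_packet=3):
    """Chance that one packet's gossip section names one specific other node."""
    # Simplification: entries are sampled uniformly from all other nodes.
    others = total_nodes - 1
    return min(1.0, entries_per_packet / others)


def pick_gossip_entries(known_nodes, pfail_nodes, entries=3, rng=random):
    """Fill the gossip section with suspected-failing nodes first, then random others."""
    suspected = [n for n in known_nodes if n in pfail_nodes]
    chosen = rng.sample(suspected, min(entries, len(suspected)))
    # Top up with random nodes that were not already chosen.
    remaining = [n for n in known_nodes if n not in chosen]
    chosen += rng.sample(remaining, min(entries - len(chosen), len(remaining)))
    return chosen


if __name__ == "__main__":
    p = mention_probability(160)
    print(f"per-packet probability: {p:.4f}, expected packets: {1 / p:.0f}")
    nodes = [f"n{i}" for i in range(159)]
    print(pick_gossip_entries(nodes, pfail_nodes={"n42"}))
```

For the 160-node cluster described here, purely random selection gives a per-packet probability of 3/159 under these assumptions, so a given sender needs about 53 packets on average before it mentions one particular node. With the prioritized selection, a node that the sender already suspects is included in every packet.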