New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Only allow rebalance operations to run if all shard store data is available #14591
Conversation
…ilable This commit prevents running rebalance operations if the store allocator is still fetching async shard / store data to prevent pre-mature rebalance decisions which need to be reverted once shard store data is available. This is typically happening on rolling restarts which can make those restarts extremely painful. Closes elastic#14387
I tried to implement this with the least impact possible code wise to safely backport to 1.7 |
@@ -76,7 +76,20 @@ public boolean allocateUnassigned(RoutingAllocation allocation) { | |||
|
|||
@Override | |||
public boolean rebalance(RoutingAllocation allocation) { | |||
return allocator.rebalance(allocation); | |||
final int numberOfInFlightFetch = gatewayAllocator.getNumberOfInFlightFetch(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have a race condition here where the gateway's requests have come back before we got here on the cluster state update thread. The in flight ops will be 0 but the results are not used yet. I wonder if we should have a flag on the allocation, something like "disableRebalance" which we can set in the gateway allocator. Maybe that will be simpler?
allocation to ensure that we skip rebalance if we have not processed all pending shard / store fetches.
@bleskes I pushed a new commit |
*/ | ||
return allocator.rebalance(allocation); | ||
} else { | ||
logger.debug("skip rebalance more that on shard/store fetch operations is still in-flight"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"skipping rebalance due to in-flight shard/store fetches" ?
looks great. Left one minor comment in ReplicaShardAllocator |
@bleskes pushed a new commit |
LGTM. Thanks @s1monw |
@ywelsch you mentioned an integration test for this you added, is the test much different to my last commit (s1monw@7b5e323) and if so do you wanna add it to this PR? |
My test looked similar (no need to add more of the same). I did not rely on delayed shard allocation though but explicitly disabled allocation for the restart. Maybe the test can be randomised to pick one of these approaches and also chose a random number of shards? |
I don't like the randominzation aspect here. I think the test should be really stable to just test what it is supposed to test. |
Only allow rebalance operations to run if all shard store data is available
Only allow rebalance operations to run if all shard store data is available
Only allow rebalance operations to run if all shard store data is available
Only allow rebalance operations to run if all shard store data is available
This commit prevents running rebalance operations if the store allocator is
still fetching async shard / store data to prevent pre-mature rebalance decisions
which need to be reverted once shard store data is available. This is typically happening
on rolling restarts which can make those restarts extremely painful.
Closes #14387