Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only allow rebalance operations to run if all shard store data is available #14591

Merged
merged 6 commits into from Nov 10, 2015

Conversation

s1monw
Copy link
Contributor

@s1monw s1monw commented Nov 6, 2015

This commit prevents running rebalance operations if the store allocator is
still fetching async shard / store data to prevent pre-mature rebalance decisions
which need to be reverted once shard store data is available. This is typically happening
on rolling restarts which can make those restarts extremely painful.

Closes #14387

…ilable

This commit prevents running rebalance operations if the store allocator is
still fetching async shard / store data to prevent pre-mature rebalance decisions
which need to be reverted once shard store data is available. This is typically happening
on rolling restarts which can make those restarts extremely painful.

Closes elastic#14387
@s1monw
Copy link
Contributor Author

s1monw commented Nov 6, 2015

I tried to implement this with the least impact possible code wise to safely backport to 1.7

@@ -76,7 +76,20 @@ public boolean allocateUnassigned(RoutingAllocation allocation) {

@Override
public boolean rebalance(RoutingAllocation allocation) {
return allocator.rebalance(allocation);
final int numberOfInFlightFetch = gatewayAllocator.getNumberOfInFlightFetch();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have a race condition here where the gateway's requests have come back before we got here on the cluster state update thread. The in flight ops will be 0 but the results are not used yet. I wonder if we should have a flag on the allocation, something like "disableRebalance" which we can set in the gateway allocator. Maybe that will be simpler?

allocation to ensure that we skip rebalance if we have not processed
all pending shard / store fetches.
@s1monw
Copy link
Contributor Author

s1monw commented Nov 8, 2015

@bleskes I pushed a new commit

*/
return allocator.rebalance(allocation);
} else {
logger.debug("skip rebalance more that on shard/store fetch operations is still in-flight");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"skipping rebalance due to in-flight shard/store fetches" ?

@bleskes
Copy link
Contributor

bleskes commented Nov 9, 2015

looks great. Left one minor comment in ReplicaShardAllocator

@s1monw
Copy link
Contributor Author

s1monw commented Nov 9, 2015

@bleskes pushed a new commit

@bleskes
Copy link
Contributor

bleskes commented Nov 9, 2015

LGTM. Thanks @s1monw

@s1monw
Copy link
Contributor Author

s1monw commented Nov 9, 2015

@ywelsch you mentioned an integration test for this you added, is the test much different to my last commit (s1monw@7b5e323) and if so do you wanna add it to this PR?

@ywelsch
Copy link
Contributor

ywelsch commented Nov 10, 2015

My test looked similar (no need to add more of the same). I did not rely on delayed shard allocation though but explicitly disabled allocation for the restart. Maybe the test can be randomised to pick one of these approaches and also chose a random number of shards?

@s1monw
Copy link
Contributor Author

s1monw commented Nov 10, 2015

Maybe the test can be randomised to pick one of these approaches and also chose a random number of shards?

I don't like the randominzation aspect here. I think the test should be really stable to just test what it is supposed to test.

s1monw added a commit that referenced this pull request Nov 10, 2015
Only allow rebalance operations to run if all shard store data is available
@s1monw s1monw merged commit 17913c5 into elastic:master Nov 10, 2015
s1monw added a commit that referenced this pull request Nov 10, 2015
Only allow rebalance operations to run if all shard store data is available
s1monw added a commit that referenced this pull request Nov 10, 2015
Only allow rebalance operations to run if all shard store data is available
s1monw added a commit that referenced this pull request Nov 10, 2015
Only allow rebalance operations to run if all shard store data is available
@lcawl lcawl added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Allocation labels Feb 13, 2018
@clintongormley clintongormley added :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v2.0.1 v2.1.0 v2.2.0 v5.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants