Delayed allocation causing partial allocation of shards on allocation awareness #14010

Closed
ppf2 opened this issue Oct 7, 2015 · 1 comment
Labels
>bug :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) help wanted adoptme

Comments


ppf2 commented Oct 7, 2015

A full repro is hard to describe in words, so I recorded a video of it (linked below).

The test uses the latest 1.7.2 release.

In short: 6 nodes in the cluster, 1 index with 4 shards and 2 replicas (3 copies of each shard).
Each node has 2 awareness attributes set (updateDomain and faultDomain), both forced. 3 nodes are in one updateDomain and the other 3 are in the other updateDomain. The nodes are also in different faultDomains. Delayed allocation is set to 10s for the test so that allocation kicks in sooner.
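
For readers without access to the linked node setup document, the layout described above could look roughly like the following sketch. The setting keys are the 1.x-style awareness and delayed-allocation settings; the domain values (ud0, fd0, and so on) and the way faultDomains are spread across nodes are assumptions made for illustration, not the values used in the actual test.

```python
# Hypothetical attribute values: the issue text does not list the real ones.
UPDATE_DOMAINS = ["ud0", "ud1"]
FAULT_DOMAINS = ["fd0", "fd1", "fd2"]


def node_settings(update_domain, fault_domain):
    """Per-node settings, written as flat elasticsearch.yml-style keys."""
    return {
        # Custom node attributes (1.x style: node.<attribute>)
        "node.updateDomain": update_domain,
        "node.faultDomain": fault_domain,
        # Both attributes take part in allocation awareness...
        "cluster.routing.allocation.awareness.attributes": "updateDomain,faultDomain",
        # ...and both are forced, so the expected values are listed up front
        "cluster.routing.allocation.awareness.force.updateDomain.values": ",".join(UPDATE_DOMAINS),
        "cluster.routing.allocation.awareness.force.faultDomain.values": ",".join(FAULT_DOMAINS),
    }


# 6 nodes: 3 per updateDomain (the faultDomain spread is an assumption)
NODES = [node_settings(ud, fd) for ud in UPDATE_DOMAINS for fd in FAULT_DOMAINS]

# Index-level settings: 4 shards, 2 replicas, delayed allocation at 10s
INDEX_SETTINGS = {
    "index.number_of_shards": 4,
    "index.number_of_replicas": 2,
    "index.unassigned.node_left.delayed_timeout": "10s",
}
```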

When an updateDomain is killed (3 nodes gone), the cluster shows only partial allocation of shards until it is prodded, either by a manual _cluster/reroute command (with no POST body) or by any command that updates the cluster state (e.g. creating an index). Once a manual reroute (which in itself changes nothing) is run or the cluster state is updated, the remaining shards are immediately allocated successfully according to the awareness settings.
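
As a rough illustration of the manual prod, the empty reroute is just a POST to _cluster/reroute with no body. The sketch below only builds the request; the host and port are assumptions (a default local node), and send_reroute is a hypothetical helper that is not called here.

```python
import urllib.request


def build_reroute_request(base_url="http://localhost:9200"):
    """Build a POST to _cluster/reroute with no body (base_url is an assumption)."""
    url = base_url.rstrip("/") + "/_cluster/reroute"
    return urllib.request.Request(url, method="POST")


def send_reroute(base_url="http://localhost:9200"):
    """Hypothetical helper: actually send the request to a running cluster."""
    with urllib.request.urlopen(build_reroute_request(base_url)) as resp:
        return resp.status
```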

If delayed allocation is turned off entirely, then everything works fine and there is no need to manually prod it to complete the rest of the allocation.

Note that with delayed allocation on, it sometimes does the right thing, but if you retest a few times by stopping and restarting the 3 nodes, you will see that it does not do so consistently.

Repro video:
https://drive.google.com/file/d/0B1rxJ0dAZbQvRUE0SlVxT2pOZFE/view?usp=sharing

Node setup:
https://docs.google.com/document/d/1J5FPSvIA5U41Ou1BNpEN9P7q2L8e7KMxM69IG4dGMkk/edit?usp=sharing

@clintongormley

Also see #14011

ywelsch pushed a commit that referenced this issue Nov 12, 2015
After a delayed reroute of a shard, RoutingService misses to schedule a new delayed reroute of other delayed shards.

Closes #14494
Closes #14010
Closes #14445
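
The commit message above can be pictured with a toy model. This is not the actual RoutingService code, only a sketch under the assumption that a single timer is registered for the earliest delayed shard; the function name and the reschedule flag are made up for illustration. When no new timer is registered after the first one fires, shards whose delay expires later stay unassigned until something else triggers a reroute, which matches the partial allocation reported in this issue.

```python
def run_delayed_reroutes(delays, reschedule):
    """Simulate timer-driven delayed reroutes.

    delays: expiry time in seconds for each delayed shard (index = shard id).
    reschedule: whether a new timer is registered after each one fires.
    Returns the shard ids assigned by timers alone, in assignment order.
    """
    pending = sorted(range(len(delays)), key=lambda i: delays[i])
    assigned = []
    if not pending:
        return assigned
    # Only the earliest expiry gets a timer up front.
    timer = delays[pending[0]]
    while timer is not None:
        now = timer
        # The reroute assigns every shard whose delay has expired by now.
        assigned.extend(i for i in pending if delays[i] <= now)
        pending = [i for i in pending if delays[i] > now]
        # Without this step, the remaining delayed shards are never revisited.
        timer = delays[pending[0]] if (reschedule and pending) else None
    return assigned
```

With delays of 10, 12 and 11 seconds, the model without rescheduling assigns only the first shard, while the model with rescheduling eventually assigns all three.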
@lcawl lcawl added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Allocation labels Feb 13, 2018
@clintongormley clintongormley added :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 14, 2018

3 participants