Fail shard when index service/mappings fails to instantiate #10283
Conversation
When the index service (which holds shards) fails to be created as a result of a shard being allocated on a node, we should fail the relevant shard; otherwise, it will remain stuck. The same goes when there is a failure to process updated mappings from the master. Note, both failures typically happen when the node is misconfigured (e.g. missing plugins, ...), since the index and mappings get created and processed on the master node before being published. closes elastic#10283
597b7bd to 3f5ffbe
logger.warn("[{}] failed to create index for shard {}", e, indexMetaData.index(), shard.shardId());
failedShards.put(shard.shardId(), new FailedShard(shard.version()));
shardStateAction.shardFailed(shard, indexMetaData.getUUID(),
"failed to create index to allocated shard " + shard.shardId() + ", failure " + ExceptionsHelper.detailedMessage(e),
I think "allocated" should be "allocate" here.
Left one comment about grammar, but LGTM otherwise
@dakrone thanks for the review, updated the log message, will wait for another review
} catch (Exception e) {
logger.warn("[{}] failed to create index", e, indexMetaData.index());
} catch (Throwable e) {
logger.warn("[{}] failed to create index for shard {}", e, indexMetaData.index(), shard.shardId());
does it make sense to put the "fail a shard" logic under a method? to make sure we don't forget to put it in failedShards etc.
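The extraction suggested in this comment could look roughly like the following self-contained sketch. The nested types (`ShardId`, `FailedShard`) and the helper name `failAndRemoveShard` are illustrative stand-ins, not the actual Elasticsearch API; the point is only that recording the shard in `failedShards` and notifying the master happen in one place:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of bundling the "fail a shard" bookkeeping into one helper so a
// caller cannot forget one of the two steps. All names are illustrative.
public class FailShardSketch {
    record ShardId(String index, int id) {}
    record FailedShard(long version) {}

    final Map<ShardId, FailedShard> failedShards = new HashMap<>();
    final StringBuilder reportedToMaster = new StringBuilder();

    // Hypothetical stand-in for shardStateAction.shardFailed(...)
    private void notifyMaster(ShardId shardId, String reason) {
        reportedToMaster.append(shardId).append(": ").append(reason).append('\n');
    }

    // The extracted helper: record the failure locally, then notify the master.
    public void failAndRemoveShard(ShardId shardId, long version, String reason, Throwable cause) {
        failedShards.put(shardId, new FailedShard(version));
        notifyMaster(shardId, reason + ", failure " + cause.getMessage());
    }

    public static void main(String[] args) {
        FailShardSketch service = new FailShardSketch();
        ShardId shardId = new ShardId("my_index", 0);
        service.failAndRemoveShard(shardId, 3L, "failed to create index", new IllegalStateException("missing plugin"));
        System.out.println(service.failedShards.containsKey(shardId)); // prints "true"
    }
}
```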
Change LGTM2 - left some comments regarding logging and error reporting
+1 on getting this into 1.5.1
@bleskes pushed another round, review?
LGTM again. nice clean up. |
…eerRecovery Fails due to elastic#10283
After processing mapping updates from the master, we compare the resulting binary representation of them to the one the cluster state holds. If they differ, we send a refresh mapping request to the master, asking it to reparse the mapping and serialize it again. This mechanism is used to update the mapping after a format change caused by a version upgrade. The very same process can also be triggered when an old master leaves the cluster, triggering a local cluster state update. If that update contains the old mapping format, the local node will again signal the need to refresh, but this time there is no master to accept the request. Instead of failing (which we now do because of #10283), we should just skip the notification and wait for the next elected master to publish a new mapping (triggering another refresh if needed). Closes #10311
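The decision described in this commit message can be reduced to a small function: compare the locally reserialized mapping bytes with what the cluster state carries, and only notify the master when one is actually known. This is an illustrative sketch with hypothetical names, not the actual Elasticsearch code:

```java
import java.util.Arrays;

// Sketch of the refresh-mapping decision after processing mapping updates
// from the master. Method and result names are illustrative.
public class MappingRefreshSketch {
    public static String decide(byte[] localMappingSource, byte[] clusterStateSource, boolean masterKnown) {
        if (Arrays.equals(localMappingSource, clusterStateSource)) {
            return "in-sync";            // representations match, nothing to do
        }
        if (!masterKnown) {
            return "skip";               // no master to accept the request; wait for the next election
        }
        return "send-refresh-mapping";   // ask the master to reparse and reserialize the mapping
    }

    public static void main(String[] args) {
        byte[] oldFormat = {1};
        byte[] newFormat = {2};
        System.out.println(decide(newFormat, newFormat, true));  // prints "in-sync"
        System.out.println(decide(newFormat, oldFormat, false)); // prints "skip" (the #10311 fix)
        System.out.println(decide(newFormat, oldFormat, true));  // prints "send-refresh-mapping"
    }
}
```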