Bug #11798
upstart: configuration is too generous on restarts
Description
See https://bugzilla.redhat.com/show_bug.cgi?id=1210871 for the investigation that prompted this.
Our current upstart scripts are probably too generous about restarting processes. At the moment each daemon is configured to respawn as long as it doesn't exceed 5 crashes in 30 seconds. Restarting some of them can take more than 6 seconds (at least some of the time), so five crashes may never land inside a single 30-second window and the limit is never hit. Any daemon that crashes that frequently is probably stuck on a disk state issue.
We need to run some tests to figure out more reasonable values and change them.
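To make the arithmetic concrete, here is a rough Python sketch of a respawn limit (upstart expresses this as `respawn limit COUNT INTERVAL`). It is a simplified sliding-window model, not upstart's actual implementation: it assumes the daemon crashes at a fixed cadence, and the function name and `cycle_seconds` parameter are made up for illustration.

```python
from collections import deque


def respawns_before_stop(limit, interval, cycle_seconds, max_crashes=1000):
    """Return how many respawns happen before the job is stopped,
    or None if the limit never trips within max_crashes crashes.

    Simplified model (an assumption, not upstart's code): the job
    crashes once every cycle_seconds, and it is stopped as soon as
    more than `limit` respawns fall inside an `interval`-second window.
    """
    recent = deque()  # timestamps of respawns still inside the window
    for n in range(1, max_crashes + 1):
        now = n * cycle_seconds
        # Forget respawns that have aged out of the window.
        while recent and now - recent[0] >= interval:
            recent.popleft()
        recent.append(now)
        if len(recent) > limit:
            return n - 1  # this respawn is refused, the job stays down
    return None


if __name__ == "__main__":
    # (limit, interval in seconds) pairs: the current 5-in-30s setting
    # and the 3-in-30-minutes setting from the associated revision.
    for limit, interval in [(5, 30), (3, 1800)]:
        for cycle in (5, 7, 60):
            result = respawns_before_stop(limit, interval, cycle)
            print(f"limit={limit} interval={interval}s cycle={cycle}s -> {result}")
```

Under this model a daemon that takes 7 seconds per crash-and-restart cycle never trips the 5-in-30s limit, while a 3-in-1800s limit stops it after three respawns, even when each cycle takes a full minute.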
Associated revisions
upstart: limit respawn to 3 in 30 mins (instead of 5 in 30s)
It may take tens of seconds to restart each time, so 5 in 30s does not stop
the crash on startup respawn loop in many cases. In particular, we'd like
to catch the case where the internal heartbeats fail.
This should be enough for all but the most sluggish of OSDs and capture
many cases of failure shortly after startup.
Fixes: #11798
Signed-off-by: Sage Weil <sage@redhat.com>
upstart: limit respawn to 3 in 30 mins (instead of 5 in 30s)
It may take tens of seconds to restart each time, so 5 in 30s does not stop
the crash on startup respawn loop in many cases. In particular, we'd like
to catch the case where the internal heartbeats fail.
This should be enough for all but the most sluggish of OSDs and capture
many cases of failure shortly after startup.
Fixes: #11798
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit eaff6cb24ef052c54dfa2131811758e335f19939)
History
#1 Updated by Sage Weil almost 9 years ago
how about 5 restarts in 10 minutes?
#2 Updated by Sage Weil almost 9 years ago
- Status changed from New to Fix Under Review
#3 Updated by Sage Weil almost 9 years ago
- Assignee set to Sage Weil
#4 Updated by Greg Farnum almost 9 years ago
- Status changed from Fix Under Review to Resolved
Merged in commit:172d3ac8744c876a0f6ed99f4d63d95ea899cf85. We now do 3 restarts in 30 minutes on OSD, Mon, and MDS.
#5 Updated by Sage Weil over 8 years ago
https://github.com/ceph/ceph/pull/5930 (hammer backport)
#6 Updated by Loïc Dachary over 8 years ago
- Status changed from Resolved to Pending Backport
- Backport set to hammer
#7 Updated by Loïc Dachary over 8 years ago
- Status changed from Pending Backport to Resolved
#8 Updated by Ken Dreyer over 8 years ago
- Backport changed from hammer to hammer, firefly
We're planning to ship this fix downstream in the RHCS 1.2 series - we might as well get it upstream in Firefly too.
#9 Updated by Ken Dreyer over 8 years ago
- Status changed from Resolved to Pending Backport
#10 Updated by Nathan Cutler over 8 years ago
- Project changed from Ceph to devops
#11 Updated by Loïc Dachary over 8 years ago
- Status changed from Pending Backport to Resolved