Bug #11798

upstart: configuration is too generous on restarts

Added by Greg Farnum almost 9 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport: hammer, firefly
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

See https://bugzilla.redhat.com/show_bug.cgi?id=1210871 for the investigation that prompted this.

Our current upstart scripts are probably too generous about restarting processes. At the moment each daemon is configured to restart as long as it doesn't exceed 5 crashes in 30 seconds. The restart process on some of them can exceed 6 seconds (at least some of the time), in which case a daemon cannot accumulate more than 5 crashes in 30 seconds and the limit never triggers. Any daemon that crashes that frequently is probably stuck on a disk state issue.

We need to run some tests to figure out more reasonable values and change them.
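
To make the arithmetic concrete, here is a minimal Python sketch of a respawn limit. It is a simplified sliding-window model, not upstart's actual implementation: it assumes a daemon crashes at a fixed interval and that respawning stops once more than `limit_count` crashes fall inside the last `limit_window` seconds. The function name and the 7-second crash interval are hypothetical values chosen for illustration.

```python
def crashes_until_stop(crash_interval, limit_count, limit_window, max_crashes=1000):
    """Return the 1-based number of the crash at which respawning stops.

    Simplified model (an assumption, not upstart's real code): the daemon
    crashes every `crash_interval` seconds, and respawning stops as soon as
    more than `limit_count` crashes lie within the last `limit_window` seconds.
    Returns None if the limit is never hit within `max_crashes` crashes.
    """
    recent = []
    for n in range(1, max_crashes + 1):
        now = n * crash_interval
        recent.append(now)
        # Drop crashes that have aged out of the window ending at `now`.
        recent = [t for t in recent if now - t < limit_window]
        if len(recent) > limit_count:
            return n
    return None


# A daemon that takes 7 seconds per crash/restart cycle:
print(crashes_until_stop(7, 5, 30))    # None: "5 in 30s" never trips
print(crashes_until_stop(7, 3, 1800))  # 4: "3 in 30 minutes" stops at the 4th crash
# A daemon that crashes every second does trip the old limit:
print(crashes_until_stop(1, 5, 30))    # 6
```

Under this model, any crash cycle longer than about 6 seconds slips past a limit of 5 in 30 seconds indefinitely, while a window measured in minutes catches it after a handful of crashes.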


Related issues

Copied to devops - Backport #13168: upstart: configuration is too generous on restarts Resolved 05/28/2015
Copied to devops - Backport #13091: upstart: configuration is too generous on restarts Resolved 05/28/2015

Associated revisions

Revision eaff6cb2 (diff)
Added by Sage Weil almost 9 years ago

upstart: limit respawn to 3 in 30 mins (instead of 5 in 30s)

It may take tens of seconds to restart each time, so 5 in 30s does not stop
the crash on startup respawn loop in many cases. In particular, we'd like
to catch the case where the internal heartbeats fail.

This should be enough for all but the most sluggish of OSDs and capture
many cases of failure shortly after startup.

Fixes: #11798
Signed-off-by: Sage Weil <>

Revision b3822f11 (diff)
Added by Sage Weil over 8 years ago

upstart: limit respawn to 3 in 30 mins (instead of 5 in 30s)

It may take tens of seconds to restart each time, so 5 in 30s does not stop
the crash on startup respawn loop in many cases. In particular, we'd like
to catch the case where the internal heartbeats fail.

This should be enough for all but the most sluggish of OSDs and capture
many cases of failure shortly after startup.

Fixes: #11798
Signed-off-by: Sage Weil <>
(cherry picked from commit eaff6cb24ef052c54dfa2131811758e335f19939)

Revision 20ad17d2 (diff)
Added by Sage Weil over 8 years ago

upstart: limit respawn to 3 in 30 mins (instead of 5 in 30s)

It may take tens of seconds to restart each time, so 5 in 30s does not stop
the crash on startup respawn loop in many cases. In particular, we'd like
to catch the case where the internal heartbeats fail.

This should be enough for all but the most sluggish of OSDs and capture
many cases of failure shortly after startup.

Fixes: #11798
Signed-off-by: Sage Weil <>
(cherry picked from commit eaff6cb24ef052c54dfa2131811758e335f19939)

History

#1 Updated by Sage Weil almost 9 years ago

How about 5 restarts in 10 minutes?

#2 Updated by Sage Weil almost 9 years ago

  • Status changed from New to Fix Under Review

#3 Updated by Sage Weil almost 9 years ago

  • Assignee set to Sage Weil

#4 Updated by Greg Farnum almost 9 years ago

  • Status changed from Fix Under Review to Resolved

Merged in commit:172d3ac8744c876a0f6ed99f4d63d95ea899cf85. We now do 3 restarts in 30 minutes on OSD, Mon, and MDS.

#6 Updated by Loïc Dachary over 8 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to hammer

#7 Updated by Loïc Dachary over 8 years ago

  • Status changed from Pending Backport to Resolved

#8 Updated by Ken Dreyer over 8 years ago

  • Backport changed from hammer to hammer, firefly

We're planning to ship this fix downstream in the RHCS 1.2 series - we might as well get it upstream in Firefly too.

#9 Updated by Ken Dreyer over 8 years ago

  • Status changed from Resolved to Pending Backport

#10 Updated by Nathan Cutler over 8 years ago

  • Project changed from Ceph to devops

#11 Updated by Loïc Dachary over 8 years ago

  • Status changed from Pending Backport to Resolved
