Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'I/O operation on closed file' on SIGHUP #863

Closed
wmttom opened this issue Aug 25, 2014 · 22 comments · Fixed by #865
Closed

'I/O operation on closed file' on SIGHUP #863

wmttom opened this issue Aug 25, 2014 · 22 comments · Fixed by #865

Comments

@wmttom
Copy link

wmttom commented Aug 25, 2014

2014-08-25 13:51:42 [16744] [INFO] Handling signal: hup
2014-08-25 13:51:42 [16744] [INFO] Hang up: Master
2014-08-25 13:51:42 [7683] [INFO] Booting worker with pid: 7683
2014-08-25 13:51:42 [7684] [INFO] Booting worker with pid: 7684
2014-08-25 13:51:42 [7685] [INFO] Booting worker with pid: 7685
2014-08-25 13:51:42 [7686] [INFO] Booting worker with pid: 7686
2014-08-25 13:51:42 [7687] [INFO] Booting worker with pid: 7687
2014-08-25 13:51:42 [7688] [INFO] Booting worker with pid: 7688
2014-08-25 13:51:42 [7689] [INFO] Booting worker with pid: 7689
2014-08-25 13:51:42 [38739] [INFO] Worker exiting (pid: 38739)
2014-08-25 13:51:42 [38743] [INFO] Worker exiting (pid: 38743)
2014-08-25 13:51:42 [38745] [INFO] Worker exiting (pid: 38745)
2014-08-25 13:51:42 [38744] [INFO] Worker exiting (pid: 38744)
2014-08-25 13:51:42 [38741] [INFO] Worker exiting (pid: 38741)
2014-08-25 13:51:42 [38742] [INFO] Worker exiting (pid: 38742)
2014-08-25 13:51:42 [38740] [INFO] Worker exiting (pid: 38740)
2014-08-25 13:51:42 [38746] [INFO] Worker exiting (pid: 38746)
2014-08-25 13:51:42 [7690] [INFO] Booting worker with pid: 7690
2014-08-25 13:51:43 [16744] [INFO] Unhandled exception in main loop:
Traceback (most recent call last):
  File "/home/httpapi/.virtualenvs/hizonenum/lib/python2.7/site-packages/gunicorn/arbiter.py", line 174, in run
    self.sleep()
  File "/home/httpapi/.virtualenvs/hizonenum/lib/python2.7/site-packages/gunicorn/arbiter.py", line 308, in sleep
    oldest = min(w.tmp.last_update() for w in worker_values)
  File "/home/httpapi/.virtualenvs/hizonenum/lib/python2.7/site-packages/gunicorn/arbiter.py", line 308, in <genexpr>
    oldest = min(w.tmp.last_update() for w in worker_values)
  File "/home/httpapi/.virtualenvs/hizonenum/lib/python2.7/site-packages/gunicorn/workers/workertmp.py", line 50, in last_update
    return os.fstat(self._tmp.fileno()).st_ctime
ValueError: I/O operation on closed file

2014-08-25 13:51:43 [7689] [INFO] Worker exiting (pid: 7689)
2014-08-25 13:51:43 [7683] [INFO] Worker exiting (pid: 7683)
2014-08-25 13:51:43 [7687] [INFO] Worker exiting (pid: 7687)
2014-08-25 13:51:43 [7690] [INFO] Worker exiting (pid: 7690)
2014-08-25 13:51:43 [7684] [INFO] Worker exiting (pid: 7684)
2014-08-25 13:51:43 [7688] [INFO] Worker exiting (pid: 7688)
2014-08-25 13:51:43 [7686] [INFO] Worker exiting (pid: 7686)
2014-08-25 13:51:43 [7685] [INFO] Worker exiting (pid: 7685)
@svdgraaf
Copy link

Interestingly, I had the same issue, but at a random point in time:

[2014-08-30 03:13:25 +0000] [9531] [CRITICAL] WORKER TIMEOUT (pid:15513)
[2014-08-30 03:13:26 +0000] [18391] [INFO] Booting worker with pid: 18391
[2014-08-30 03:13:27 +0000] [9531] [CRITICAL] WORKER TIMEOUT (pid:16607)
[2014-08-30 03:13:28 +0000] [9531] [CRITICAL] WORKER TIMEOUT (pid:17159)
[2014-08-30 03:13:28 +0000] [18428] [INFO] Booting worker with pid: 18428
[2014-08-30 03:13:28 +0000] [18429] [INFO] Booting worker with pid: 18429
[2014-08-30 03:13:30 +0000] [9531] [CRITICAL] WORKER TIMEOUT (pid:16850)
[2014-08-30 03:13:31 +0000] [9531] [CRITICAL] WORKER TIMEOUT (pid:16998)
[2014-08-30 03:13:31 +0000] [18538] [INFO] Booting worker with pid: 18538
[2014-08-30 03:13:31 +0000] [18539] [INFO] Booting worker with pid: 18539
[2014-08-30 03:13:34 +0000] [9531] [CRITICAL] WORKER TIMEOUT (pid:17032)
[2014-08-30 03:13:35 +0000] [9531] [CRITICAL] WORKER TIMEOUT (pid:17125)
[2014-08-30 03:13:35 +0000] [18609] [INFO] Booting worker with pid: 18609
[2014-08-30 03:13:35 +0000] [18610] [INFO] Booting worker with pid: 18610
[2014-08-30 03:13:36 +0000] [9531] [CRITICAL] WORKER TIMEOUT (pid:16812)
[2014-08-30 03:13:37 +0000] [16812] [INFO] Worker exiting (pid: 16812)
Exception _mysql_exceptions.ProgrammingError: (2014, "Commands out of sync; you can't run this command now") in <bound method Cursor.__del__ of <MySQLdb.cursors.Cursor object at 0x38ae210>> ignored
[2014-08-30 03:13:37 +0000] [18682] [INFO] Booting worker with pid: 18682
127.0.0.1 - - [30/Aug/2014:03:13:37 +0200] "POST [redacted] HTTP/1.0" 200 - "[redacted]" "Mozilla/5.0 (Linux; Android 4.3; SAMSUNG-SGH-I747 Build/JSS15J) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.141 Mobile Safari/537.36"
127.0.0.1 - - [30/Aug/2014:03:13:38 +0200] "GET [redacted] HTTP/1.0" 200 - "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0"
[2014-08-30 03:15:00 +0000] [9531] [CRITICAL] WORKER TIMEOUT (pid:18539)
[2014-08-30 03:15:01 +0000] [9531] [CRITICAL] WORKER TIMEOUT (pid:18391)
[2014-08-30 03:15:01 +0000] [9531] [CRITICAL] WORKER TIMEOUT (pid:18538)
[2014-08-30 03:15:01 +0000] [9531] [CRITICAL] WORKER TIMEOUT (pid:18609)
[2014-08-30 03:15:01 +0000] [9531] [CRITICAL] WORKER TIMEOUT (pid:18610)
[2014-08-30 03:15:01 +0000] [9531] [CRITICAL] WORKER TIMEOUT (pid:18428)
[2014-08-30 03:15:01 +0000] [9531] [CRITICAL] WORKER TIMEOUT (pid:18429)
[2014-08-30 03:15:01 +0000] [18919] [INFO] Booting worker with pid: 18919
[2014-08-30 03:15:01 +0000] [9531] [INFO] Unhandled exception in main loop:
Traceback (most recent call last):
  File "/app/shared/venv/local/lib/python2.7/site-packages/gunicorn/arbiter.py", line 174, in run
    self.sleep()
  File "/app/shared/venv/local/lib/python2.7/site-packages/gunicorn/arbiter.py", line 308, in sleep
    oldest = min(w.tmp.last_update() for w in worker_values)
  File "/app/shared/venv/local/lib/python2.7/site-packages/gunicorn/arbiter.py", line 308, in <genexpr>
    oldest = min(w.tmp.last_update() for w in worker_values)
  File "/app/shared/venv/local/lib/python2.7/site-packages/gunicorn/workers/workertmp.py", line 50, in last_update
    return os.fstat(self._tmp.fileno()).st_ctime
ValueError: I/O operation on closed file

[2014-08-30 03:15:01 +0000] [18919] [INFO] Worker exiting (pid: 18919)
[2014-08-30 03:15:01 +0000] [18682] [INFO] Worker exiting (pid: 18682)

gunicorn==19.1.1

benoitc added a commit that referenced this issue Aug 30, 2014
When the worker exited the tempfile is not available anymore so we can't
get the last update and calculate the dynamic timeout introduced in
d76bab4 .

This changes fix it by catching the IO error.

fix #863
@benoitc
Copy link
Owner

benoitc commented Aug 30, 2014

I committed a fix in the branch #865 . Please review.

benoitc added a commit that referenced this issue Aug 30, 2014
When the worker exited the tempfile is not available anymore so we can't
get the last update and calculate the dynamic timeout introduced in
d76bab4 .

This changes fix it by catching the IO error.

fix #863
@sholsapp
Copy link
Contributor

Hey @benoitc et al., any idea when will we get a release that has this fix in it?

@benoitc
Copy link
Owner

benoitc commented Nov 10, 2014

I will make a minor release tomorrow :)

@sholsapp
Copy link
Contributor

Thanks @benoitc that would awesome. 👍

@sholsapp
Copy link
Contributor

Hey @benoitc I don't see it on PyPI quite yet. Just a friendly reminder.

@nbaggott
Copy link

@benoitc, looking forward to the new release. This bug is very severe for high volume apps that aggressively recycle workers. In the interm, I'm working around by applying a monkey patch - but that doesn't feel good.

@benoitc
Copy link
Owner

benoitc commented Nov 12, 2014

Sorry i was too much busy to prepare the release, but I will tomorrow.

@nbaggott
Copy link

@benoitc, I've been testing the patch (applied via monkey) and it doesn't seem to completely fix the problem. My app still managed to crash with a stack trace like:

2014-11-17 14:38:54 [4910] [INFO] Unhandled exception in main loop:
Traceback (most recent call last):
  File "/export/content/lid/apps/ekgapp/9ef005b54139d8dba20bd80e24b34b25fc7709f8/lib/python2.6/site-packages/gunicorn/arbiter.py", line 174, in run
    self.sleep()
  File "/export/content/lid/apps/ekgapp/9ef005b54139d8dba20bd80e24b34b25fc7709f8/lib/python2.6/site-packages/ekgapp/gunicorn_conf.py", line 42, in sleep
    last_update = w.tmp.last_update()
  File "/export/content/lid/apps/ekgapp/9ef005b54139d8dba20bd80e24b34b25fc7709f8/lib/python2.6/site-packages/gunicorn/workers/workertmp.py", line 50, in last_update
    return os.fstat(self._tmp.fileno()).st_ctime
OSError: [Errno 9] Bad file descriptor

I think the except block in the sleep() needs to ignore OSError in addition to ValueError.

I spent some time trying to figure out how this could happen and my best guess is that there is still a race condition caused by the signal handler for SIGCHLD, which calls worker.tmp.close() in reap_workers(). If the signal handler is called before/during sleep() then WorkerTmp.last_update() will throw "OSError: [Errno 9] Bad file descriptor" when it calls os.fstat() on a file descriptor that doesn't exist. But this seems like it'd be the same race condition that caused the ValueError reported in this bug. In the OS closing a file descriptor is atomic but in python maybe there's two steps and you get a different exception depending on your exact timing.

@benoitc
Copy link
Owner

benoitc commented Nov 18, 2014

@nbaggott are you under Python 3?

@nbaggott
Copy link

@benoitc python 2.6

@GuoJing
Copy link

GuoJing commented Dec 11, 2014

I need this fix, when is the next release?

Thank you.

@sholsapp
Copy link
Contributor

Ping.

@benoitc
Copy link
Owner

benoitc commented Dec 16, 2014

release is planned tomorrow, sorry for the delay.

@GuoJing
Copy link

GuoJing commented Dec 17, 2014

Thank you @benoitc .

@NadyaIonova
Copy link

@benoitc Hi! I'm waiting for this release too. Are there any plans to release soon? Thank you!

@tilgovi
Copy link
Collaborator

tilgovi commented Jan 14, 2015

Sorry, I may have been holding this up with review of an issue with the
threaded worker. @benoitc I'll try to do this tomorrow.

On Wed, Jan 14, 2015, 14:50 Nadya notifications@github.com wrote:

@benoitc https://github.com/benoitc Hi! I'm waiting for this release
too. Are there any plans to release soon? Thank you!


Reply to this email directly or view it on GitHub
#863 (comment).

@techiaith
Copy link

I can confirm this on python3.4/gunicorn 19.1.1
Not very nice for production servers!

[2015-01-20 10:02:57 +0000] [1124] [INFO] Unhandled exception in main loop:
Traceback (most recent call last):
  File "/home/techiaith/.virtualenvs/api_centre/lib/python3.4/site-packages/gunicorn/arbiter.py", line 174, in run
    self.sleep()
  File "/home/techiaith/.virtualenvs/api_centre/lib/python3.4/site-packages/gunicorn/arbiter.py", line 308, in sleep
    oldest = min(w.tmp.last_update() for w in worker_values)
  File "/home/techiaith/.virtualenvs/api_centre/lib/python3.4/site-packages/gunicorn/arbiter.py", line 308, in <genexpr>
    oldest = min(w.tmp.last_update() for w in worker_values)
  File "/home/techiaith/.virtualenvs/api_centre/lib/python3.4/site-packages/gunicorn/workers/workertmp.py", line 50, in last_update
    return os.fstat(self._tmp.fileno()).st_ctime
ValueError: I/O operation on closed file

If there's anything I can do to help @tilgovi, let me know.

@minusf
Copy link

minusf commented Jan 27, 2015

if you can find the time, please roll a minor release for this issue.

@benoitc
Copy link
Owner

benoitc commented Jan 27, 2015

@minusf working on it. should be done tomorrow .

@olivetree123
Copy link

I met this problem too , why does it happen ? ( python 2.7.5 , gunicorn 19.1.1 )

@benoitc
Copy link
Owner

benoitc commented Jun 23, 2015

@olivetree123 the issue has been fixed in latest stable release. You should upgrade to the latest one.

paulnivin pushed a commit to lyft/gunicorn that referenced this issue Dec 16, 2015
When the worker exited the tempfile is not available anymore so we can't
get the last update and calculate the dynamic timeout introduced in
d76bab4 .

This changes fix it by catching the IO error.

fix benoitc#863
paulnivin pushed a commit to lyft/gunicorn that referenced this issue Dec 16, 2015
The fix for issue benoitc#863 was incomplete as both OSError or ValueError could be
raised by os.fstat(). Thus we need to catch both types of exceptions
paulnivin pushed a commit to lyft/gunicorn that referenced this issue Dec 16, 2015
When the worker exited the tempfile is not available anymore so we can't
get the last update and calculate the dynamic timeout introduced in
d76bab4 .

This changes fix it by catching the IO error.

fix benoitc#863
paulnivin pushed a commit to lyft/gunicorn that referenced this issue Dec 16, 2015
The fix for issue benoitc#863 was incomplete as both OSError or ValueError could be
raised by os.fstat(). Thus we need to catch both types of exceptions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants