Segfault in v2.1.x in release mode #4917
Comments
Since this seems to be a regression, we need to figure it out asap before shipping 2.1.5. |
William can't reproduce the issue anymore at the moment and we don't understand how it could have happened. |
I am experiencing this error as well since upgrading from 2.1.4 to 2.1.5. I'm on Ubuntu 14.04. I can provide a core dump if needed. [edit] |
@jorrit Thanks for the report. For now I recommend downgrading back to 2.1.4. Could you post the full backtrace please? (the one above didn't include addresses, which should now be visible) A core file would be awesome. Can you email me at daniel@rethinkdb.com please? I will send you instructions on where you can upload the file. Since this bug actually appears to be very common, I've decided to pull the 2.1.5 release from our repositories again until we can resolve this. Sorry for the inconvenience. |
@jorrit Is there a chance you could also give us a copy of your data files? We are happy to sign an NDA if necessary. In any case, the full backtrace and the core file would be extremely valuable already. Thanks a lot for your help! |
The trace:
I have sent you an e-mail with the core dump file. |
I am also experiencing this constantly since upgrading to 2.1.5. I'm running on Ubuntu. However, the server that is running Debian seems to be fine. Where would I get the full stack trace? We had to shut down all the Ubuntu servers because 2.1.5 is completely unstable there and the servers would not stay running for more than 5 minutes. We will roll back to 2.1.4. |
@mshi do you know the query that triggers this in your case? Which version of Ubuntu are you on? As mentioned above, I recommend downgrading back to 2.1.4 until we've resolved this. We've removed 2.1.5 from our Ubuntu repository, so if you do a |
I believe the same thing is happening on my machine, Ubuntu 15.04.
|
@danielmewes I am not sure which exact queries because we have a lot of queries and the error logs were filled with errors that happened after the crash. I will take a look into it later today, but for now I need to roll back all of our servers. But it seems like even servers that are replicas (not primary) crash, so it is probably some sort of read. It is also worth reiterating that this only happens on Ubuntu. |
Also, I guess we have to downgrade all of our servers? |
@mshi -- yes, you don't want to mix 2.1.4 and 2.1.5 in your clusters right now. Sorry everyone about the inconvenience -- this is totally our fault. We'll hunt this bug down ASAP; in the meantime if anyone wants a free t-shirt/stickers/water bottles to (barely) compensate you for your troubles, please email christina@rethinkdb.com with your address/shirt size, and we'll set you up. |
It took us a bunch of hours looking at disassembler output, working with the debugger and looking through code changes, but we finally found the bug. We're going to ship an updated release asap. |
You can't keep us hanging like that man! What's the bug? :) |
Let me wrap up testing and start the build first, then the secret will be revealed. |
@coffeemug -- It's a bug in rethinkdb/src/rdb_protocol/btree.cc Line 563 in b439f9e
|
(The reason it took so long to debug is that the bug manifested in a totally different place than it originated.) |
👏 |
Has 2.1.5-2 been (re)released? I see it on github, but not on download.rethinkdb.com/dist |
@mbrevda Not yet. We're currently building the packages. This typically takes a few hours. We're then going to upload them to our server, at which time the new packages are going to appear in the file list. |
Cool. Are there any other holdups for Homebrew/legacy-homebrew#44707 once the package shows up? |
No, I think we only need to remove the ICU dependency from the recipe then and it should be good to go. |
That's already done. Crazy day, huh? Thanks! |
@danielmewes is there any way of reliably triggering this? Mainly I am trying to make sure that if there is a reasonable test to write out of this we get this class of regression covered in testing. |
@larkost Testing for a bug like this in general is hard (at least I don't know how). A pointer wasn't being updated when moving an object, and so it kept pointing to an old memory location. Some other code would then write through that pointer and overwrite whichever value was stored in that location at that point. Usually Valgrind should catch such problems, but we tried it and it didn't catch the problem in this case. It's possible that there are certain parameters that would enable Valgrind to catch this. I think that would be worth investigating. You can build a binary that works with Valgrind as follows:
(the DEBUG=1 is optional) There is also a suppressions file which should be used to avoid a bunch of false alarms:
In this particular case, the crash could actually be reproduced by the Python connection test, among many others. In fact any test that used a cursor and retrieved more than one batch from it would run into the segmentation fault. However this was only the case with certain compiler and/or library versions, because of memory layouts being different. On |
@danielmewes -- My theory about why Valgrind didn't catch the error is that I think the object we moved out of was stored on the stack, and that GCC knew enough about the semantics of @larkost -- the best change to the test system we could make to catch bugs like this in the future would be to test the compiled release binaries on the platforms they're built for. Our tests reliably reproduced this bug on the Ubuntu 14 build, but not on Newton. |
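For readers following along, here is a minimal C++ sketch of the class of bug described above: an object holds a pointer into its own storage, and the move constructor copies that pointer instead of re-pointing it. This is not the actual code from btree.cc. The types `batch_buggy_t` and `batch_fixed_t` and the helper names are hypothetical and exist only to illustrate the pattern. The sketch keeps the moved-from object alive, so it only compares pointers and never touches freed memory.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical types for illustration only; NOT RethinkDB code.
// Each object owns a small inline buffer plus a pointer into that buffer,
// so the pointer has to follow the object whenever the object is moved.
struct batch_buggy_t {
    int storage[4] = {0, 0, 0, 0};
    int *cursor = storage;  // points into this object's own storage

    batch_buggy_t() = default;

    // BUG: the pointer is copied verbatim, so `cursor` keeps referring to
    // the source object's storage instead of our own.
    batch_buggy_t(batch_buggy_t &&other) noexcept : cursor(other.cursor) {
        for (std::size_t i = 0; i < 4; ++i) storage[i] = other.storage[i];
    }

    void advance() { ++cursor; }
};

struct batch_fixed_t {
    int storage[4] = {0, 0, 0, 0};
    int *cursor = storage;

    batch_fixed_t() = default;

    // FIX: carry over the offset and rebuild the pointer relative to our
    // own storage.
    batch_fixed_t(batch_fixed_t &&other) noexcept {
        for (std::size_t i = 0; i < 4; ++i) storage[i] = other.storage[i];
        cursor = storage + (other.cursor - other.storage);
    }

    void advance() { ++cursor; }
};

// Returns true if `cursor` refers to one of the object's own elements.
// Uses equality only, which is well defined for unrelated pointers.
template <class T>
bool points_into_own_storage(const T &obj) {
    for (std::size_t i = 0; i < 4; ++i) {
        if (obj.cursor == &obj.storage[i]) return true;
    }
    return false;
}

// Exercises both variants. The sources stay alive, so nothing dangles here.
void check_move_semantics() {
    batch_buggy_t a;
    a.advance();
    batch_buggy_t b(std::move(a));
    assert(!points_into_own_storage(b));  // still aimed at a's storage

    batch_fixed_t c;
    c.advance();
    batch_fixed_t d(std::move(c));
    assert(points_into_own_storage(d));   // re-pointed at d's storage
}
```

Calling `check_move_semantics()` from a `main` is enough to run it. In the buggy variant, once the source object goes out of scope any write through `cursor` lands on whatever now occupies that memory, which matches the description of a crash showing up far from where the bug originated.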
From William:
When stress testing I got to a point where I could only start 1 of my
three rethinkdb servers (under load from clients), and every time I
tried to start again I would get the Segmentation fault below
(according to the log) [1]. This happens repeatedly within seconds of
restarting rethinkdb. I added a few gigs of swap in case memory was
relevant, but that made no difference. I tried stopping all clients
and then could start rethinkdb, but it quickly segfaulted as soon as I
started the client with load. I then reverted to "rethinkdb
2.1.4~0vivid (GCC 4.9.2)", and exactly the same tests (with the same
exact data) worked fine.
[1] Repeatably segfault: