
Segfault in v2.1.x in release mode #4917

Closed

mlucy opened this issue Oct 5, 2015 · 27 comments

@mlucy
Member

mlucy commented Oct 5, 2015

From William:

When stress testing I got to a point where I could only start 1 of my
three rethinkdb servers (under load from clients), and every time I
tried to start again I would get the Segmentation fault below
(according to the log) [1]. This happens repeatedly within seconds of
restarting rethinkdb. I added a few gigs of swap in case memory was
relevant, but that made no difference. I tried stopping all clients
and then could start rethinkdb, but it quickly segfaulted as soon as I
started the client with load. I then reverted to "rethinkdb
2.1.4~0vivid (GCC 4.9.2)", and exactly the same tests (with the same
exact data) worked fine.

[1] Repeatable segfault:

2015-10-05T04:56:46.573479760 40.338882s error: Error in src/arch/runtime/thread_pool.cc at line 359:
2015-10-05T04:56:46.573505337 40.338907s error: Segmentation fault from reading the address 0x7f01b7a228c8.
2015-10-05T04:56:46.573512359 40.338914s error: Backtrace:
2015-10-05T04:56:46.681749975 40.447153s error: Mon Oct  5 04:56:46 2015

1: backtrace_t::backtrace_t() at ??:?
2: format_backtrace(bool) at ??:?
3: report_fatal_error(char const*, int, char const*, ...) at ??:?
4: linux_thread_pool_t::fatal_signal_handler(int, siginfo_t*, void*) at ??:?
5: /lib/x86_64-linux-gnu/libpthread.so.0(+0x10d10) [0x7feccf2c3d10] at 0x7feccf2c3d10 (/lib/x86_64-linux-gnu/libpthread.so.0)
6: rget_cb_t::handle_pair(scoped_key_value_t&&, unsigned long, concurrent_traversal_fifo_enforcer_signal_t) at ??:?
7: concurrent_traversal_adapter_t::handle_pair_coro(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t) at ??:?
8: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (concurrent_traversal_adapter_t::*)(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> (concurrent_traversal_adapter_t*, scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> >::run_action() at ??:?
9: coro_t::run() at ??:?
2015-10-05T04:56:46.681791172 40.447193s error: Exiting.
@mlucy mlucy added the tp:bug label Oct 5, 2015
@mlucy mlucy self-assigned this Oct 5, 2015
@mlucy mlucy added this to the 2.1.x milestone Oct 5, 2015
@danielmewes danielmewes modified the milestones: 2.1.5, 2.1.x Oct 5, 2015
@danielmewes
Member

Since this seems to be a regression, we need to figure it out asap before shipping 2.1.5.

@danielmewes
Copy link
Member

William can't reproduce the issue anymore at the moment and we don't understand how it could have happened.
We'll ship 2.1.5 despite this since there are a lot of other relevant bug fixes in there.

@jorrit

jorrit commented Oct 7, 2015

I am experiencing this error as well since upgrading from 2.1.4 to 2.1.5.

I'm on Ubuntu 14.04. I can provide a core dump if needed.

[edit]
The crash happens after the first connection is made. I don't know if a specific query triggers the crash; I would need to investigate that. However, the crash is reproducible, and it does not occur just from starting the service.

@danielmewes
Member

@jorrit Thanks for the report. For now I recommend downgrading back to 2.1.4.

Could you post the full backtrace please? (the one above didn't include addresses, which should now be visible)

A core file would be awesome. Can you email me at daniel@rethinkdb.com please? I will send you instructions on where you can upload the file.

Since this bug actually appears to be very common, I've decided to pull the 2.1.5 release from our repositories again until we can resolve this. Sorry for the inconvenience.
For everyone affected by this, I recommend downgrading back to 2.1.4.

@danielmewes
Member

@jorrit Is there a chance you could also give us a copy of your data files? We are happy to sign an NDA if necessary.

In any case, the full backtrace and the core file would be extremely valuable already. Thanks a lot for your help!

@jorrit

jorrit commented Oct 7, 2015

The trace:
Error in src/arch/runtime/thread_pool.cc at line 359:
Segmentation fault from reading the address 0x7f013e93d888.
Backtrace:

1 [0xa0f920]: backtrace_t::backtrace_t() at 0xa0f920 (/usr/bin/rethinkdb)
2 [0xa0fcb3]: format_backtrace(bool) at 0xa0fcb3 (/usr/bin/rethinkdb)
3 [0xc440b5]: report_fatal_error(char const*, int, char const*, ...) at 0xc440b5 (/usr/bin/rethinkdb)
4 [0x931193]: linux_thread_pool_t::fatal_signal_handler(int, siginfo_t*, void*) at 0x931193 (/usr/bin/rethinkdb)
5 [0x7fd347eb7340]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7fd347eb7340] at 0x7fd347eb7340 (/lib/x86_64-linux-gnu/libpthread.so.0)
6 [0x7a25ba]: rget_cb_t::handle_pair(scoped_key_value_t&&, unsigned long, concurrent_traversal_fifo_enforcer_signal_t) at 0x7a25ba (/usr/bin/rethinkdb)
7 [0x9936f1]: concurrent_traversal_adapter_t::handle_pair_coro(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t) at 0x9936f1 (/usr/bin/rethinkdb)
8 [0x993474]: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (concurrent_traversal_adapter_t::*)(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> (concurrent_traversal_adapter_t*, scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> >::run_action() at 0x993474 (/usr/bin/rethinkdb)
9 [0x933748]: coro_t::run() at 0x933748 (/usr/bin/rethinkdb)
Exiting.

I have sent you an e-mail with the core dump file.

@mshi

mshi commented Oct 7, 2015

I am also experiencing this constantly since upgrading to 2.1.5. I'm running on Ubuntu; however, the server that is running Debian seems to be fine. Where would I get the full stack trace?

We had to shut down all the Ubuntu servers because they are completely unstable and would not stay running for more than 5 minutes. We will roll back to 2.1.4.

@danielmewes
Member

@mshi do you know the query that triggers this in your case? Which version of Ubuntu are you on?

As mentioned above, I recommend downgrading back to 2.1.4 until we've resolved this. We've removed 2.1.5 from our Ubuntu repository, so if you do a sudo apt-get update and then re-install RethinkDB, it should automatically pick up 2.1.4 again.

@bobiblazeski

I believe the same thing is happening on my machine, Ubuntu 15.04:

Version: rethinkdb 2.1.5+1~0vivid (GCC 4.9.2)
error: Error in src/arch/runtime/thread_pool.cc at line 359:
Version: rethinkdb 2.1.5+1~0vivid (GCC 4.9.2)
error: Segmentation fault from reading the address 0x7f01080a2878.
Version: rethinkdb 2.1.5+1~0vivid (GCC 4.9.2)
Version: rethinkdb 2.1.5+1~0vivid (GCC 4.9.2)
error: Error in src/arch/runtime/thread_pool.cc at line 359:
error: Error in src/arch/runtime/thread_pool.cc at line 359:
error: Backtrace:
error: Segmentation fault from reading the address 0x7f0107de2878.
error: Segmentation fault from reading the address 0x7f0107d02878.
error: Backtrace:
error: Backtrace:
error: Error in src/arch/runtime/thread_pool.cc at line 359:
error: Segmentation fault from reading the address 0x7f0107e02878.
error: Backtrace:
addr2line: 'rethinkdb': No such file
addr2line: 'rethinkdb': No such file
addr2line: 'rethinkdb': No such file
addr2line: 'rethinkdb': No such file
addr2line: 'rethinkdb': No such file
addr2line: 'rethinkdb': No such file
addr2line: 'rethinkdb': No such file
addr2line: 'rethinkdb': No such file
error: Wed Oct  7 19:37:00 2015

       1 [0xa5a2d0]: backtrace_t::backtrace_t() at 0xa5a2d0 (rethinkdb)
       2 [0xa5a69c]: format_backtrace(bool) at 0xa5a69c (rethinkdb)
       3 [0xcc24f3]: report_fatal_error(char const*, int, char const*, ...) at 0xcc24f3 (rethinkdb)
       4 [0x961334]: linux_thread_pool_t::fatal_signal_handler(int, siginfo_t*, void*) at 0x961334 (rethinkdb)
       5 [0x7fbb1532cd10]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x10d10) [0x7fbb1532cd10] at 0x7fbb1532cd10 (/lib/x86_64-linux-gnu/libpthread.so.0)
       6 [0x79758a]: rget_cb_t::handle_pair(scoped_key_value_t&&, unsigned long, concurrent_traversal_fifo_enforcer_signal_t) at 0x79758a (rethinkdb)
       7 [0x9c9134]: concurrent_traversal_adapter_t::handle_pair_coro(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t) at 0x9c9134 (rethinkdb)
       8 [0x9c8ec6]: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (concurrent_traversal_adapter_t::*)(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> (concurrent_traversal_adapter_t*, scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> >::run_action() at 0x9c8ec6 (rethinkdb)
       9 [0x963ac8]: coro_t::run() at 0x963ac8 (rethinkdb)
error: Exiting.
error: Wed Oct  7 19:37:00 2015

       1 [0xa5a2d0]: backtrace_t::backtrace_t() at 0xa5a2d0 (rethinkdb)
       2 [0xa5a69c]: format_backtrace(bool) at 0xa5a69c (rethinkdb)
       3 [0xcc24f3]: report_fatal_error(char const*, int, char const*, ...) at 0xcc24f3 (rethinkdb)
       4 [0x961334]: linux_thread_pool_t::fatal_signal_handler(int, siginfo_t*, void*) at 0x961334 (rethinkdb)
       5 [0x7fbb1532cd10]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x10d10) [0x7fbb1532cd10] at 0x7fbb1532cd10 (/lib/x86_64-linux-gnu/libpthread.so.0)
       6 [0x79758a]: rget_cb_t::handle_pair(scoped_key_value_t&&, unsigned long, concurrent_traversal_fifo_enforcer_signal_t) at 0x79758a (rethinkdb)
       7 [0x9c9134]: concurrent_traversal_adapter_t::handle_pair_coro(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t) at 0x9c9134 (rethinkdb)
       8 [0x9c8ec6]: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (concurrent_traversal_adapter_t::*)(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> (concurrent_traversal_adapter_t*, scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> >::run_action() at 0x9c8ec6 (rethinkdb)
       9 [0x963ac8]: coro_t::run() at 0x963ac8 (rethinkdb)
^CTrace/breakpoint trap (core dumped)

@mshi

mshi commented Oct 7, 2015

@danielmewes I am not sure which exact queries, because we have a lot of queries and the error logs were filled with errors that happened after the crash. I will take a look into it later today, but for now I need to roll back all of our servers. It seems like even servers that are replicas (not primaries) crash, so it's some sort of read. Also worth reiterating that this only happens on Ubuntu.

@mshi

mshi commented Oct 7, 2015

Also, I guess we have to downgrade all of our servers?

Connection attempt from [::ffff:104.196.11.133]:55203 failed, reason: unrecognized or incompatible version (local: 2.1.5, remote: 2.1.2)

Even though one server is 2.1.4 and the other is 2.1.5.

@coffeemug
Contributor

@mshi -- yes, you don't want to mix 2.1.4 and 2.1.5 in your clusters right now.

Sorry everyone about the inconvenience -- this is totally our fault. We'll hunt this bug down ASAP; in the meantime, if anyone wants a free t-shirt/stickers/water bottle to (barely) compensate you for your troubles, please email christina@rethinkdb.com with your address/shirt size, and we'll set you up.

@danielmewes
Member

It took us a bunch of hours looking at disassembler output, working with the debugger and looking through code changes, but we finally found the bug.

We're going to ship an updated release asap.

@coffeemug
Contributor

You can't keep us hanging like that man! What's the bug? :)

@danielmewes
Member

Let me wrap up testing and start the build first, then the secret will be revealed.

@mlucy
Member Author

mlucy commented Oct 8, 2015

@coffeemug -- It's a bug in the job_data_t class. The class definition is really small; see if you can spot it! (The definition starts at class job_data_t {.)

@mlucy
Member Author

mlucy commented Oct 8, 2015

(The reason it took so long to debug is that the bug manifested in a totally different place than it originated.)

danielmewes pushed a commit that referenced this issue Oct 8, 2015
The job_data_t move constructor would move the accumulator which had a pointer to the batcher, but the batcher would not move with it.

OTS reviewed by @VeXocide.
danielmewes pushed a commit that referenced this issue Oct 8, 2015
The job_data_t move constructor would move the accumulator which had a pointer to the batcher, but the batcher would not move with it.

OTS reviewed by @VeXocide.
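
For readers following along, here is a minimal hypothetical sketch of the kind of bug the commit message describes; the type names and members below are stand-ins, not the real job_data_t, but the shape is the same: a move constructor moves a member whose internal pointer refers to a sibling member, without rebinding that pointer to the new object.

#include <utility>

// Hypothetical stand-ins, not the actual RethinkDB types.
struct batcher_t { /* batching state */ };

struct accumulator_t {
    batcher_t *batcher;  // non-owning pointer into the enclosing job_data_t
};

struct job_data_t {
    batcher_t batcher;
    accumulator_t accumulator;

    job_data_t() : batcher(), accumulator{&batcher} {}

    // Buggy move: the accumulator is moved over, but it still points at the
    // old object's batcher, so later accesses go through a stale pointer.
    job_data_t(job_data_t &&other)
        : batcher(std::move(other.batcher)),
          accumulator(std::move(other.accumulator)) {}

    // One possible repair (illustrative only): rebind the pointer so it
    // follows the new object's batcher.
    // job_data_t(job_data_t &&other)
    //     : batcher(std::move(other.batcher)),
    //       accumulator(std::move(other.accumulator)) {
    //     accumulator.batcher = &batcher;
    // }
};

With the buggy version, any code that later reaches the batcher through the moved accumulator reads and writes whatever happens to occupy the old object's storage, which matches the corruption-at-a-distance behavior described in this thread.
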
@danielmewes
Member

Fixed in v2.1.x e23bd5b and next 8d9b5d1

@coffeemug
Contributor

👏

@danielmewes danielmewes modified the milestones: 2.1.5, 2.1.x Oct 8, 2015
@mbrevda

mbrevda commented Oct 8, 2015

Has 2.1.5-2 been (re)released? I see it on GitHub, but not on download.rethinkdb.com/dist.

@danielmewes
Member

@mbrevda Not yet. We're currently building the packages, which typically takes a few hours. We're then going to upload them to our server, at which point the new packages will appear in the file list.

@mbrevda

mbrevda commented Oct 8, 2015

Cool. Are there any other holdups for Homebrew/legacy-homebrew#44707 once the package shows up?

@danielmewes
Member

No, I think we only need to remove the ICU dependency from the recipe, and then it should be good to go.

@mbrevda

mbrevda commented Oct 8, 2015

That's already done. Crazy day, huh? Thanks!

@larkost
Collaborator

larkost commented Oct 8, 2015

@danielmewes is there any way of reliably triggering this? Mainly I am trying to make sure that, if there is a reasonable test to write out of this, we get this class of regression covered in testing.

@danielmewes
Member

@larkost Testing for a bug like this in general is hard (at least I don't know how). A pointer wasn't being updated when an object was moved, so it kept pointing to the old memory location. Some other code would then access that pointer and overwrite whatever value happened to be stored at that location.

Usually Valgrind should catch such problems, but we tried it and it didn't catch the problem in this case. It's possible that there are certain parameters that would enable Valgrind to catch this. I think that would be worth investigating. You can build a binary that works with Valgrind as follows:

make VALGRIND=1 ALLOCATOR=system DEBUG=1

(the DEBUG=1 is optional)

There is also a suppressions file which should be used to avoid a bunch of false alarms:

valgrind --suppressions=scripts/rethinkdb-valgrind-suppressions.supp build/debug_valgrind_system/rethinkdb --no-update-check ...

In this particular case, the crash could actually be reproduced by the Python connection test, among many others. In fact, any test that used a cursor and retrieved more than one batch from it would run into the segmentation fault. However, this was only the case with certain compiler and/or library versions, because the memory layouts differ. On Newton, as well as the other machines that we run our tests on, the memory corruption would for example remain (seemingly) without any obvious consequences.
I don't think there's much value in writing a separate regression test for this particular bug, since it's extremely unlikely that we'll make the same mistake in the exact same place again. Instead, it would be neat if we could make Valgrind catch these problems properly.

@mlucy
Member Author

mlucy commented Oct 8, 2015

@danielmewes -- My theory about why Valgrind didn't catch the error: I think the object we moved out of was stored on the stack, and GCC knew enough about the semantics of std::move to reuse that space for something else. But the space was never freed and reallocated (since it was on the stack), so Valgrind couldn't catch it.

@larkost -- the best change to the test system we could make to catch bugs like this in the future would be to test the compiled release binaries on the platforms they're built for. Our tests reliably reproduced this bug on the Ubuntu 14 build, but not on Newton.
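
To illustrate @mlucy's theory with a standalone, hypothetical example (this is not RethinkDB code, and the names are made up): when the stale pointer and the storage it targets both sit inside a single live stack frame, every access is an ordinary read of addressable, initialized stack memory, so Valgrind's memcheck has nothing to flag.

#include <cstdio>

struct accumulator_t {
    int *counter;  // non-owning pointer into a sibling object
};

int main() {
    int *stale = nullptr;
    {
        int batch_size = 7;
        accumulator_t acc{&batch_size};
        stale = acc.counter;   // keeps pointing at batch_size's stack slot
    }                          // batch_size's lifetime ends here...
    {
        int unrelated = 99;    // ...and the compiler is free to reuse the
        (void)unrelated;       // same stack slot for this variable
    }
    // Undefined behavior, but the read below targets valid, initialized
    // memory in main's still-live frame, so memcheck reports nothing, even
    // though the value may now belong to `unrelated`.
    printf("%d\n", *stale);
    return 0;
}

Tools that track per-variable stack lifetimes, such as AddressSanitizer's use-after-scope checks, are generally a better fit for this class of bug than memcheck.
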
