Skip to content
This repository has been archived by the owner on Nov 17, 2020. It is now read-only.

Test for dynamic_reconfiguration hangs. #7

Closed
jemc opened this issue Jun 23, 2015 · 15 comments
Closed

Test for dynamic_reconfiguration hangs. #7

jemc opened this issue Jun 23, 2015 · 15 comments
Assignees
Labels
Milestone

Comments

@jemc
Copy link

jemc commented Jun 23, 2015

I've been having some issues with the federation plugin (that, incidentally, may have to do with dynamic reconfiguration), so I thought I would clone the repository, get the test suite running locally, and try to make a failing test for my issue (which I'm still trying to pinpoint to a minimal reproduction).

However, when running the test suite locally, the tests hang indefinitely at rabbit_federation_exchange_test:dynamic_reconfiguration

git clone https://github.com/rabbitmq/rabbitmq-public-umbrella.git
cd rabbitmq-public-umbrella
make co
cd rabbitmq-federation
make test
rm -rf /tmp/rabbitmq-multi-node/plugins
mkdir -p /tmp/rabbitmq-multi-node/plugins/plugins
cp -p ../rabbitmq-federation/dist/*.ez /tmp/rabbitmq-multi-node/plugins/plugins
make: Circular .+in-broker-test <- .+standalone-test dependency dropped.
rm -rf /tmp/rabbitmq-test
mkdir -p /tmp/rabbitmq-test/log /tmp/rabbitmq-test/plugins /tmp/rabbitmq-test/rabbit-test
cp -p ./dist/*.ez /tmp/rabbitmq-test/plugins
rm -f /tmp/rabbitmq-test/plugins/rabbit_common*.ez
RABBITMQ_PLUGINS_DIR=/tmp/rabbitmq-test/plugins RABBITMQ_ENABLED_PLUGINS_FILE=/tmp/rabbitmq-test/enabled_plugins ../rabbitmq-server/scripts/rabbitmq-plugins set --offline $(RABBITMQ_PLUGINS_DIR=/tmp/rabbitmq-test/plugins RABBITMQ_ENABLED_PLUGINS_FILE=/tmp/rabbitmq-test/enabled_plugins ../rabbitmq-server/scripts/rabbitmq-plugins list -m | tr '\n' ' ')
The following plugins are now enabled:
  rabbitmq_test
  rabbitmq_federation
  amqp_client
Offline change; changes will take effect at broker restart.
MAKE="make" RABBITMQ_PLUGINS_DIR=/tmp/rabbitmq-test/plugins RABBITMQ_ENABLED_PLUGINS_FILE=/tmp/rabbitmq-test/enabled_plugins RABBITMQ_LOG_BASE=/tmp/rabbitmq-test/log RABBITMQ_MNESIA_BASE=/tmp/rabbitmq-test/rabbit-test RABBITMQ_PID_FILE=/tmp/rabbitmq-test/rabbit-test.pid RABBITMQ_NODENAME=rabbit-test RABBITMQ_SERVER_START_ARGS='-pa ./test/ebin -coverage directories ["./ebin","./test/ebin"]' RABBITMQ_CONFIG_FILE= ../rabbitmq-server/scripts/rabbitmq-server &
../rabbitmq-server/scripts/rabbitmqctl -n rabbit-test wait /tmp/rabbitmq-test/rabbit-test.pid
Waiting for 'rabbit-test@localhost' ...
pid is 16292 ...

              RabbitMQ %%VSN%%. Copyright (C) 2007-2015 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /tmp/rabbitmq-test/log/rabbit-test.log
  ######  ##        /tmp/rabbitmq-test/log/rabbit-test-sasl.log
  ##########
              Starting broker... completed with 3 plugins.
echo > /tmp/rabbitmq-test/rabbit-test-output && if ./etc/setup-rabbit-test.sh &&  echo >> /tmp/rabbitmq-test/rabbit-test-output && echo "rabbit_test_runner:run_in_broker(\"./test/ebin\",\"all\")." | tee -a /tmp/rabbitmq-test/rabbit-test-output | erl_call -sname rabbit-test -e | tee -a /tmp/rabbitmq-test/rabbit-test-output | egrep "{ok, (ok|passed)}" >/dev/null && MAKE="make" RABBITMQ_NODENAME="rabbit-test"  : ; then touch /tmp/rabbitmq-test/.passed ; printf "\nPASSED\n" ; else cat /tmp/rabbitmq-test/rabbit-test-output ; printf "\n\nFAILED\n" ; fi
Setting runtime parameter "localhost" for component "federation-upstream" to "{\"uri\": \"amqp://\"}" ...
Setting runtime parameter "local5673" for component "federation-upstream" to "{\"uri\": \"amqp://guest:guest@localhost:5673\"}" ...
Setting runtime parameter "upstream" for component "federation-upstream-set" to "[{\"upstream\": \"localhost\", \"exchange\": \"upstream\", \"queue\": \"upstream\"}]" ...
Setting runtime parameter "upstream2" for component "federation-upstream-set" to "[{\"upstream\": \"localhost\", \"exchange\": \"upstream2\", \"queue\": \"upstream2\"}]" ...
Setting runtime parameter "localhost" for component "federation-upstream-set" to "[{\"upstream\": \"localhost\"}]" ...
Setting runtime parameter "upstream12" for component "federation-upstream-set" to "[{\"upstream\": \"localhost\", \"exchange\": \"upstream\", \"queue\": \"upstream\"},\n                                                          {\"upstream\": \"localhost\", \"exchange\": \"upstream2\", \"queue\": \"upstream2\"}]" ...
Setting runtime parameter "one" for component "federation-upstream-set" to "[{\"upstream\": \"localhost\", \"exchange\": \"one\", \"queue\": \"one\"}]" ...
Setting runtime parameter "two" for component "federation-upstream-set" to "[{\"upstream\": \"localhost\", \"exchange\": \"two\", \"queue\": \"two\"}]" ...
Setting runtime parameter "upstream5673" for component "federation-upstream-set" to "[{\"upstream\": \"local5673\", \"exchange\": \"upstream\"}]" ...
Setting policy "fed" for pattern "^fed\\." to "{\"federation-upstream-set\": \"upstream\"}" with priority "0" ...
Setting policy "fed12" for pattern "^fed12\\." to "{\"federation-upstream-set\": \"upstream12\"}" with priority "0" ...
Setting policy "one" for pattern "^two$" to "{\"federation-upstream-set\": \"one\"}" with priority "0" ...
Setting policy "two" for pattern "^one$" to "{\"federation-upstream-set\": \"two\"}" with priority "0" ...
Setting policy "hare" for pattern "^hare\\." to "{\"federation-upstream-set\": \"upstream5673\"}" with priority "0" ...
Setting policy "all" for pattern "^all\\." to "{\"federation-upstream-set\": \"all\"}" with priority "0" ...
Setting policy "new" for pattern "^new\\." to "{\"federation-upstream-set\": \"new-set\"}" with priority "0" ...

In-broker tests
================

Running 24 of 24 tests; FILTER=all


rabbit_federation_unit_test
---------------------------
serialisation:                     [running] [PASSED].
scratch_space:                     [running] [PASSED].
remove_credentials:                [running] [PASSED].

rabbit_federation_exchange_test
-------------------------------
simple:                            [running] [PASSED].
multiple_upstreams:                [running] [PASSED].
multiple_uris:                     [running] [PASSED].
multiple_downstreams:              [running] [PASSED].
e2e:                               [running] [PASSED].
unbind_on_delete:                  [running] [PASSED].
unbind_on_unbind:                  [running] [PASSED].
unbind_gets_transmitted:           [running] [PASSED].
no_loop:                           [running] [PASSED].
dynamic_reconfiguration:           [running]rabbit_test_runner: make_test_single...*timed out*
undefined
=======================================================
  Failed: 0.  Skipped: 0.  Passed: 12.
One or more tests were cancelled.


rabbit_test_runner:run_in_broker("./test/ebin","all").
{ok, error}

FAILED
sleep 1
echo "rabbit_misc:report_cover(), init:stop()." | erl_call -sname rabbit-test -e
{ok, ok}sleep 1
test -f /tmp/rabbitmq-test/.passed
../do-package.mk:533: recipe for target '.+in-broker-test' failed
make: *** [.+in-broker-test] Error 1
@jemc
Copy link
Author

jemc commented Jun 24, 2015

If I tail -f /tmp/rabbitmq-test/log/rabbit-test.log, I can watch the rabbitmq server's log messages during the tests. Here is the excerpt of the log covering the connection for the dynamic_reconfiguration test (until it hangs).

=INFO REPORT==== 24-Jun-2015::09:34:13 ===
accepting AMQP connection <0.1925.0> (127.0.0.1:51928 -> 127.0.0.1:5672)

=WARNING REPORT==== 24-Jun-2015::09:34:13 ===
Channel (<0.2004.0>): Unregistering confirm handler <0.1972.0> because it died. Reason: shutdown

=INFO REPORT==== 24-Jun-2015::09:34:13 ===
Federation exchange 'all.fed1' in vhost '/' connected to exchange 'all.fed1' in vhost '/' on amqp://

=INFO REPORT==== 24-Jun-2015::09:34:13 ===
Federation exchange 'all.fed1' in vhost '/' disconnecting from exchange 'all.fed1' in vhost '/' on amqp://

=WARNING REPORT==== 24-Jun-2015::09:34:13 ===
Channel (<0.2023.0>): Unregistering confirm handler <0.1979.0> because it died. Reason: shutdown

=WARNING REPORT==== 24-Jun-2015::09:34:13 ===
Channel (<0.1971.0>): Unregistering confirm handler <0.1946.0> because it died. Reason: shutdown

=WARNING REPORT==== 24-Jun-2015::09:34:13 ===
Channel (<0.2111.0>): Unregistering confirm handler <0.2086.0> because it died. Reason: shutdown

=WARNING REPORT==== 24-Jun-2015::09:34:13 ===
Channel (<0.2151.0>): Unregistering confirm handler <0.2107.0> because it died. Reason: shutdown

=ERROR REPORT==== 24-Jun-2015::09:34:13 ===
Channel error on connection <0.2112.0> (<rabbit-test@localhost.1.2112.0>, vhost: '/', user: 'none'), channel 2:
{amqp_error,not_found,"no exchange 'all.fed2' in vhost '/'",
            'exchange.declare'}

@michaelklishin
Copy link
Member

Hmm, I can't reproduce this locally.

@jemc
Copy link
Author

jemc commented Jun 24, 2015

@michaelklishin - is there anything I can do locally to give you more useful information for troubleshooting? I spent some time trying to connect to the running erlang node and get a backtrace, but didn't get that working - if you have a procedure for that or for retrieving some other debugging info I'd be happy to get you whatever you need - I'm not afraid to get my hands dirty here.

Also, I'm on freenode as |jemc| if you want to get in touch and do some interactive troubleshooting on this.

@michaelklishin
Copy link
Member

You can attach a shell to a running node and run the observer app from there, for example.

@jemc
Copy link
Author

jemc commented Jun 24, 2015

Here's the result of i(). on the rabbit-test node while hung - let me know if anything else would be useful here.
https://gist.github.com/jemc/e76167278c52cbc9179a

@michaelklishin
Copy link
Member

Does this correlate with the log (namely, can I assume that the node was not restarted between you've taken this log and attached a shell?

@jemc
Copy link
Author

jemc commented Jun 24, 2015

@michaelklishin - This is not true of the rabbitmq log output above, but shortly after posting the gist I attached a second file with rabbitmq log output correlated to the erlang shell output, without restarting the rabbitmq node. So those two files in the gist are correlated as you describe.

@jemc
Copy link
Author

jemc commented Jun 24, 2015

So I thought I'd try to reproduce this on Circle CI, which has the nice feature that the owner of a given build can SSH into the CI VM to troubleshoot what went wrong if the build fails.

When I did this, I hit a different problem before reaching the test with the problem I filed this issue ticket about, but I wanted to mention it here first (if you want me to file a separate ticket, though, I can). If you like you can make a personal fork of rabbitmq-federation on github, add a Circle CI config file like I added to my personal fork, then log into Circle CI and enable your fork for testing. As the owner of the build you can SSH into the build as you see fit.

Alternatively, I don't know what OS you run on your dev machine (I use Linux - Fedora 22), but you might find you run into some of these same problems if you run the tests in some flavor of Linux in a VM on your dev machine (it looks like Circle CI uses some version of Ubuntu).

Update: When run again, the Circle CI test hung in a different place.

@michaelklishin
Copy link
Member

I used OS X, our CI machines use Debian.

@Gsantomaggio
Copy link
Member

I can reproduce the same problem using a Debian docker image.
dynamic_reconfiguration: [running]
rabbit_test_runner: make_test_single...*timed out* undefined

Using OSX I don't have the problem.

If you need:

1 - docker run -i -t debian /bin/bash
2 - execute this: https://gist.github.com/Gsantomaggio/fb2c98fdf46b307c4c85
3 - cd rabbitmq-federation && make test

@dcorbacho
Copy link
Contributor

The rabbit_federation_exchange_test:dynamic_reconfiguration_test/0 gets blocked under certain conditions (i.e. Ubuntu in a single core).
This is caused by a race condition, which can be avoided by adding a timer:sleep/1 in the test.

The test gets blocked during the cleanup, when it tries to delete the exchanges created. It is the supervisor blocked in stop child (rabbit_federation_exchange_link_sup_sup:stop_child/1), as it has an almost infinity timeout (16#ffffffff == 4294967295 ms == 1193 hours).
The supervisor can't stop the exchange, which is in fact still starting.

This is the trace of what the exchange was doing and where it got blocked, thus blocking the supervisor:

rabbit_federation_exchange_link:go/1
rabbit_federation_link_util:start_conn_ch/5
rabbit_federation_exchange_link:ensure_upstream_bindings/2
rabbit_federation_exchange_link:ensure_upstream_exchange/1
rabbit_federation_link_util:disposable_channel_call/3

The amqp_channel:call at https://github.com/rabbitmq/rabbitmq-federation/blob/stable/src/rabbit_federation_link_util.erl#L268 fails with an exit:shutdown and the catch section is executed.
The ErrFun in the catch blocks, while it is doing the amqp_channel:call/2 in https://github.com/rabbitmq/rabbitmq-federation/blob/stable/src/rabbit_federation_exchange_link.erl#L489

The amqp_channel:call/2 calls rabbit_channel:do/2 with #'exchange.declare', which then calls rabbit_exchange:lookup/1.
The lookup returns not_found, as the test we are executing just called amqp_channel with #'exchange.delete' and is in the middle of the deletion.
Then, rabbit_exchange:declare/6 is called after the lookup. The declare is blocked in the mnesia transaction https://github.com/rabbitmq/rabbitmq-server/blob/stable/src/rabbit_exchange.erl#L166

Thus, at this point RabbitMQ is trying to declare again the exchange while it is being deleted by the supervisor.

Update 29/12/15
The mnesia transaction gets blocked when it tries to create the exchange as part of the callback: https://github.com/rabbitmq/rabbitmq-server/blob/stable/src/rabbit_exchange.erl#L179

That calls:

rabbit_federation_exchange:create/2
rabbit_federation_exchange:maybe_start/1
rabbit_federation_exchange_link_sup_sup:start_child/1

Thus, the supervisor is waiting for a call that calls the supervisor itself -> deadlock.

The proposed solution sets an status deleting-exchange using runtime parameters, which avoids the declare to continue until the deadlock.

PR in rabbitmq/rabbitmq-server#533

@michaelklishin
Copy link
Member

Thank you, @Ayanda-D and @dcorbacho. We will QA the issue before Monday noon and produce a build for the user to try.

@michaelklishin michaelklishin added this to the 3.6.1 milestone Jan 11, 2016
@michaelklishin
Copy link
Member

We believe this is fixed in rabbitmq/rabbitmq-server#533. Happy to provide a one-off build from stable for interested folks to help verify.

@michaelklishin
Copy link
Member

@dcorbacho also: can we switch related supervisors to use a more sensible timeout, e.g. 30 seconds?

@dcorbacho
Copy link
Contributor

I'll look at it @michaelklishin

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

4 participants