Bug #12437
Mutex Assert from PipeConnection::try_get_pipe
Status: Closed
Description
This occurred during a trial run of cbt's ceph_test_rados benchmark while OSD 3 was being marked out/down and up/in in a loop. State transitions occurred once "ceph health" no longer reported degraded, peering, recovery_wait, stuck, inactive, unclean, or recovery warnings.
0> 2015-07-22 13:36:47.217698 7fed761ba700 -1 common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fed761ba700 time 2015-07-22 13:36:47.213562
common/Mutex.cc: 95: FAILED assert(r == 0)
ceph version 0.94.2-108-g45beb86 (45beb86423c3bd74dbafd36c6822e71ad9680e17)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x78) [0xbc9578]
2: (Mutex::Lock(bool)+0x105) [0xb79ff5]
3: (PipeConnection::try_get_pipe(Pipe**)+0x18) [0xca9828]
4: (SimpleMessenger::submit_message(Message*, PipeConnection*, entity_addr_t const&, int, bool)+0x66) [0xba5a96]
5: (SimpleMessenger::submit_message(Message*, PipeConnection*, entity_addr_t const&, int, bool)+0x427) [0xba5e57]
6: (SimpleMessenger::_send_message(Message*, Connection*)+0x97) [0xba7977]
7: (OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0x1fe) [0x6aca9e]
8: (PG::share_pg_info()+0x4d1) [0x7ed341]
9: (ReplicatedPG::snap_trimmer()+0x603) [0x84f953]
10: (OSD::SnapTrimWQ::_process(PG*)+0x1a) [0x6d709a]
11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0xbba226]
12: (ThreadPool::WorkThread::entry()+0x10) [0xbbb2d0]
13: (()+0x7ee5) [0x7fed95ee7ee5]
14: (clone()+0x6d) [0x7fed949c5b8d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Samuel Just almost 9 years ago
- Priority changed from Normal to Urgent
Updated by Mark Nelson almost 9 years ago
- File ceph_test_rados.txt.gz ceph_test_rados.txt.gz added
- File recovery.log.gz recovery.log.gz added
FWIW, this appears to have happened suspiciously close to a state transition where OSD 3 was marked down/out:
[Wed Jul 22 13:36:45 CDT 2015] Cluster appears to have healed.
[Wed Jul 22 13:36:46 CDT 2015] Cluster is healthy, but repeat is set. Moving to markdown state.
[Wed Jul 22 13:36:47 CDT 2015] Marking OSD 3 down.
[Wed Jul 22 13:36:48 CDT 2015] Marking OSD 3 out.
[Wed Jul 22 13:36:48 CDT 2015] Waiting for the cluster to break and heal
I've included the recovery log and ceph_test_rados output as well to show which operations were in flight at the time of the assert.
Updated by David Zafman almost 9 years ago
I used Eclipse to find the routines that reference Connection::lock. In Pipe::read_message() there is a lock/unlock pair with no code path that can skip the unlock. In all other cases a Mutex::Locker is used, so its destructor performs the unlock. There is no missing unlock, and the stack trace shows no recursive code path that would attempt to take the lock twice.
I think there are only two possibilities left: either there was memory corruption, which will be hard to find, or the Connection was destructed and EINVAL was returned because pthread_mutex_destroy() had already been called on the Mutex.
Updated by Haomai Wang almost 9 years ago
I suspect the connection was destructed. For example, when OSDService calls send_message_osd_cluster and gets the connection, it holds only a raw pointer to the Connection rather than a ConnectionRef, which would increment the reference count. After the first try_get_pipe, the Connection was released, and submit_message then called a method on the freed connection and failed to take the lock.
Updated by David Zafman almost 9 years ago
Haomai Wang wrote:
I suspect the connection was destructed. For example, when OSDService calls send_message_osd_cluster and gets the connection, it holds only a raw pointer to the Connection rather than a ConnectionRef, which would increment the reference count. After the first try_get_pipe, the Connection was released, and submit_message then called a method on the freed connection and failed to take the lock.
Yes, we need a ConnectionRef held there for the duration of the send_message() call.
Updated by David Zafman over 8 years ago
- Status changed from 7 to Pending Backport
- Backport set to firefly hammer
Updated by Nathan Cutler over 8 years ago
- Status changed from Pending Backport to Resolved