New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
librbd: deadlock during cooperative exclusive lock transition #5319
Conversation
Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit e5ffae5)
Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 21f990e)
Conditionally support helgrind annotations if valgrind support is enabled during the build. Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 650ad32)
librbd use of an image hierarchy resulted in lock names being re-used and incorrectly analyzed. librbd now uses unique lock names per instance, but to prevent an unbounded growth of tracked locks, we now remove lock tracking once a lock is destructed. Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 7c7df2c)
Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 6e400b9)
Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit b65ae4b)
Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit c1e1445)
This is needed to allow an atomic compare and update operation from the rebuild object map utility. Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 2db758c)
The ObjectCacher complete the read callback while still holding the cache lock. This introduces lock ordering issues which are resolved by queuing the completion to execute in a clean (unlocked) context. Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 0024677)
Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit c474ee4)
The callback routine most likely will attempt to retrieve the result code, which will result in a recursive lock attempt. Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 3ad19ae)
It is only used by clients and it causes a large slowdown in performance due to the rate at which the lock is constructed/ destructed for each IO request. Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 66e7464)
In order to support the invariant that all state machine callbacks occur without holding locks, transitions that don't always involve a librados call should queue their callback. Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 218bc2d)
Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 5f157f2)
Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit c352bcd)
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Fix for: CID 1274319: Uninitialized scalar field (UNINIT_CTOR) uninit_member: Non-static class member m_object_state is not initialized in this constructor nor in any functions that it calls. Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de> (cherry picked from commit 48f18ea)
Moved all parent overlap computation to within AioRequest so that callers don't need to independently compute the overlap. Also removed the need to pass the snap_id for write operations since it can only be CEPH_NOSNAP. Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 4651597)
Move AbstractWrite's invocation of copyup to the CopyupRequest class. The AioRequest write path will now always create a CopyupRequest, which will now append the actual write ops to the copyup. Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 7be3df6)
Detected during an fsx run where a refresh and CoR were occurring concurrently. The refresh held the snap_lock and was waiting on the object_map_lock, while the CoR held object_map_lock and was waiting for snap_lock. Fixes: #11577 Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 8cbd92b)
It is expected that all IO is flushed and all async ops are cancelled prior to releasing the exclusive lock. Therefore, replace handling of lost exclusive locks in state machines with an assertion. Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit d6b733d)
@dillaman I'll try a new strategy for integration that should deal with dependencies. Simply by merging into the integration branch in the same order commits are merged in master. That should automatically merge in order and if one pull request is indeed including all the commits from another, that should just work nicely. So you can rebase this one, it won't be a loss of time: both can be tested at the same time even though they depend on each other. |
@dachary sweet ... I'll take care of the rebases tonight. |
123726f
to
af67304
Compare
@dillaman when I try to merge this pull request on top of #5296, it conflicts. Which suggests the commits that look like they come from #5296 that are included in #5319 are not exactly the same really. Or I'm missing something ;-) Auto-merging src/librbd/ImageWatcher.cc CONFLICT (content): Merge conflict in src/librbd/ImageWatcher.cc Auto-merging src/common/Mutex.cc CONFLICT (content): Merge conflict in src/common/Mutex.cc Automatic merge failed; fix conflicts and then commit the result. |
If the exclusive lock owner acks the lock release request but crashes before it actually releases the lock, the requestor will wait forever. Therefore, after a certain timeout, retry the request again until it succeeds. Fixes: #11537 Backport: hammer Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 37c74e6)
Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 3e1e561)
Include the instance pointer so that different images can be differentiated in the logs. Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit b951a73)
Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 879b8a7)
Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit d2a1c22)
Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit f97ce46)
It's possible that a stale notice is received and will be discarded after the request lock has been canceled. This will stale the client. Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit d9dd5c5)
af67304
to
92272dd
Compare
@dachary try this -- I rebuilt the branch |
@dillaman this is what I see: http://paste2.org/fExaxzhm . FYI the reference is fetched with git fetch --force ceph +refs/pull/$pr/head:refs/remotes/ceph/pull/$pr/head |
@dachary this is what I see: http://fpaste.org/249506/81929831/ (I have a slightly different PR ref) |
@dillaman all is well, I shot myself in the foot (the ref of this pr was not updated because it was not selected because it had DNM in the title). After fetching the updated pr things are as you describe. Thanks for your patience. |
…ve lock transition Reviewed-by:
Passes the rbd suite http://tracker.ceph.com/issues/12701#rbd |
librbd: deadlock during cooperative exclusive lock transition Reviewed-by: Loic Dachary <ldachary@redhat.com>
http://tracker.ceph.com/issues/12235
Note: contains commits included in PR #5296 that can be rebased away once they merge.