
mds: fix client cap/message replay order on restart #7199

Merged
merged 5 commits into ceph:jewel from jewel-14254 on Jan 20, 2016

Conversation

@ukernel ukernel commented Jan 12, 2016

No description provided.

@ukernel ukernel added the bug-fix and cephfs (Ceph File System) labels on Jan 12, 2016
for (map<inodeno_t, list<MDSInternalContextBase*> >::iterator p = cap_reconnect_waiters.begin();
     p != cap_reconnect_waiters.end();
     ++p)
  mds->queue_waiters(p->second);
Member

Why is this the right place to requeue any remaining cap flushes? I don't think it has much to do with exporting caps...in fact I don't think we should even have any leftover flushes, should we?

Contributor Author

Yes, we should not have any leftover flushes.

Member

In that case can we put it somewhere less buried?

And perhaps we should print out central log warnings for any unhandled cap flushes rather than just blindly applying them? Not sure, but I generally don't like quiet cleanups of bugs like this.

Contributor Author

Not quiet. Locker::handle_client_cap will print an error message.

Member

Mmm, true, but those just get buried in the mds log and don't look like a client reconnect bug. Anything remaining here is a specific, special category of unknown ino, right?

Maybe we don't need anything special flagging it but such things make me nervous.
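
The hunk under discussion drains cap_reconnect_waiters when client replay ends. Below is a minimal sketch of the reviewer's suggestion, with simplified stand-in types (Context here plays the role of the MDS waiter callbacks, and the warning text is invented): anything still queued at this point is a delayed cap message whose inode never appeared, so it is surfaced rather than requeued silently.

#include <cstdint>
#include <iostream>
#include <list>
#include <map>

struct Context {
  virtual void finish() = 0;
  virtual ~Context() {}
};
using inodeno_t = uint64_t;

static std::map<inodeno_t, std::list<Context*>> cap_reconnect_waiters;

void drain_cap_reconnect_waiters() {
  for (auto& p : cap_reconnect_waiters) {
    // Anything still here is a delayed cap message whose inode never showed
    // up during client replay; warn about it instead of requeueing silently.
    std::cerr << "warning: " << p.second.size()
              << " delayed cap message(s) for unknown ino " << p.first << "\n";
    for (Context* c : p.second) {
      c->finish();  // the retried handler then reports the unknown ino itself
      delete c;
    }
  }
  cap_reconnect_waiters.clear();
}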

@gregsfortytwo
Member

I think you're right about this basic scheme being fine. Please remember to describe what testing you've performed when submitting PRs (especially for stable branches!). I know it'll be annoying to set up but we need a test to make sure this behavior doesn't regress in the future.

@ukernel
Contributor Author

ukernel commented Jan 13, 2016

I ran a modified ceph-mds, which does not write the journal and crashes on setattr, then ran the following commands:

fstest create testfile 0644
fstest chown testfile 65534 65533
fstest -u 65534 -g 65532 chown testfile 65534 65532
sleep 5
fstest lstat testfile uid,gid

The second chown makes the modified ceph-mds crash. Then run the unmodified ceph-mds. Without this fix, 'fstest lstat testfile uid,gid' returns 65534,65533; with this fix, it returns 65534,65532.

@gregsfortytwo
Member

Yeah, we need to automate that test. I'm not sure if we should just add another config kill point or if the timing is certain enough that we can just kill the mds via a ceph-qa-suite task set up to push this process.
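
For reference, a config kill point in the MDS is just an option that makes the daemon abort at a chosen step, so a QA task can trigger it and then exercise the recovery path. A rough sketch of the pattern follows; the option name mds_kill_setattr_at is hypothetical, not an existing option.

#include <cstdlib>

struct Config {
  int mds_kill_setattr_at = 0;  // hypothetical option name
};
static Config g_conf;

void journal_and_reply_setattr(/* MDRequestRef& mdr, ... */) {
  // A QA task sets the option, issues the chown, and waits for the MDS to
  // die, reproducing the "crash before journaling setattr" scenario above.
  if (g_conf.mds_kill_setattr_at == 1)
    std::abort();
  // ... normally journal the setattr and reply to the client ...
}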

During MDS recovery, the client may send the same cap flushes twice. The
first time is when the MDS enters the reconnect stage (only for flushing
caps that are also being revoked); the second time is when the MDS goes
active. If we send cap flushes when entering the reconnect stage, we
should avoid sending them again.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
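
A minimal sketch of the behaviour this commit message describes, assuming a per-flush bookkeeping flag; the names (CapFlush, sent_during_reconnect, maybe_send_cap_flush) are invented for illustration and are not the actual client code.

#include <cstdint>

enum class MDSState { RECONNECT, ACTIVE };

struct CapFlush {
  uint64_t flush_tid = 0;
  bool sent_during_reconnect = false;  // hypothetical bookkeeping flag
};

void maybe_send_cap_flush(CapFlush& f, MDSState mds_state) {
  if (mds_state == MDSState::RECONNECT) {
    // only flushing caps that are also being revoked are flushed here
    f.sent_during_reconnect = true;
    // send_flush(f); ...
  } else if (mds_state == MDSState::ACTIVE) {
    if (f.sent_during_reconnect)
      return;  // already sent while the MDS was reconnecting; skip the resend
    // send_flush(f); ...
  }
}
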
The client may flush and drop caps at the same time. If the client needs
to send the cap reconnect before the caps get flushed, the issued caps in
the cap reconnect do not include the flushing caps. When choosing lock
states we should consider the flushing caps.

The check for caps that haven't been flushed is wrong; fix it.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
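
A small sketch of the idea above, with made-up function names and cap bit values: when deciding lock states from a cap reconnect, treat caps the client is still flushing as if they were issued, since the reconnect's issued field may omit them.

#include <cstdint>

// illustrative bit values, not the real cap constants
constexpr uint32_t CAP_FILE_WR = 1u << 3;
constexpr uint32_t CAP_FILE_BUFFER = 1u << 4;

uint32_t caps_for_choosing_lock_state(uint32_t reconnect_issued,
                                      uint32_t flushing) {
  // the reconnect's "issued" field may omit caps the client is still
  // flushing, so take the union before deciding on a lock state
  return reconnect_issued | flushing;
}

bool file_lock_needs_writeable_state(uint32_t reconnect_issued,
                                     uint32_t flushing) {
  uint32_t caps = caps_for_choosing_lock_state(reconnect_issued, flushing);
  return (caps & (CAP_FILE_WR | CAP_FILE_BUFFER)) != 0;
}
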
@ukernel ukernel force-pushed the jewel-14254 branch 3 times, most recently from a750c36 to 23c7923 on January 13, 2016 at 14:24
When handling client caps in the clientreplay stage, it's possible that
the corresponding inode does not exist because the client request which
creates the inode hasn't been replayed. To handle this corner case, we
delay handling the caps message until the corresponding inode is created.

Fixes: ceph#14254
Signed-off-by: Yan, Zheng <zyan@redhat.com>
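
A simplified sketch of this corner case, reusing the cap_reconnect_waiters map from the hunk reviewed above but with stand-in types and helpers (handle_client_caps and on_inode_created are illustrative, not the real MDS signatures): a cap message whose inode is missing during clientreplay is parked and retried once the replayed request creates the inode.

#include <cstdint>
#include <functional>
#include <list>
#include <map>

using inodeno_t = uint64_t;
struct CInode {};
struct MClientCaps { inodeno_t ino; };

std::map<inodeno_t, std::list<std::function<void()>>> cap_reconnect_waiters;
std::map<inodeno_t, CInode*> inodes;  // stand-in for the MDCache lookup
bool in_clientreplay = true;

void handle_client_caps(MClientCaps* m) {
  auto it = inodes.find(m->ino);
  CInode* in = (it == inodes.end()) ? nullptr : it->second;
  if (!in && in_clientreplay) {
    // the create for this inode hasn't been replayed yet; park the message
    cap_reconnect_waiters[m->ino].push_back([m] { handle_client_caps(m); });
    return;
  }
  // ... normal cap processing against *in ...
}

void on_inode_created(inodeno_t ino, CInode* in) {
  inodes[ino] = in;
  auto w = cap_reconnect_waiters.find(ino);
  if (w == cap_reconnect_waiters.end())
    return;
  auto waiters = std::move(w->second);
  cap_reconnect_waiters.erase(w);
  for (auto& retry : waiters)
    retry();  // re-run the delayed handlers now that the inode exists
}
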
The client re-sends cap flushes when the MDS restarts. The cap flush
message may release some caps even if the corresponding flush is already
completed.

Fixes: ceph#13546
Signed-off-by: Yan, Zheng <zyan@redhat.com>
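
One reading of the commit message above, sketched with invented helper names and simplified state: a re-sent flush whose tid was already completed before the restart is not flushed again, but any cap release carried by the message still has to be applied.

#include <cstdint>
#include <set>

struct ResentCapFlush {
  uint64_t flush_tid;
  uint32_t caps_still_held;  // caps the client reports it still holds
};

struct CapState {
  std::set<uint64_t> completed_flush_tids;
  uint32_t issued = 0;
};

void handle_resent_cap_flush(CapState& cap, const ResentCapFlush& m) {
  if (!cap.completed_flush_tids.count(m.flush_tid)) {
    // ... journal the flush, ack the client, record the tid ...
    cap.completed_flush_tids.insert(m.flush_tid);
  }
  // Even when the flush was already completed before the restart, the
  // message may carry a cap release, so shrink "issued" to what the
  // client still holds.
  cap.issued &= m.caps_still_held;
}
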
The option creates long-standing unsafe MDS requests. It helps in testing
corner cases related to unsafe requests.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
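
The commit message does not spell out the mechanism, so the following is only a guess at the shape of such a test option; the name mds_hold_safe_replies_for_testing and the reply flags are hypothetical. The idea is to withhold the safe reply so requests stay in the unsafe-replied state long enough to hit the corner cases being tested.

#include <cstdint>

struct Config {
  // hypothetical option name; a probability or count would work equally well
  bool mds_hold_safe_replies_for_testing = false;
};
static Config g_conf;

struct MDRequest {
  bool unsafe_replied = false;
  bool safe_replied = false;
};

void reply_to_client(MDRequest& mdr) {
  mdr.unsafe_replied = true;   // early (unsafe) reply goes out immediately
  if (g_conf.mds_hold_safe_replies_for_testing)
    return;                    // keep the request outstanding as "unsafe"
  mdr.safe_replied = true;     // normally follows the journal commit
}
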
@ukernel
Contributor Author

ukernel commented Jan 15, 2016

test ceph/ceph-qa-suite#798

gregsfortytwo added a commit that referenced this pull request Jan 18, 2016
…-fs-testing

#7199

Reviewed-by: Greg Farnum <gfarnum@redhat.com>
gregsfortytwo added a commit that referenced this pull request Jan 20, 2016
mds: fix client cap/message replay order on restart

Reviewed-by: Greg Farnum <gfarnum@redhat.com>
@gregsfortytwo gregsfortytwo merged commit c63ee05 into ceph:jewel Jan 20, 2016
@ghost changed the title from "Jewel 14254" to "mds: fix client cap/message replay order on restart" on Feb 10, 2016
@ukernel ukernel deleted the jewel-14254 branch March 10, 2016 07:50