Changefeeds sometimes erroneously believe a table is resharding #4838
Comments
Here's a reconstructed timeline of what happened:
|
Note to self: an easy patch for this would probably be to just remove the entry in the |
Since this has happened again to another user, putting in that work-around sounds like a good idea to me if we can't find out why this is actually happening. |
I think we know roughly why it's happening. It would still be good to try and reproduce it though. |
(I'm the other user mentioned above.) For me this error message appears every time I do any resharding; fortunately I've always been able to resolve it by restarting every database server process (restarting clients has no effect). This bug means that any resharding of the tables in my rethinkdb cluster involves downtime, which is annoying for a rapidly growing operational website. Thanks for looking into this! |
I'm having some trouble reproducing this, which is weird since @williamstein said it happens every time he does resharding. I'm trying to reproduce it by making a two-node cluster, opening a few hundred changefeed subscriptions per second, and then either resharding or
A few possibilities off the top of my head:
Tracking down the root source of this will probably take a while, so in the meantime I think we should just do the quick patch so that people experiencing the issue can still reshard or restart a single node under load. |
mlucy -- send me an email at wstein@sagemath.com and I could set things up so you can replicate this on a clone of the VM I was using. I was using Ubuntu 15.04, three rethinkdb processes (all on the same machine), etc. |
I've been looking into this a little bit more, and I think I found the problem (by code inspection). Currently putting together a fix for the problem + the patch we discussed to automatically reset the feed if this error occurs again. @williamstein -- a testing environment would be great, thanks! I'll send you an email. |
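For anyone who wants a client-side stopgap in the meantime, here is a minimal sketch of the same "reset the feed and try again" idea. It is not the server patch discussed above. It assumes the error reaches the client as an exception whose message contains the text reported in this issue, and `open_feed`, `handle_change`, and `consume_with_retry` are hypothetical names, not part of any driver API.

```python
import time

# The message reported in this issue. Matching on it is an assumption about
# how the error surfaces to a client.
START_STAMP_ERROR = "Unable to retrieve the start stamps"


def consume_with_retry(open_feed, handle_change, max_retries=5, delay=0.5,
                       sleep=time.sleep):
    """Consume a changefeed, reopening it when the start-stamp error occurs.

    open_feed: hypothetical zero-argument callable returning an iterable
        of changes (for example, a wrapper around a driver call).
    handle_change: callback invoked once per change.
    Returns the number of times the feed had to be reopened.
    """
    retries = 0
    while True:
        try:
            for change in open_feed():
                handle_change(change)
            return retries  # the feed ended normally
        except Exception as exc:
            # Re-raise anything that is not the error from this issue,
            # and give up once the retry budget is spent.
            if START_STAMP_ERROR not in str(exc) or retries >= max_retries:
                raise
            retries += 1
            sleep(delay)
```

Note that changes written while the feed is being reopened are not replayed, so a caller that needs a consistent view would have to re-read the table after each retry.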
A fix for what I think is the problem, plus the retrying patch mentioned above, is in CR 3256 by @danielmewes. |
The changes that I think fix this are in |
@williamstein tested a pre-release binary with the fix and was able to confirm that the issue no longer occurred. Closing. |
This was reported in the Addimation channel. They're getting the
Unable to retrieve the start stamps. Did you just reshard?
error message long after resharding is done. There are multiple possible causes for this, but the most likely right now seems to be a mistake in our error handling if your first changefeed subscription to a table happens just as the table reshards. There are probably three things to do here: