Changefeeds sometimes erroneously believe a table is resharding #4838

Closed
mlucy opened this issue Sep 14, 2015 · 11 comments

@mlucy
Member

mlucy commented Sep 14, 2015

This was reported in the Addimation channel. They're getting the "Unable to retrieve the start stamps. Did you just reshard?" error message long after resharding is done. There are multiple possible causes for this, but the most likely right now seems to be a mistake in our error handling when the first changefeed subscription to a table happens just as the table reshards. There are probably three things to do here:

  • Improve the error reporting in this case. Multiple code paths currently produce the same error; they could produce distinct ones instead.
  • Visually inspect the logic I mentioned above to see if it's wrong.
  • Try to reproduce the bug.
@mlucy mlucy added the tp:bug label Sep 14, 2015
@mlucy mlucy added this to the 2.1.x milestone Sep 14, 2015
@weshoke

weshoke commented Sep 14, 2015

Here's a reconstructed timeline of what happened:

  1. Upgraded to 2.1.3 from 2.0.4 on Friday at 9pm
  2. Followed the rebuild secondary index instructions (http://www.rethinkdb.com/docs/troubleshooting/#my-secondary-index-is-outdated) on Sunday morning
  3. Noticed the "Unable to retrieve the start stamps. Did you just reshard?" errors soon after
  4. The errors persisted until the next day.
  5. Will soon reconfigure the table and see if that solves the error

@mlucy
Member Author

mlucy commented Sep 18, 2015

Note to self: an easy patch for this would probably be to just remove the entry in the changefeed_client_t for a particular table whenever we get this response for that table, since we know it's in an invalid state.
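
A minimal sketch of that idea, written in Python rather than the server's C++. Every name below (`FeedClient`, `StartStampError`, `open_feed`, `start_stamps`) is hypothetical and only stands in for the real `changefeed_client_t` logic: when the error comes back for a table, the cached entry for that table is dropped so the next attempt builds a fresh one.

```python
class StartStampError(Exception):
    """Hypothetical stand-in for the 'Unable to retrieve the start stamps' response."""


class FeedClient:
    """Hypothetical stand-in for changefeed_client_t: one cached feed per table."""

    def __init__(self, open_feed):
        self._open_feed = open_feed  # callable: table name -> feed object
        self._feeds = {}             # table name -> cached feed

    def subscribe(self, table, retries=1):
        for attempt in range(retries + 1):
            feed = self._feeds.get(table)
            if feed is None:
                feed = self._feeds[table] = self._open_feed(table)
            try:
                return feed.start_stamps()
            except StartStampError:
                # The cached entry is known to be invalid, so drop it; the
                # next attempt builds a fresh one instead of reusing it.
                self._feeds.pop(table, None)
                if attempt == retries:
                    raise
```

With `retries=0` the error still reaches the caller, but the stale entry is gone, so a later subscription to the same table starts from a clean state.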

@danielmewes
Member

Since this has happened again to another user, putting in that work-around sounds like a good idea to me if we can't find out why this is actually happening.

@mlucy
Copy link
Member Author

mlucy commented Sep 18, 2015

I think we know roughly why it's happening. It would still be good to try and reproduce it though.

@williamstein

(I'm the other user mentioned above.) For me this error message appears every time I do any resharding; fortunately I've always been able to resolve it by restarting every database server process (restarting clients has no effect). This bug means that any resharding of the tables in my rethinkdb cluster involves downtime, which is annoying for a rapidly growing operational website. Thanks for looking into this!

@mlucy
Member Author

mlucy commented Sep 28, 2015

I'm having some trouble reproducing this, which is weird since @williamstein said it happens every time he does resharding. I'm trying to reproduce it by making a two-node cluster, opening a few hundred changefeed subscriptions per second, and then either resharding or kill -9ing one of the servers.
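
A rough sketch of that kind of reproduction loop, assuming the subscription and the disruption are passed in as plain callables. The names `stress`, `open_feed` and `disrupt` are hypothetical, and this is not the script actually used here: it opens many subscriptions, triggers the disruption once part-way through, and collects only the errors that carry the start-stamps message.

```python
import threading
import time

ERROR_TEXT = "Unable to retrieve the start stamps"


def stress(open_feed, disrupt, subscriptions=200, disrupt_after=100, delay=0.0):
    """Open `subscriptions` feeds, call `disrupt` once part-way through,
    and return the error messages that contain ERROR_TEXT."""
    hits = []
    lock = threading.Lock()

    def worker():
        try:
            open_feed()
        except Exception as exc:
            # Only the start-stamps error is of interest; ignore the rest.
            if ERROR_TEXT in str(exc):
                with lock:
                    hits.append(str(exc))

    threads = []
    for i in range(subscriptions):
        if i == disrupt_after:
            disrupt()  # e.g. reshard the table or kill one server
        t = threading.Thread(target=worker)
        t.start()
        threads.append(t)
        time.sleep(delay)  # spacing between subscriptions
    for t in threads:
        t.join()
    return hits
```

Against a real cluster, `open_feed` would presumably open its own connection and run a `changes()` query on the table, while `disrupt` would reshard the table or kill one of the server processes; both are left to the caller here.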

A few possibilities off the top of my head:

  • It's timing-related, and I need either heavier load or more network latency to catch it.
  • It only happens with a particular type of changefeed, or when changefeeds are used in conjunction with some other feature.
  • It only shows up for some platforms/configurations. (@williamstein, you're running on Linux, right?)

Tracking down the root source of this will probably take a while, so in the meantime I think we should just do the quick patch so that people experiencing the issue can still reshard or restart a single node under load.

@williamstein

mlucy -- send me an email at wstein@sagemath.com and I could set things up so you can replicate this on a clone of the VM I was using. I was using Ubuntu 15.04, three rethinkdb processes (all on the same machine), etc.

@mlucy
Member Author

mlucy commented Sep 29, 2015

I've been looking into this a little bit more, and I think I found the problem (by code inspection). Currently putting together a fix for the problem + the patch we discussed to automatically reset the feed if this error occurs again.

@williamstein -- a testing environment would be great, thanks! I'll send you an email.

@mlucy
Member Author

mlucy commented Sep 29, 2015

A fix for what I think is the problem, plus the retrying patch mentioned above, is in CR 3256 by @danielmewes.

@danielmewes danielmewes modified the milestones: 2.1.5, 2.1.x Oct 1, 2015
@mlucy
Member Author

mlucy commented Oct 2, 2015

The changes that I think fix this are in next and 2.1.x.

@danielmewes
Member

@williamstein tested a pre-release binary with the fix and was able to confirm that the issue no longer occurred. Closing.
