Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data disappeared overnight #3717

Closed
nuin opened this issue Feb 3, 2015 · 16 comments
Closed

Data disappeared overnight #3717

nuin opened this issue Feb 3, 2015 · 16 comments
Labels
Milestone

Comments

@nuin
Copy link

nuin commented Feb 3, 2015

All data on all my tables disappeared overnight, without any action or problem on the server host. I am running a Debian box, with rethinkdb 1.15.3~0wheezy (GCC 4.7.2), that was working fine since January 13th, with a quick shutdown on January 27th. No table was created or modified lately, only actions were data retrieval. Here is the log I have since I updated to version 1.15.3 on Jan 13th

2015-01-13T16:14:32.841158750 0.100764s info: Running rethinkdb 1.15.3~0wheezy (GCC 4.7.2)...
2015-01-13T16:14:32.845317362 0.104923s info: Running on Linux 3.2.0-4-amd64 x86_64
2015-01-13T16:14:32.845416742 0.105022s info: Using cache size of 100 MB
2015-01-13T16:14:32.845629016 0.105234s warn: Cache size does not leave much memory for server and query overhead (available memory: 216 MB).
2015-01-13T16:14:32.845673287 0.105278s warn: Cache size is very low and may impact performance.
2015-01-13T16:14:32.845702362 0.105307s info: Loading data from directory /var/lib/rethinkdb/default
2015-01-13T16:14:33.757424347 1.017029s info: Listening for intracluster connections on port 29015
2015-01-13T16:14:33.758265697 1.017871s info: Listening for client driver connections on port 28015
2015-01-13T16:14:33.758457092 1.018062s info: Listening for administrative HTTP connections on port 8080
2015-01-13T16:14:33.759964532 1.019569s info: Listening on addresses: 127.0.0.1, 192.168.1.83, ::1, fe80::213:72ff:fe1d:5eda%2
2015-01-13T16:14:33.759974299 1.019579s info: Server ready
2015-01-13T16:14:58.870881493 26.130486s info: Server got SIGINT from pid 783, uid 0; shutting down...
2015-01-13T16:14:58.871083899 26.130689s info: Shutting down client connections...
2015-01-13T16:14:58.871403944 26.131009s info: All client connections closed.
2015-01-13T16:14:58.871419151 26.131024s info: Shutting down storage engine... (This may take a while if you had a lot of unflushed data in the writeback cache.)
2015-01-13T16:14:58.871423666 26.131028s info: Storage engine shut down.
2015-01-27T13:35:50.476492926 6.277984s info: Running rethinkdb 1.15.3~0wheezy (GCC 4.7.2)...
2015-01-27T13:35:50.511900653 6.313392s info: Running on Linux 3.2.0-4-amd64 x86_64
2015-01-27T13:35:50.511990025 6.313481s info: Using cache size of 1356 MB
2015-01-27T13:35:50.512194999 6.313686s info: Loading data from directory /var/lib/rethinkdb/default
2015-01-27T13:35:58.629360326 14.430851s info: Listening for intracluster connections on port 29015
2015-01-27T13:36:00.191177816 15.992669s info: Listening for client driver connections on port 28015
2015-01-27T13:36:00.191413534 15.992904s info: Listening for administrative HTTP connections on port 8080
2015-01-27T13:36:00.192938403 15.994429s info: Listening on addresses: 127.0.0.1, 192.168.1.83, ::1, fe80::213:72ff:fe1d:5eda%2
2015-01-27T13:36:00.192942170 15.994433s info: Server ready
2015-02-03T09:22:50.337462823 589626.138954s info: Server got SIGINT from pid 31406, uid 0; shutting down...
2015-02-03T09:22:50.337628372 589626.139120s info: Shutting down client connections...
2015-02-03T09:22:50.360407536 589626.161899s info: All client connections closed.
2015-02-03T09:22:50.360415942 589626.161907s info: Shutting down storage engine... (This may take a while if you had a lot of unflushed data in the writeback cache.)
2015-02-03T09:22:50.399284094 589626.200775s info: Storage engine shut down.
2015-02-03T09:22:56.998502949 0.103864s info: Running rethinkdb 1.15.3~0wheezy (GCC 4.7.2)...
2015-02-03T09:22:57.003100342 0.108462s info: Running on Linux 3.2.0-4-amd64 x86_64
2015-02-03T09:22:57.003201507 0.108562s info: Using cache size of 174 MB
2015-02-03T09:22:57.003415293 0.108776s info: Loading data from directory /var/lib/rethinkdb/default
2015-02-03T09:22:57.308257603 0.413618s info: Listening for intracluster connections on port 29015
2015-02-03T09:22:57.389390003 0.494751s info: Listening for client driver connections on port 28015
2015-02-03T09:22:57.389685721 0.495046s info: Listening for administrative HTTP connections on port 8080
2015-02-03T09:22:57.391257387 0.496618s info: Listening on addresses: 127.0.0.1, 192.168.1.83, ::1, fe80::213:72ff:fe1d:5eda%2
2015-02-03T09:22:57.391267786 0.496628s info: Server ready
2015-02-03T09:26:15.391612287 198.496973s info: Server got SIGINT from pid 31790, uid 0; shutting down...
2015-02-03T09:26:15.391864009 198.497225s info: Shutting down client connections...

Any help appreciated.

@mglukhovsky
Copy link
Member

@nuin: sorry you ran into this, we'll do our best to sort it out. Can you describe a few details about your server, namely:

  • how much memory is on your server
  • the output of ls -lh /var/lib/rethinkdb/default

@danielmewes may ask for other details that will help us understand what's going on.

@nuin
Copy link
Author

nuin commented Feb 3, 2015

Hi @mglukhovsky, our server is small and it has 4Gb of memory, only a couple of tables are mainly active, with maybe 10-20 queries day.

The output of ls won't help you at the moment. I had to update to the newer version in order to be able to load data from a backup as we still don't have a backup hardware system. But I remember listing the contents of the directory and seeing the same files as I had before the data disappearance, but all files had a smaller size than before.

I updated to version rethinkdb 1.16.0+1~0wheezy, including the Python module. Data load was fine and quick, but there's an issue on the Dashboard, as all tables appear empty. Checking on the data explorer, data seems fine.

@danielmewes
Copy link
Member

Hi @nuin , your log file lists the last restart for 2015-02-03 09:22. Was that the one after which your data had disappeared?
Did the data disappear after you had restarted the server or while the server was still running? If you restarted the server: Did you restart through the init script (e.g. /etc/init.d/rethinkdb restart) or did you start RethinkDB manually on the command line (i.e. with the rethinkdb command)?

When the data wasn't there, did the tables still exist and were simply empty? Or were the tables gone as well?
Did you check whether secondary indexes still existed by any chance?

With the old data directory being lost, it's going to be pretty difficult to figure out how this happened exactly. We're going to think through any possibilities though.

@danielmewes danielmewes added this to the 1.16.x milestone Feb 3, 2015
@nuin
Copy link
Author

nuin commented Feb 3, 2015

Hi @danielmewes

The chronology of the events:

  • DB was restarted on Jan 27th. I moved my RethinkDB server to another room
  • Access was normal until yesterday
  • Around 9 am today I received a SysAid with an error on output of one of the apps that access the DB
  • Restarted the server after that and then did the update

When I checked this morning, all databases and tables were there, but all of them reported 0 documents, I tried the data explorer on the web server and nothing was returned from past history command.

No secondary indexes were available at the time I checked. On one of the tables, I had created a secondary index.

Sorry, that I overwrote the old data directory. But as this is our only DB server, I couldn't just point the queries to another box. I am lobbying for new hardware, so we have some redundancy. I will keep daily dumps, and in my case I can recreate the data from scripts, the only problem is that it takes some hours from scratch data.

I reported the error just in case someone runs into a similar situation, as it might help them and you to figure out. I use RethinkDB in my production apps, albeit all my programs run on a intranet, and I am quite happy with the performance and user-friendly features.

@danielmewes
Copy link
Member

Thanks a lot for your additional info.

Beginning with the restart in step 4, the observed behavior could stem from something deleting the table files (they have UUIDs as their file names) but not the metadata. After restarting, the RethinkDB server would have recreated the table files with empty data. Since secondary indexes are stored in the table files and not the metadata, they would have disappeared together with the data.
I think we should implement #1780 asap so at least we can get some indication if anything like this happens again whether RethinkDB deleted the files itself.

One thing that doesn't quite fit with that explanation though is that the application reported DB errors before restarting the server. If some process had deleted the files, the server would still have had the old files open and shouldn't have generated any query failures until restarted.

Maybe someone else has another idea (pinging @timmaxw @Tryneus ).

@nuin
Copy link
Author

nuin commented Feb 3, 2015

@danielmewes There were no real errors on the queries, they were just returned empty. The user was expecting a value X for an absolute value, and the program was showing 0. Sorry if I wasn't clear in my explanation.

@danielmewes
Copy link
Member

@nuin Ah thanks for your clarification. That's probably still an indication that the tables had suddenly become empty in a running server isn't it?

@nuin
Copy link
Author

nuin commented Feb 3, 2015

@danielmewes As I mentioned, we don't have many queries every day, and the last successful query (before the disappearance) was yesterday afternoon.

@coffeemug
Copy link
Contributor

@nuin -- one more question, which file system are you using?

@nuin
Copy link
Author

nuin commented Feb 3, 2015

@coffeemug RethinkDB is on a Debian: 3.2.0-4-amd64 #1 SMP Debian 3.2.63-2+deb7u2 x86_64 GNU/Linux. I was running 1.15.3 before the problem, just updated to 1.16.

@danielmewes
Copy link
Member

@nuin can you post the output of mount? Are you using ext4 or BTRFS (or something else)?

@nuin
Copy link
Author

nuin commented Feb 3, 2015

@danielmewes Here it is

sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
udev on /dev type devtmpfs (rw,relatime,size=10240k,nr_inodes=497694,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,noexec,relatime,size=399432k,mode=755)
/dev/mapper/MDLPortal-root on / type ext4 (rw,relatime,errors=remount-ro,user_xattr,barrier=1,data=ordered)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /run/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=798860k)
/dev/sda1 on /boot type ext2 (rw,relatime,errors=continue)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)

I have a couple of other Windows disks mounted on /mnt using cifs, but they are read only and not used for writing in this box

@danielmewes
Copy link
Member

Thanks @nuin . It looks like the data files would have been on ext4. That should be fine. We've had some issues with BTRFS, but I don't think we've had any lately with ext4.

@danielmewes
Copy link
Member

@nuin We just released RethinkDB 1.16.1, which now writes a log entry every time it deletes a file (as per #1780). I recommend updating, so that we get some additional data for the case this happens to you again.
Sorry we can't do more at the moment, but it's difficult to debug without having the data files.

@danielmewes
Copy link
Member

Unfortunately I don't think there's anything we can do here at the moment, since we don't have the data and no additional hints as to why this could have happened.

I'm going to close the issue, but we'll keep an eye on it and re-open as soon as we get any new data on this (@nuin also please re-open if anything remotely close to this happens again).

@danielmewes danielmewes removed this from the 1.16.x milestone Feb 13, 2015
@nuin
Copy link
Author

nuin commented Feb 13, 2015

Thanks, I will keep an eye on my side and updating things this coming week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants