Alternative for Large Files #73

Closed · ippes opened this issue Mar 31, 2014 · 4 comments

Comments

@ippes
Contributor

ippes commented Mar 31, 2014

Thomas proposed a further improvement which might become very popular with our growing fan base. H2H currently has no limitation on file size (according to the file configuration). However, large files can flood the network and consume too many resources. An approach could be to provide the functionality we already implemented in the predecessor project Box2Box, similar to the BTSync approach: H2H would store only the metadata of large files. Syncing would be possible only when a node holding a copy of the large file is online.

@ippes added this to the Future Work milestone Mar 31, 2014
@nicoruti
Contributor

nicoruti commented Apr 1, 2014

This is exactly what I thought of. We don't need to change much to support this. Some clarification to point out the difference from small files (a rough sketch of both structures follows the lists below):

MetaFile (small files)

  • Each small file has a unique id (the location key of the meta file)
  • Has a list of versions
  • Each version has a list of key pairs to locate and decrypt the chunks

LargeMetaFile (large files)

  • Each large file has a unique id (the location key of the meta file)
  • Has a list of versions (optional)
  • Each version has a list of chunk stubs
  • Each chunk stub contains an offset and a hash (MD5)
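
A minimal sketch of how the two meta structures could look, assuming plain Java value classes (the class and field names are illustrative, not the actual Hive2Hive types):

```java
import java.security.KeyPair;
import java.util.List;

// Meta data for small files: chunks live in the DHT and are located/decrypted via key pairs.
class MetaFileSmall {
    String fileId;                    // location key of the meta file
    List<FileVersion> versions;       // full version history
}

class FileVersion {
    int versionIndex;
    List<KeyPair> chunkKeys;          // one key pair per chunk to locate and decrypt it
}

// Meta data for large files: only chunk stubs are stored, the data stays on the peers.
class MetaFileLarge {
    String fileId;                    // location key of the meta file
    List<LargeFileVersion> versions;  // possibly only a single version (versioning is still open)
}

class LargeFileVersion {
    int versionIndex;
    List<ChunkStub> chunkStubs;
}

class ChunkStub {
    long offset;                      // byte offset of the chunk within the file
    byte[] md5Hash;                   // MD5 hash to verify the downloaded chunk
}
```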

Clients that have access to this file and are currently online are contacted with a request for a single chunk of the large file. The request contains the following:

  • file id
  • version id
  • offset
  • number of bytes to read (or use the chunk size from the file configuration)

The requested client checks its filesystem for the existence of the file. If the file exists, it reads from the offset for the given length and returns that file part. It sends a response to the requester containing the following attributes (a sketch of both messages follows the list below):

  • file id
  • byte[] data
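
A rough sketch of the two messages and of how the contacted client could serve a part without loading the whole file into memory (the names are hypothetical; the actual H2H message classes may differ):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.Serializable;

// Request sent to an online client that holds a copy of the large file.
class ChunkRequest implements Serializable {
    String fileId;
    int versionId;   // only relevant if large files end up being versioned
    long offset;     // where to start reading
    int length;      // bytes to read, e.g. the chunk size from the file configuration
}

// Response returned to the requester.
class ChunkResponse implements Serializable {
    String fileId;
    byte[] data;
}

class ChunkServer {
    // On the requested client: read only the requested part from disk,
    // so the full file is never held in memory.
    static ChunkResponse serveChunk(File file, ChunkRequest req) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            byte[] buffer = new byte[req.length];
            raf.seek(req.offset);
            int read = raf.read(buffer, 0, req.length);  // may be shorter at the end of the file
            ChunkResponse resp = new ChunkResponse();
            resp.fileId = req.fileId;
            resp.data = (read == req.length) ? buffer : java.util.Arrays.copyOf(buffer, Math.max(read, 0));
            return resp;
        }
    }
}
```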

The requester can verify the correctness of the file-part using the hash in the meta file.
Note that the client needs to ensure that the file chunks are concatenated in the correct order (same as when downloading a small file from the DHT).
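
The verification step could look roughly like this, assuming the MD5 hash from the chunk stub is at hand (standard java.security.MessageDigest; purely a sketch):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class ChunkVerifier {
    // Compare the MD5 of the received data against the hash stored in the meta file.
    static boolean isValid(byte[] receivedData, byte[] expectedMd5) throws NoSuchAlgorithmException {
        byte[] actual = MessageDigest.getInstance("MD5").digest(receivedData);
        return Arrays.equals(actual, expectedMd5);
    }
}
```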

What I don't know yet is whether we need versioning for large files. It could decrease sync performance dramatically: not only do clients need to be online at the same time, they also need to have the same version.

An improvement of the solution above is that the client stores the file parts at a temporary location. It can then go offline and later continue fetching the remaining file parts without re-fetching the parts it already has.
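
One way to sketch this resumable behaviour is to write each verified part directly at its offset in a temporary file and to remember which parts are already present (illustrative only, not the actual implementation):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashSet;
import java.util.Set;

class ResumableDownload {
    private final File tempFile;
    private final Set<Long> completedOffsets = new HashSet<>();

    ResumableDownload(File tempFile) {
        this.tempFile = tempFile;
    }

    // Write a verified part at its offset; the set of completed offsets could be
    // persisted so the download can be resumed after the client was offline.
    synchronized void storePart(long offset, byte[] data) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(tempFile, "rw")) {
            raf.seek(offset);
            raf.write(data);
        }
        completedOffsets.add(offset);
    }

    synchronized boolean isComplete(Set<Long> allOffsets) {
        return completedOffsets.containsAll(allOffsets);
    }
}
```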

We need to ensure that every operation works at the file-part level and that it is never necessary to hold the full file in memory (which would cause trouble when handling large files). BTW: we already considered this when syncing small files.

@0vermind

0vermind commented Apr 3, 2014

What I don't know yet is whether we need versioning at large files.

I vote for not having it.

Note that the client needs to ensure that the file chunks are concatenated in the correct order (same as when downloading a small file from the DHT).

But will it be possible to download chunks out of order?

May I ask why we are not using an existing protocol like torrent to manage chunks and downloads?

@nicoruti
Contributor

nicoruti commented Apr 4, 2014

Thanks @0vermind for the comment. I also think that large files should not be versioned in the first iteration.

We already implemented a torrent-like protocol, which downloads file parts and assembles them as soon as all chunks are available. The implementation was done by ourselves, without relying on an external library. Later on, we could consider using an optimized protocol that also supports, for example, compression or erasure coding (see #56).
As this split-and-reassembly functionality is reused in multiple use cases, we should pull it out of the current code and refactor it to make it reusable.

To your question whether it will be possible to download chunks out of order: yes, this is possible. It even goes one step further and allows a parallel download of the chunks (same as in torrent).
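
A minimal sketch of how out-of-order, parallel chunk fetching could look, reusing the hypothetical ChunkStub and ResumableDownload types from above and a placeholder fetchChunk call that would contact an online peer:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ParallelChunkFetcher {
    private final ExecutorService pool = Executors.newFixedThreadPool(4); // parallelism is a guess

    // Each chunk is fetched independently and written at its own offset,
    // so the order in which responses arrive does not matter.
    void fetchAll(List<ChunkStub> stubs, ResumableDownload download) throws InterruptedException {
        for (ChunkStub stub : stubs) {
            pool.submit(() -> {
                try {
                    byte[] data = fetchChunk(stub);         // hypothetical: ask an online peer
                    download.storePart(stub.offset, data);  // writing at the offset makes ordering irrelevant
                } catch (Exception e) {
                    // a failed or unverifiable chunk can simply be retried later
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private byte[] fetchChunk(ChunkStub stub) {
        throw new UnsupportedOperationException("network call omitted in this sketch");
    }
}
```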

@nicoruti
Contributor

A basic implementation for the support of large files has been done. Every download of a file is now handled by the DownloadManager. Tasks can be submitted to the manager; a task may be to download a file from the DHT or to download it directly from another user (in the case of a large file).
Downloads of chunks of a single file can run in parallel. Each chunk is stored in a temporary file, and the file is assembled once all parts are available. It is also possible to download multiple files at the same time.
Currently, a constant (25) defines how many downloads can run concurrently; other downloads wait until a free slot is available.
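
A simplified sketch of how such a slot limit could be realized, e.g. with a fixed thread pool of 25 workers so that additional tasks queue up until a slot is free (illustrative only; the real DownloadManager may be structured differently):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class DownloadManagerSketch {
    private static final int MAX_CONCURRENT_DOWNLOADS = 25;
    private final ExecutorService slots = Executors.newFixedThreadPool(MAX_CONCURRENT_DOWNLOADS);

    // A task may download a file from the DHT (small file) or directly
    // from another user (large file); both are just Runnables here.
    Future<?> submit(Runnable downloadTask) {
        // Tasks beyond the 25 concurrent slots wait in the executor's queue.
        return slots.submit(downloadTask);
    }

    void shutdown() {
        slots.shutdown();
    }
}
```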

When the user logs out, active downloads are serialized and stored. When they log in again, the downloads can be continued without re-downloading all parts. This has not been heavily tested and needs further improvements (the synchronization step during login does not yet know that the download is already running and may initiate another download).
