Our main memory increases relatively fast with the database size #1951
Comments
A third relatively easy option is to reduce the amount of stored information by making the following optimizations:
That would easily cut the overhead in half.
Note that this may not be the case forever, so I wouldn't count on this functionality. There is a lot of demand for incremental document updates (i.e. incrementing a counter in a large doc shouldn't affect the whole doc). We can get away with things as they are for now, but I don't think we'll be able to get away with it forever.
@coffeemug: That's ok though. We can still update the recency of the leaf node whenever we (partially) modify a blob. We just couldn't selectively backfill only a part of the blob. (Edit: My initial reply missed the point. Replaced by something better.)
Working on this now. I'm keeping the on-disk format unchanged, so that we can easily revert any of these changes later if, for example, we want to support blocks of more than 64 KB at some point.
These changes should get the worst-case overhead from ~5% down to ~2%.
I have a working version in a branch. This change has a slightly strange interaction with existing tables:
In CR 3241 by @Tryneus.
Merged into
(The title should be: "Our main memory overhead increases relatively fast with the database size")
For each block of a table (no matter whether it is on disk or in the cache), we keep an LBA entry in main memory. That entry is defined in src/serializer/log/lba/in_memory_index.hpp. It contains, among other things, the block's physical location in the file and its recency timestamp, and has a size of 20 bytes.
In addition, the alt cache duplicates the recency (which is a 64-bit value) of every block, leading to an overhead of 28 bytes per block.
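To make the 28-byte figure concrete, here is a minimal sketch of the per-block bookkeeping. The field names and exact layout are assumptions for illustration, not the actual definition from in_memory_index.hpp:

```cpp
#include <cstdint>

// Hypothetical layout, roughly matching the sizes described above.
struct lba_entry_sketch_t {
    uint64_t offset;          // where the block currently lives in the data file (8 bytes)
    uint64_t recency;         // timestamp of the block's last modification       (8 bytes)
    uint32_t ser_block_size;  // serialized size of the block                     (4 bytes)
};  // 20 bytes of payload per block (ignoring compiler padding)

// On top of that, the alt cache keeps its own copy of the 8-byte recency,
// bringing the total to roughly 28 bytes of main memory per block.
```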
Assuming an average value size of between 250 and 512 bytes (which is the worst case, but probably not unusual), we use at least one block per 512 bytes of data.
The overhead then is as follows:
28 bytes / 512 bytes ≈ 0.055, i.e. more than 5%.
For storing a quite realistic 1 TB of data, this metadata alone requires more than 50 GB of main memory (and that doesn't allow for any cache yet; it's just "lost" space). 10 TB of mass storage on a single node is becoming common, but 500+ GB of main memory is not.
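A quick sanity check of that arithmetic, as a minimal sketch (the 28-byte and 512-byte figures are the estimates from above):

```cpp
#include <cstdio>

int main() {
    // Assumptions from above: ~28 bytes of per-block metadata,
    // at least one block per 512 bytes of stored data.
    const double meta_bytes_per_block = 28.0;
    const double data_bytes_per_block = 512.0;
    const double overhead = meta_bytes_per_block / data_bytes_per_block;

    const double data_bytes = 1e12;  // 1 TB of stored data
    std::printf("overhead ratio: %.3f\n", overhead);                           // ~0.055
    std::printf("metadata for 1 TB: %.1f GB\n", overhead * data_bytes / 1e9);  // ~55 GB
    return 0;
}
```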
I think we will sooner or later have to come up with a way to reduce this overhead.
One very easy option would be to put these memory structures into a memory-mapped file if the database is over a certain size. Then the operating system could swap out parts of the index.
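A rough sketch of what that could look like, assuming the LBA entries were kept in a flat array backed by a file mapping. The function and type names here are made up for illustration; this is not how the current index is implemented:

```cpp
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

struct lba_entry_sketch_t {
    uint64_t offset;
    uint64_t recency;
    uint32_t ser_block_size;
};

// Back the index with a memory-mapped file so the OS can page out cold parts
// instead of pinning the whole index in main memory.
lba_entry_sketch_t *map_index(const char *path, size_t num_blocks) {
    size_t bytes = num_blocks * sizeof(lba_entry_sketch_t);
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, bytes) != 0) { close(fd); return nullptr; }
    void *p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after closing the descriptor
    return p == MAP_FAILED ? nullptr : static_cast<lba_entry_sketch_t *>(p);
}
```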
Here's another (probably much faster) idea, but it isn't quite worked-out yet:
In the absence of garbage collection, we could have blobs store physical file offsets rather than logical block ids to reference blocks. So blobs (large values > 250 bytes specifically) would not generate any entries in the LBA. This would work well, because whenever we modify a blob, we just rewrite it completely anyway. Thus we don't win anything from having the additional indirection of translating logical to physical block offsets through the LBA in the first place.
A problem is garbage collection though, because it changes the physical location of a block in the file without going through the btree. I'm not quite sure about how that could be integrated yet.
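To illustrate the idea (all names here are hypothetical; this is not the existing blob format): a blob's block references could carry either a logical block id, resolved through the LBA as today, or a raw physical file offset that needs no LBA entry at all.

```cpp
#include <cstdint>
#include <variant>

struct logical_ref_t  { uint64_t block_id; };     // resolved through the LBA
struct physical_ref_t { uint64_t file_offset; };  // resolved directly, no LBA entry needed

// A blob block reference under this scheme: small values keep using logical
// ids, large blob blocks point straight at their location in the file.
using blob_block_ref_t = std::variant<logical_ref_t, physical_ref_t>;

// Since a blob is rewritten completely on modification, all physical_ref_t
// values in it can simply be replaced wholesale. Garbage collection remains
// the open problem: it moves blocks without going through the btree, which
// would silently invalidate any physical_ref_t pointing at the old location.
```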