
Our main memory increases relatively fast with the database size #1951

Closed
danielmewes opened this issue Feb 17, 2014 · 8 comments

@danielmewes
Member

(The title should be: "Our main memory overhead increases relatively fast with the database size")

For each block of a table (no matter whether it is on disk or in the cache), we keep an LBA entry in main memory. That entry is defined in src/serializer/log/lba/in_memory_index.hpp. It contains

```cpp
flagged_off64_t offset;
repli_timestamp_t recency;
uint32_t ser_block_size;
```

and has a size of 20 bytes.
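
For illustration, here is a simplified sketch of such an entry (the typedefs are stand-ins for the real RethinkDB types, and the packing attribute is only there to make the 20-byte layout explicit; the actual definition lives in src/serializer/log/lba/in_memory_index.hpp):

```cpp
#include <cstdint>

// Simplified stand-ins for the real types; both are 64-bit values.
typedef uint64_t flagged_off64_t;   // file offset with flag bits folded in
typedef uint64_t repli_timestamp_t; // replication timestamp ("recency")

// One in-memory LBA entry per block (packed, so the 8 + 8 + 4 = 20 byte
// figure from above is visible; the real layout may differ in detail).
struct __attribute__((packed)) lba_entry_t {
    flagged_off64_t offset;
    repli_timestamp_t recency;
    uint32_t ser_block_size;
};

static_assert(sizeof(lba_entry_t) == 20, "20 bytes of metadata per block");
```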

In addition, the alt cache duplicates the recency (which is a 64 bit value) of every block, leading to an overhead of 28 bytes per block.

Assuming an average value size between 250 and 512 bytes (which is the worst case, but probably not untypical), we use at least one block per 512 bytes of data.
The overhead is then 28 bytes / 512 bytes ≈ 0.055, i.e. roughly 5% of the stored data size.

For a quite realistic 1 TB of data, this meta-data alone requires about 50 GB of main memory (and that doesn't buy any cache yet; it's just "lost" space). 10 TB of mass storage on a single node is becoming common, but 500+ GB of main memory is not.
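
A quick back-of-the-envelope check of that figure (assuming 10^12 bytes per TB):

```cpp
#include <cstdio>

int main() {
    // Worst-case assumption from above: one block per 512 bytes of data,
    // with 28 bytes of in-memory metadata per block (20 in the LBA entry
    // plus the 8-byte recency duplicated in the cache).
    const double bytes_per_block = 512.0;
    const double metadata_per_block = 28.0;
    const double data_bytes = 1e12; // 1 TB of table data

    const double fraction = metadata_per_block / bytes_per_block;
    printf("overhead fraction: %.3f\n", fraction);                        // ~0.055
    printf("metadata for 1 TB: ~%.0f GB\n", data_bytes * fraction / 1e9); // ~55 GB
    return 0;
}
```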

I think we will sooner or later have to come up with a way to reduce this overhead.

One very easy option would be to put these memory structures into a memory-mapped file if the database is over a certain size. Then the operating system could swap out parts of the index.
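
Purely to illustrate the idea (hypothetical helper, simplified entry type, POSIX mmap; not a concrete proposal for how it would be wired into the serializer):

```cpp
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

struct __attribute__((packed)) lba_entry_t {
    uint64_t offset;
    uint64_t recency;
    uint32_t ser_block_size;
};

// Back the in-memory index by a file instead of the heap, so that the OS
// can page out cold parts of it when memory gets tight.
lba_entry_t *map_lba_index(const char *path, size_t num_blocks) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    size_t bytes = num_blocks * sizeof(lba_entry_t);
    if (ftruncate(fd, bytes) != 0) { close(fd); return nullptr; }
    void *mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); // the mapping remains valid after closing the descriptor
    return mem == MAP_FAILED ? nullptr : static_cast<lba_entry_t *>(mem);
}
```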

Here's another (probably much faster) idea, though it isn't quite worked out yet:
In the absence of garbage collection, we could have blobs store physical file offsets rather than logical block IDs to reference blocks. Blobs (specifically large values > 250 bytes) would then not generate any entries in the LBA. This would work well, because whenever we modify a blob, we just rewrite it completely anyway. Thus we don't gain anything from the additional indirection of translating logical to physical block offsets through the LBA in the first place.
Garbage collection is a problem though, because it changes the physical location of a block in the file without going through the btree. I'm not quite sure yet how that could be integrated.

@danielmewes danielmewes added this to the backlog milestone Feb 17, 2014
@danielmewes
Member Author

A third relatively easy option is to reduce the amount of stored information by making the following optimizations:

  • Don't store recency for blob blocks. They can just use the recency of the btree leaf node that they are attached to. Internally we would probably have two LBA lists in the serializer, one for blocks that have a recency and one for blocks that don't. We would partition the space of block ids into something like two halves, to identify which block id belongs where (and the cache would assign block ids from the different pools accordingly).
  • ser_block_size doesn't need to be 32 bits. 16 bits would probably be enough, maybe 24.
  • flagged_off64_t also doesn't need the full 64 bits.

That would easily cut the overhead in half.

@coffeemug coffeemug modified the milestones: 2.x, backlog Feb 17, 2014
@coffeemug
Contributor

> This would work well, because whenever we modify a blob, we just rewrite it completely anyway.

Note that this may not be the case forever, so I wouldn't count on this behavior. There is a lot of demand for incremental document updates (i.e. incrementing a counter in a large doc shouldn't affect the whole doc). We can get away with things as they are for now, but I don't think we'll be able to get away with it forever.

@danielmewes
Member Author

@coffeemug: That's ok though. We can still update the recency of the leaf node whenever we (partially) modify a blob. We just couldn't selectively backfill only a part of the blob.

(Edit: My initial reply missed the point. Replaced by something better)

@danielmewes
Member Author

Working on this now.
I'm cutting out the replication timestamp for blob blocks, which saves 16 bytes per such block (8 bytes in the in-memory LBA and 8 bytes in the buffer cache).
I'm also reducing the block size field from 32 to 16 bits. We currently only use block sizes of up to 4 KB, so this is still plenty.

I'm keeping the on-disk format unchanged, so that we can easily revert any of these changes later if for example we want to support blocks of more than 64 KB at some point.
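
The entries for blob blocks then shrink to roughly the following (again a simplified sketch, not the literal code):

```cpp
#include <cstdint>

// LBA entry for blob blocks after the change: no recency timestamp (they
// inherit it from the btree leaf node that references them), and only a
// 16-bit serialized block size.
struct __attribute__((packed)) blob_lba_entry_t {
    uint64_t offset;          // still a full file offset
    uint16_t ser_block_size;  // enough for block sizes up to 64 KB
};

static_assert(sizeof(blob_lba_entry_t) == 10, "10 bytes instead of 20");

// Together with no longer duplicating the 8-byte recency in the cache, the
// per-blob-block overhead drops from 28 to ~10 bytes, i.e. from ~5% to ~2%
// of the data size in the 512-bytes-per-block worst case.
```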

@danielmewes danielmewes self-assigned this Sep 5, 2015
@danielmewes
Member Author

These changes should get the worst-case overhead from ~5% down to ~2%.

@danielmewes danielmewes modified the milestones: 2.2-polish, 2.2 Sep 5, 2015
@danielmewes
Member Author

I have a working version in branch daniel_1951.
The branch also includes a few unrelated improvements to the cache's memory usage accounting that keep the actually used cache memory significantly closer to the specified cache size (though it's still not quite identical).

This change has a slightly strange interaction with existing tables:
With the change, the in-memory copy of the LBA now has two arrays from block IDs to block infos: one for regular blocks (IDs starting at 0), and one for blob blocks (IDs now starting at 2^63). The array for blob blocks uses the more compact structure that omits the recency timestamp.
Every newly written blob is going to use block IDs from the new blob ID range and is going to end up in the blob array. As more and more values of a table get rewritten, most of the block IDs used for blobs will move over into the blob LBA array.
The weird side effect of this is that the regular LBA array doesn't actually shrink as long as there's at least one block ID close to the end of the block ID range that's still in use by a regular block. So even if values in an existing table are getting rewritten, this change is not going to save as much memory as one might expect. (Though it will still save roughly 8 bytes per block due to an additional optimization.)
The change is therefore going to be most effective for new tables, or for tables that are still growing due to inserts.
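
To illustrate how the block ID space is split (hypothetical helper, not the actual serializer code):

```cpp
#include <cstdint>

// Block IDs below 2^63 index the regular LBA array (entries with a recency
// timestamp); IDs at or above 2^63 index the blob LBA array (compact
// entries without a recency).
const uint64_t FIRST_BLOB_BLOCK_ID = uint64_t(1) << 63;

inline bool is_blob_block_id(uint64_t block_id) {
    return block_id >= FIRST_BLOB_BLOCK_ID;
}

inline uint64_t array_index(uint64_t block_id) {
    // Index into whichever of the two arrays the ID belongs to.
    return is_blob_block_id(block_id) ? block_id - FIRST_BLOB_BLOCK_ID : block_id;
}
```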

@danielmewes
Member Author

In CR 3241 by @Tryneus.

@danielmewes danielmewes added this to the 2.2 milestone Oct 15, 2015
@danielmewes danielmewes removed this from the 2.2-polish milestone Oct 15, 2015
@danielmewes
Member Author

Merged into next cfca276
