
Compound Secondary Index Speed #1526

Closed
linuxdynasty opened this issue Oct 8, 2013 · 41 comments

@linuxdynasty

While testing the speed of the queries in my product, I noticed a significant slowdown in certain queries that involve compound indexes.

For instance, this query (an eq_join with a compound index, no filter, using distinct()) took a total time of 0:00:03.790848:

r.table('apps_per_agent').get_all(['installed', 'default'], index='status_and_customer').eq_join(lambda x: [x['app_id'], 'no'], r.table('unique_applications'), index='appid_and_hidden').pluck({'right': 'app_id'}).distinct().run(conn)

By contrast, this query (an eq_join on the primary key followed by a filter) took a total time of 0:00:01.156872:

r.table('apps_per_agent').get_all(['installed', 'default'], index='status_and_customer').eq_join('app_id', r.table('unique_applications')).filter({'right': {'hidden': 'no'}}).pluck({'right': 'app_id'}).distinct().run(conn)

@jdoliner can add more details to this, since he was working with me on the issue.

I expected the compound index to be faster than the filter for the eq_join, but that is not the case here.
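
For context, the two compound indexes referenced in these queries would have been created along the following lines. This is a hypothetical sketch: only the index names appear in the issue, so the field names inside the lambdas are assumptions.

```python
import rethinkdb as r  # pre-2.4 Python driver import style

conn = r.connect('localhost', 28015)

# Assumed field names -- the actual documents are not shown in this issue.
r.table('apps_per_agent').index_create(
    'status_and_customer', lambda doc: [doc['status'], doc['customer']]).run(conn)

r.table('unique_applications').index_create(
    'appid_and_hidden', lambda doc: [doc['app_id'], doc['hidden']]).run(conn)

r.table('apps_per_agent').index_wait('status_and_customer').run(conn)
r.table('unique_applications').index_wait('appid_and_hidden').run(conn)
```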

@jdoliner
Contributor

jdoliner commented Oct 8, 2013

I'm sort of at a loss on this one. I did some cursory profiling work with perf top and nothing lit up. @danielmewes is going to do some more in-depth profiling.

@ghost ghost assigned danielmewes Oct 8, 2013
@coffeemug
Contributor

@linuxdynasty -- approximately how much data is in the DB? This might have to do with disk access rather than CPU.

Also, what does the performance look like if you run the queries twice (i.e. query A, then B, then A, then B again) or in reverse order (B, A)? It might be that query A warms up the cache, so it takes longer, and query B then operates entirely in RAM.
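
A minimal harness for that experiment might look like the sketch below. It reuses the two queries from the issue description and assumes the same connection details; timing a fully drained cursor with wall-clock time is just one reasonable choice.

```python
import time
import rethinkdb as r

conn = r.connect('localhost', 28015)

def timed(label, query):
    """Run a query, drain the result, and print the wall-clock time."""
    start = time.time()
    list(query.run(conn))  # force full evaluation
    print('%s: %.3fs' % (label, time.time() - start))

query_a = (r.table('apps_per_agent')
           .get_all(['installed', 'default'], index='status_and_customer')
           .eq_join(lambda x: [x['app_id'], 'no'],
                    r.table('unique_applications'), index='appid_and_hidden')
           .pluck({'right': 'app_id'}).distinct())

query_b = (r.table('apps_per_agent')
           .get_all(['installed', 'default'], index='status_and_customer')
           .eq_join('app_id', r.table('unique_applications'))
           .filter({'right': {'hidden': 'no'}})
           .pluck({'right': 'app_id'}).distinct())

# A, B, A, B: the second pair of runs should be cache-warm.
for label, query in [('A', query_a), ('B', query_b), ('A', query_a), ('B', query_b)]:
    timed(label, query)
```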

@danielmewes
Member

@coffeemug Both @jdoliner and I could reproduce this on the data that @linuxdynasty provided to us. It doesn't seem to be an issue of the cache warming up, and the performance difference is consistent regardless of the order.

Current status: somehow the profiler has stopped working since I last used it a week ago. I'm trying to get it going again...

@danielmewes
Member

Preliminary results after profiling: In the first query, we seem to spend significant time on constructing in-memory secondary index data structures over and over.

More specifically, rdb_read_visitor_t::operator()(const rget_read_t &) in protocol.cc, line 1127 appears to be relatively expensive.
Inside this operator, most time is spent in the call to acquire_sindex_superblock_for_read() and in deserializing the ql::wire_func_t for the sindex function. Additional time is spent destructing ~reql_func_t() again and in rdb_rget_secondary_slice().

Apart from an analog of rdb_rget_secondary_slice(), we have none of this overhead when using the primary key in the eq_join.

The problem seems to be that we call this operator over and over, once for each document that we want to join. We should change the code to keep the in-memory structures around instead of deserializing them again for every single document in the eq_join.
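
To illustrate the shape of that fix, here is a toy Python sketch (purely conceptual; the real change would live in the C++ read path, and every name below is made up): build the secondary-index function once per read and reuse it for every joined document.

```python
class SindexCache(object):
    """Caches compiled secondary-index mapping functions by index name."""
    def __init__(self, load_sindex_func):
        self._load = load_sindex_func  # expensive: disk read + deserialize + compile
        self._funcs = {}

    def mapping(self, sindex_name):
        if sindex_name not in self._funcs:
            self._funcs[sindex_name] = self._load(sindex_name)
        return self._funcs[sindex_name]

def eq_join_by_sindex(left_rows, sindex_name, lookup, cache):
    """One cache.mapping() call per join, rather than one per row."""
    fn = cache.mapping(sindex_name)
    for row in left_rows:
        for right in lookup(fn, row):
            yield {'left': row, 'right': right}
```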

@jdoliner
Contributor

Interesting, the optimization you suggest is definitely a good one and would fix this problem. How much more time is spent in the deserialization functions in the slow case? It seems weird to me that a slightly more complicated function makes any noticeable difference. Also, why on earth is the destructor that much more expensive?

@danielmewes
Member

I'm not sure I was clear in my description. The reql_func_t destructor and the deserialization I was talking about are both inside rdb_read_visitor_t::operator()(const rget_read_t &). I assume this is the deserialization/destruction of the sindex_mapping, which is read from disk, not the function given to eq_join in the query.

PS: To answer your question: this deserialization does not happen at all in the fast case.

@mlucy
Member

mlucy commented Oct 10, 2013

Deserializing and compiling ReQL functions is expensive. We could probably make it better, but I think for most queries you only have to do it a small number of times. If we're currently compiling one function per row, we should almost certainly change that.

@danielmewes
Member

If somebody wants to take a closer look:
kcachegrind /home/daniel/1526_db/calltree-1.out on newton for the slow query variant
kcachegrind /home/daniel/1526_db/calltree-2.out on newton for the fast one

@danielmewes
Member

@jdoliner: A general question: When we do a get_all on a secondary index, do we have to go to all hash shards? Just wondering if that might also contribute to variant 1's relative slowness.

@jdoliner
Contributor

@danielmewes Yes, that's a good point, actually.

@mlucy
Member

mlucy commented Oct 10, 2013

Actually, another thought: if we made get_all variadic, we could change eq_join to send batches rather than one read query per row, which might help with optimization.

@coffeemug
Contributor

@mlucy -- why would get_all have to be variadic for that to happen?

@danielmewes
Member

@mlucy: Funny, I just had the same thought. :-)
I think that would be the best solution. It would also improve the efficiency of eq_join on primary keys.
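
At the query level the batching idea can be approximated by hand, which also hints at what the server-side change would buy. This is only a rough sketch: it is not semantically identical to eq_join (it drops the left-hand documents and collapses duplicates), and the field and index names are the ones from the issue description.

```python
import rethinkdb as r

conn = r.connect('localhost', 28015)

# Gather the left-hand join keys first...
app_ids = set(r.table('apps_per_agent')
              .get_all(['installed', 'default'], index='status_and_customer')
              .get_field('app_id').run(conn))

# ...then resolve them with a single variadic get_all instead of one
# index lookup per row. Each compound key pairs an app_id with 'no'.
keys = [[app_id, 'no'] for app_id in app_ids]
visible = list(r.table('unique_applications')
               .get_all(*keys, index='appid_and_hidden')
               .pluck('app_id').run(conn))
```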

@mlucy
Member

mlucy commented Oct 10, 2013

@coffeemug -- eq_join is a rewrite rule that uses get_all.

@coffeemug
Contributor

Ah, I see.

@mlucy
Member

mlucy commented Oct 10, 2013

We should also do batching for the other joins at the same time.

@mlucy
Member

mlucy commented Oct 10, 2013

Wait, that made no sense at all. Ignore me.

@jdoliner
Contributor

... isn't get_all already variadic?

@mlucy
Member

mlucy commented Oct 11, 2013

Apparently I'm just grossly misinformed.

@mlucy
Member

mlucy commented Oct 11, 2013

@danielmewes, could I take this one?

@danielmewes
Member

@mlucy sure, go ahead!

@ghost ghost assigned mlucy Oct 11, 2013
@coffeemug
Contributor

@mlucy -- is this related to #1328? (I.e. just rewriting to get_all doesn't get you much unless get_all does things concurrently.)

@mlucy
Member

mlucy commented Oct 11, 2013

@coffeemug -- the goal is to send batches, not for parallelism but to avoid incurring per-query overhead (like compiling secondary index functions) for every row.

@mlucy
Member

mlucy commented Oct 12, 2013

So, I'm not sure that batches work well here. We use range gets for get_all with an index because there can be arbitrarily many documents with a given index value, so we'd have to send batches of ranges, which smells really bad to me.

@mlucy
Member

mlucy commented Oct 15, 2013

The workaround involving batches would require some thought, and might just end up being terrible. If anyone else wants to take this before I get to it, feel free.

@mlucy
Member

mlucy commented Oct 17, 2013

Actually, I just had another thought: if we made it so that we only load and compile the function the first time we see a truncated key, then the vast majority of these queries would never need to do it. (We only need the function to disambiguate truncated keys.)
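
A toy sketch of that lazy-loading idea, in plain Python for illustration only (the real change would be in the C++ secondary-index read path, and every name here is hypothetical):

```python
class LazySindexFunc(object):
    """Defers the expensive deserialize-and-compile step until first use."""
    def __init__(self, load):
        self._load = load  # expensive: read from disk, deserialize, compile
        self._fn = None

    def get(self):
        if self._fn is None:
            self._fn = self._load()
        return self._fn

def resolve_keys(entries, lazy_fn, is_truncated, recompute_key):
    """Non-truncated keys pass straight through; only a truncated key forces
    the index function to be loaded, and it is loaded at most once."""
    for key, doc in entries:
        if is_truncated(key):
            key = recompute_key(lazy_fn.get(), doc)
        yield key, doc
```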

@danielmewes
Member

I guess that might be a very easy change which would solve about half of the problem.

Note that acquire_sindex_superblock_for_read() also takes significant CPU time. The profiling suggested that acquire_sindex_superblock_for_read() and compiling+destructing the sindex function are each responsible for roughly 30-40% of the CPU time spent in the index lookup, so fixing either one should save ~30%.

@danielmewes
Member

(Actually, I'm not sure about it being a "very easy" change; ignore that. I have no idea how difficult or easy it would actually be.)

@coffeemug coffeemug modified the milestones: backlog, subsequent Mar 26, 2014
@coffeemug coffeemug removed the tp:lts label Mar 26, 2014
@StyxOfDynamite

hi

@danielmewes
Member

Working on this.

@danielmewes danielmewes assigned danielmewes and unassigned mlucy Aug 19, 2015
@danielmewes
Member

An optimization for caching secondary index functions is in CR 3171.
I'm going to look into batching multi-key gets next, because it has shown up as a bottleneck in the context of plain getAll as well.

@danielmewes danielmewes modified the milestones: 2.2, subsequent Aug 19, 2015
@danielmewes
Member

After taking a brief look, I think I might need @mlucy's help on this one...

@danielmewes danielmewes assigned mlucy and unassigned danielmewes Aug 26, 2015
@danielmewes
Member

Assigning @mlucy for the multi-key get_all optimization.

@danielmewes danielmewes modified the milestones: 2.1.x, 2.2 Aug 31, 2015
@VeXocide VeXocide assigned VeXocide and unassigned mlucy Sep 1, 2015
@VeXocide
Member

VeXocide commented Sep 1, 2015

I'll be taking over the multi-key get_all optimisation.

@mlucy
Member

mlucy commented Sep 23, 2015

The get_all optimization is in CR 3240 by @danielmewes.

@mlucy
Member

mlucy commented Sep 23, 2015

(It turned out that what we were doing in @VeXocide's branch wasn't actually sufficient for a few edge cases (truncated keys, and multiple copies of the same datum in the get_all), and the changes were significant enough that I think they warrant a new review.)

@mlucy
Member

mlucy commented Sep 23, 2015

It turns out this was branched off of next, so once the review's complete there's still a merge step, but it should be pretty trivial.

@mlucy
Member

mlucy commented Oct 1, 2015

This is in next and 2.1.x, and will ship with RethinkDB 2.1.5.

@mlucy mlucy closed this as completed Oct 1, 2015
@danielmewes danielmewes modified the milestones: 2.1.x, 2.1.5 Oct 1, 2015