New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor aggregations to use lucene5-style collectors. #9544
Conversation
Aggregators now return a new collector instance per segment, like Lucene 5 does with its oal.search.Collector API. This is important for us because things like knowing whether the field is single or multi-valued is only known at a segment level. In order to do that I had to change aggregators to notify their sub aggregators of new incoming segments (pretty much in the spirit of elastic#6477) while everything used to be centralized in the AggregationContext class. While this might slow down a bit deeply nested aggregation trees, this also makes the children aggregation and the `breadth_first` collection mode much better options since they can now only replay what they need while they used to have to replay the whole aggregation tree. I also took advantage of this big refactoring to remove some abstractions that were not really required like ValuesSource.MetaData or BucketAnalysisCollector. I also splitted Aggregator into Aggregator and AggregatorBase in order to separate the Aggregator API from implementation helpers.
bcff0d2
to
0652cfd
Compare
* An Aggregator. | ||
*/ | ||
// IMPORTANT: DO NOT add methods to this class unless strictly required. | ||
// On the other hand, if you can remove methods from it, you are highly welcome! | ||
public abstract class Aggregator extends BucketCollector implements Releasable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the exception of the descendsFromBucketAggregator
method, which could probably be moved to AggregatorBase as it's an implementation detail and not integral to the idea of an Aggregator, all the methods in this class are abstract. Should we make Aggregator an interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. FYI I think descendsFromBucketAggregator is ok since it is static.
@jpountz LGTM, but is it worth doing a benchmark test to determine how much performance is affected before and after this change for multi-bucket aggs (e.g. terms agg) with 1, 2, and 3 levels of nested multi-bucket aggs? |
@colings86 I built a very simple benchmark: 5 shards, 2M docs and 3 fields: one string (s, cardinality=1000), two longs (l1, cardinality=100 and l2, cardinality=10000).
As far as I am concerned, I think it is good enough? I would like to emphasize that this change is important is we want to be able to sometimes use deferred collection by default as this collection mode does not need to move the whole context anymore in order to replay. |
@jpountz yeah, I think its good enough too. I agree that this change is important I just wanted to ensure that the performance hit wasn't too bad and that we at least have an idea on what the performance hit might be. I'm +1 for merging into master |
The refactoring in elastic#9544 introduced a regression that broke multi-level aggregations using breadth-first. This was due to sub-aggregators creating deferred collectors before their parent aggregator and then the parent aggregator trying to collect sub aggregators directly instead of going through the deferred wrapper. This commit fixes the issue but we should try to simplify all the pre/post collection logic that we have. Also `breadth_first` is now automatically ignored if the sub aggregators need scores (just like we ignore `execution_mode` when the value does not make sense like using ordinals on a script). Close elastic#9823
The refactoring in elastic#9544 introduced a regression that broke multi-level aggregations using breadth-first. This was due to sub-aggregators creating deferred collectors before their parent aggregator and then the parent aggregator trying to collect sub aggregators directly instead of going through the deferred wrapper. This commit fixes the issue but we should try to simplify all the pre/post collection logic that we have. Also `breadth_first` is now automatically ignored if the sub aggregators need scores (just like we ignore `execution_mode` when the value does not make sense like using ordinals on a script). Close elastic#9823
The refactoring in elastic#9544 introduced a regression that broke multi-level aggregations using breadth-first. This was due to sub-aggregators creating deferred collectors before their parent aggregator and then the parent aggregator trying to collect sub aggregators directly instead of going through the deferred wrapper. This commit fixes the issue but we should try to simplify all the pre/post collection logic that we have. Also `breadth_first` is now automatically ignored if the sub aggregators need scores (just like we ignore `execution_mode` when the value does not make sense like using ordinals on a script). Close elastic#9823
Aggregators now return a new collector instance per segment, like Lucene 5 does
with its oal.search.Collector API. This is important for us because things like
knowing whether the field is single or multi-valued is only known at a segment
level.
In order to do that I had to change aggregators to notify their sub aggregators
of new incoming segments (pretty much in the spirit of #6477) while everything
used to be centralized in the AggregationContext class. While this might slow
down a bit deeply nested aggregation trees, this also makes the children
aggregation and the
breadth_first
collection mode much better options sincethey can now only replay what they need while they used to have to replay the
whole aggregation tree.
I also took advantage of this big refactoring to remove some abstractions that
were not really required like ValuesSource.MetaData or BucketAnalysisCollector.
I also splitted Aggregator into Aggregator and AggregatorBase in order to
separate the Aggregator API from implementation helpers.
Closes #9098