Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor aggregations to use lucene5-style collectors. #9544

Closed
wants to merge 1 commit into from

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Feb 3, 2015

Aggregators now return a new collector instance per segment, like Lucene 5 does
with its oal.search.Collector API. This is important for us because things like
knowing whether the field is single or multi-valued is only known at a segment
level.

In order to do that I had to change aggregators to notify their sub aggregators
of new incoming segments (pretty much in the spirit of #6477) while everything
used to be centralized in the AggregationContext class. While this might slow
down a bit deeply nested aggregation trees, this also makes the children
aggregation and the breadth_first collection mode much better options since
they can now only replay what they need while they used to have to replay the
whole aggregation tree.

I also took advantage of this big refactoring to remove some abstractions that
were not really required like ValuesSource.MetaData or BucketAnalysisCollector.
I also splitted Aggregator into Aggregator and AggregatorBase in order to
separate the Aggregator API from implementation helpers.

Closes #9098

Aggregators now return a new collector instance per segment, like Lucene 5 does
with its oal.search.Collector API. This is important for us because things like
knowing whether the field is single or multi-valued is only known at a segment
level.

In order to do that I had to change aggregators to notify their sub aggregators
of new incoming segments (pretty much in the spirit of elastic#6477) while everything
used to be centralized in the AggregationContext class. While this might slow
down a bit deeply nested aggregation trees, this also makes the children
aggregation and the `breadth_first` collection mode much better options since
they can now only replay what they need while they used to have to replay the
whole aggregation tree.

I also took advantage of this big refactoring to remove some abstractions that
were not really required like ValuesSource.MetaData or BucketAnalysisCollector.
I also splitted Aggregator into Aggregator and AggregatorBase in order to
separate the Aggregator API from implementation helpers.
* An Aggregator.
*/
// IMPORTANT: DO NOT add methods to this class unless strictly required.
// On the other hand, if you can remove methods from it, you are highly welcome!
public abstract class Aggregator extends BucketCollector implements Releasable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the exception of the descendsFromBucketAggregator method, which could probably be moved to AggregatorBase as it's an implementation detail and not integral to the idea of an Aggregator, all the methods in this class are abstract. Should we make Aggregator an interface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. FYI I think descendsFromBucketAggregator is ok since it is static.

@colings86
Copy link
Contributor

@jpountz LGTM, but is it worth doing a benchmark test to determine how much performance is affected before and after this change for multi-bucket aggs (e.g. terms agg) with 1, 2, and 3 levels of nested multi-bucket aggs?

@jpountz
Copy link
Contributor Author

jpountz commented Feb 12, 2015

@colings86 I built a very simple benchmark: 5 shards, 2M docs and 3 fields: one string (s, cardinality=1000), two longs (l1, cardinality=100 and l2, cardinality=10000).

Terms agg Master PR
s 672 473
l1 808 783
l2 1132 1046
s>l1 1945 2011
s>l1>l2 15865 17769
s(deferred)>l1>l2 1272 1208
s>l1(deferred)>l2 3150 3095

As far as I am concerned, I think it is good enough? I would like to emphasize that this change is important is we want to be able to sometimes use deferred collection by default as this collection mode does not need to move the whole context anymore in order to replay.

@colings86
Copy link
Contributor

@jpountz yeah, I think its good enough too. I agree that this change is important I just wanted to ensure that the performance hit wasn't too bad and that we at least have an idea on what the performance hit might be.

I'm +1 for merging into master

@jpountz jpountz closed this in de41981 Feb 12, 2015
@jpountz jpountz removed the review label Feb 12, 2015
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Apr 3, 2015
The refactoring in elastic#9544 introduced a regression that broke multi-level
aggregations using breadth-first. This was due to sub-aggregators creating
deferred collectors before their parent aggregator and then the parent
aggregator trying to collect sub aggregators directly instead of going through
the deferred wrapper.

This commit fixes the issue but we should try to simplify all the pre/post
collection logic that we have.

Also `breadth_first` is now automatically ignored if the sub aggregators need
scores (just like we ignore `execution_mode` when the value does not make sense
like using ordinals on a script).

Close elastic#9823
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Apr 9, 2015
The refactoring in elastic#9544 introduced a regression that broke multi-level
aggregations using breadth-first. This was due to sub-aggregators creating
deferred collectors before their parent aggregator and then the parent
aggregator trying to collect sub aggregators directly instead of going through
the deferred wrapper.

This commit fixes the issue but we should try to simplify all the pre/post
collection logic that we have.

Also `breadth_first` is now automatically ignored if the sub aggregators need
scores (just like we ignore `execution_mode` when the value does not make sense
like using ordinals on a script).

Close elastic#9823
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Apr 9, 2015
The refactoring in elastic#9544 introduced a regression that broke multi-level
aggregations using breadth-first. This was due to sub-aggregators creating
deferred collectors before their parent aggregator and then the parent
aggregator trying to collect sub aggregators directly instead of going through
the deferred wrapper.

This commit fixes the issue but we should try to simplify all the pre/post
collection logic that we have.

Also `breadth_first` is now automatically ignored if the sub aggregators need
scores (just like we ignore `execution_mode` when the value does not make sense
like using ordinals on a script).

Close elastic#9823
@clintongormley clintongormley changed the title Aggs: Refactor aggregations to use lucene5-style collectors. Refactor aggregations to use lucene5-style collectors. Jun 6, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Aggregations: delegation of setNextReader calls
2 participants