-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Description
Is your feature request related to a problem? Please describe
Dynamic pruning algorithms work by dynamically adding negating filters into the query as disjunctions which are non-competitive( or found to be no more competitive while query execution) to prune the search space.
One of the utility is cardinality aggregations. Instead of using negative filter in lucene, if the field is a low cardinality field (to avoid explosion of disjunctions), then the query can be rewritten as a disjunction query of all the unique terms of the field. As the matching documents are evaluated, if the field value is a unique encountered so far, then it can be safely removed from the disjunction query. This ensures that the documents aren't processed twice for a unique value of a field on which cardinality aggregation is run. Also, the query will be early terminated if all disjunctive filters are exhausted.
Describe the solution you'd like
This logic can easily be embedded into Query.rewrite() method when the cardinality aggregation is the only aggregation and field is low cardinality field. Field cardinality upper bound for a query can be estimated either from FieldReader size() or SortedSetDocValues getValueCount().
We need run some benchmarks in order to know when to start enabling this optimization as it may not be very helpful in smaller corpus. I propose running it against noaa workload - https://github.com/opensearch-project/opensearch-benchmark-workloads/blob/bdbd4bbd74fbf319398de1ca169f16744821bcde/noaa/operations/default.json#L765
Related component
Search:Aggregations
Describe alternatives you've considered
No response
Additional context
No response
Metadata
Metadata
Assignees
Labels
Type
Projects
Status
Status
Status