Proposal for nested document support in Lucene

Mark Harwood

Nested Documents in Lucene
High-performance support for parent/child document relations
mark@searcharea.co.uk

Problem: The Lucene data model is based on Documents, Fields and Terms. However, many real-world data structures cannot be properly represented when collapsed into a single Lucene document.
[Diagram: single Lucene document]

Problem: “Cross-matching”
When two or more data structures of the same type are jumbled up into a single Lucene field, matching logic becomes confused, e.g. more than one qualification in a resume.

Source data: John; A1 in Maths; E1 in Science
Flattened into one document:
  Name: John
  Grade: A1, E1
  Subject: Maths, Science
! False match for query: Grade:A1 AND Subject:Science
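The false match can be reproduced without Lucene at all. The following is a minimal sketch in plain Java, with hypothetical names (CrossMatchSketch, flattenedMatch, nestedMatch) that are not part of any Lucene API. It models a resume as a list of (grade, subject) pairs and compares field-by-field matching on the flattened form with matching inside a single qualification.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a resume is modelled as a list of {grade, subject} pairs.
final class CrossMatchSketch {

    /** Flattened model: all grades go into one multi-valued field, all subjects into another. */
    static boolean flattenedMatch(List<String[]> qualifications, String grade, String subject) {
        Set<String> grades = new HashSet<>();
        Set<String> subjects = new HashSet<>();
        for (String[] q : qualifications) {
            grades.add(q[0]);
            subjects.add(q[1]);
        }
        // Grade:X AND Subject:Y checks each field independently,
        // so the pairing between grade and subject is lost.
        return grades.contains(grade) && subjects.contains(subject);
    }

    /** Nested model: grade and subject must occur in the same qualification. */
    static boolean nestedMatch(List<String[]> qualifications, String grade, String subject) {
        for (String[] q : qualifications) {
            if (q[0].equals(grade) && q[1].equals(subject)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String[]> john = Arrays.asList(
            new String[] {"A1", "Maths"},
            new String[] {"E1", "Science"});
        System.out.println(flattenedMatch(john, "A1", "Science")); // true (the false match)
        System.out.println(nestedMatch(john, "A1", "Science"));    // false
    }
}
```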

Unacceptable solution #1
One modeling approach is to store related items in the same field and use proximity operators in queries.

Source data: John; A1 in Maths; E1 in Science
Combined into one document:
  Name: John
  GradeAndSubject: A1 Maths….E1 Science
Example query: GradeAndSubject:"A1 Science"~2
! Slow
! Not scalable with number of fields

Only one choice of Analyzer for a given field

Solution: Nested Document Queries
Nested documents need to be queried using a new NestedDocumentQuery class, which understands document relationships.

  Parent document:  Name: John, docType: resume
  Child document:   Grade: A1, Subject: Maths
  Child document:   Grade: E1, Subject: Science

New NestedDocumentQuery:
- Reports any matches as a match on the parent document, not the child
- Super-fast evaluation of joins between child and parent
- Requires an indexed field to identify parent documents?

Solution: Example Query
Find the resume of a person called “John” with an A1 grade in Maths.

  Parent document:  Name: John, docType: resume
  Child document:   Grade: A1, Subject: Maths
  Child document:   Grade: E1, Subject: Science

The NestedDocumentQuery wrapper simply translates the stream of reported matches from the child-level query criteria into matches on the parent, for evaluation of all the parent-level logic.
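The translation step can be sketched in plain Java. This is not the proposal's actual code. It assumes each parent document is stored immediately before its children, models every match set as a java.util.BitSet of document IDs, and uses hypothetical names (NestedQuerySketch, childMatchesToParents, search).

```java
import java.util.BitSet;

// Illustrative only: match sets are BitSets of document IDs,
// and each parent is assumed to be indexed just before its children.
final class NestedQuerySketch {

    /** Reports each child match as a match on its parent (nearest preceding parent bit). */
    static BitSet childMatchesToParents(BitSet childMatches, BitSet parents) {
        BitSet result = new BitSet();
        for (int doc = childMatches.nextSetBit(0); doc >= 0; doc = childMatches.nextSetBit(doc + 1)) {
            int parent = parents.previousSetBit(doc);
            if (parent >= 0) {
                result.set(parent);
            }
        }
        return result;
    }

    /** Combines the translated child matches with the parent-level criteria (logical AND). */
    static BitSet search(BitSet parentLevelMatches, BitSet childMatches, BitSet parents) {
        BitSet hits = childMatchesToParents(childMatches, parents);
        hits.and(parentLevelMatches);
        return hits;
    }

    public static void main(String[] args) {
        // Docs: 0 = John's resume, 1 = A1/Maths, 2 = E1/Science, 3 = another resume, 4 = A1/Maths
        BitSet parents = new BitSet();
        parents.set(0);
        parents.set(3);
        BitSet nameIsJohn = new BitSet();
        nameIsJohn.set(0);
        BitSet a1InMaths = new BitSet();
        a1InMaths.set(1);
        a1InMaths.set(4);
        System.out.println(search(nameIsJohn, a1InMaths, parents)); // {0}
    }
}
```

Because the child criteria are evaluated against one child document at a time, a query for A1 in Science finds no child and therefore no parent, which avoids the cross-matching problem.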

Solution: Join speed
Unlike in a database, a join (child to parent) is blisteringly fast:

1) Match reported on document #356,675
2) Index directly into cached BitSet at position #356,675
3) Find first prior set bit, e.g. position #356,670
4) Attribute match to doc #356,670

Example parent BitSet: 100000100000000100000001000000010000001000010000000001000000100000100001
[Diagram labels: ParentQuery, NestedDocumentQuery, ChildQuery]

The BitSet for defining parents is obtained from a Filter and can be cached aggressively with minimal memory cost (one bit per document in the index).
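The four steps map closely onto java.util.BitSet.previousSetBit. The sketch below uses that standard-library class as a stand-in for whatever bitset the proposal's code uses. The class and method names (ParentLookupSketch, parentOf, bitSetBytes) and the second parent at #356,680 are assumptions made for illustration.

```java
import java.util.BitSet;

// Illustrative only: java.util.BitSet stands in for the cached parent bitset.
final class ParentLookupSketch {

    /** Returns the doc ID of the nearest parent at or before childDoc, or -1 if there is none. */
    static int parentOf(BitSet parents, int childDoc) {
        return parents.previousSetBit(childDoc);
    }

    /** Approximate bitset size in bytes at one bit per document in the index. */
    static long bitSetBytes(long maxDoc) {
        return (maxDoc + 7) / 8;
    }

    public static void main(String[] args) {
        BitSet parents = new BitSet();
        parents.set(356_670); // the parent from the slide
        parents.set(356_680); // a hypothetical next parent
        System.out.println(parentOf(parents, 356_675)); // 356670
        System.out.println(bitSetBytes(10_000_000L));   // 1250000
    }
}
```

At one bit per document, an index of ten million documents needs about 1.25 MB for the parent bitset, which is why it can be cached aggressively.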

Other advantages
- Parent-child document relationships can also be used to limit child results from any one parent (e.g. efficiently control the max number of pages returned from any one website)
- Nesting levels can be arbitrarily deep
- Very powerful multi-child queries are possible, e.g. find people likely to know person X using resumes' employment histories (multiple employer names/URLs and related date ranges)
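The first advantage, capping child results per parent, can be sketched as follows. This is one possible reading rather than the proposal's implementation. It assumes child hits arrive in rank order, that each parent precedes its children in document order, and that a parent BitSet is available. All names (PerParentLimitSketch, limitPerParent) are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: keeps at most maxPerParent child hits for any one parent.
final class PerParentLimitSketch {

    static List<Integer> limitPerParent(List<Integer> rankedChildHits, BitSet parents, int maxPerParent) {
        Map<Integer, Integer> hitsPerParent = new HashMap<>();
        List<Integer> kept = new ArrayList<>();
        for (int child : rankedChildHits) {
            // Same join as before: the nearest preceding set bit is the parent.
            int parent = parents.previousSetBit(child);
            int count = hitsPerParent.merge(parent, 1, Integer::sum);
            if (count <= maxPerParent) {
                kept.add(child); // rank order is preserved
            }
        }
        return kept;
    }
}
```

For example, with parents (websites) at doc IDs 0 and 10 and ranked page hits 3, 1, 12, 2, 11, 13, a limit of two per parent keeps 3, 1, 12, 11.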

A further drawback of the proximity-operator approach (unacceptable solution #1): proximity distances must grow.

“Lucene is not a database”, but…

Structure matters
- Many data sources are a mix of structured and unstructured content (e.g. microformats). This is unlikely to change.
- Lucene has historically been about unstructured text but has steadily been adding structured capability (Trie, spatial, facets) and has become a great solution for hybrid data.
- However, support for modeling and querying non-trivial data structures is currently missing.

Relationships matter
- This proposal is not to recreate the full capabilities of a SQL database with arbitrary relationships.
- However, we can benefit greatly from providing simple parent-child relationships.

We have some unique capabilities
- Parent-child joins are very fast
- Unlike SQL, we can return partial, relevance-ranked matches
- Probably more akin to XML databases than SQL databases

Next steps
- Existing code/unit tests can be released to the Lucene project if there is sufficient interest. This software has been deployed in production on large datasets.
- The matching approach is reliant on parents and children being held in the same Lucene index segment. Additional control is needed to enforce this more rigorously, either by:
  - adding more user control over IndexWriter segment creation, where applications understand/control parent-child dependencies, OR
  - making Lucene aware of parent-child relationships, e.g. a new method Document.add(Document)
- Query parser support
  - XML Query Parser support is available
  - End-user query parser could add new syntax, e.g. +candidateLocale:UK +child(grade:A1 AND subject:music)
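The same-segment requirement amounts to keeping each parent and its children in one contiguous run of document IDs. The sketch below is a toy in-memory model of that idea, not Lucene code and not the proposed Document.add(Document) method. The class BlockIndexingSketch and its methods are hypothetical, and the parent-first ordering is an assumption drawn from the "find first prior set bit" step of the join.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative only: a "segment" is a list whose positions are the doc IDs.
final class BlockIndexingSketch {
    private final List<String> docs = new ArrayList<>();
    private final BitSet parents = new BitSet();

    /** Adds a parent followed immediately by its children, so the block is never split. */
    int addBlock(String parent, List<String> children) {
        int parentId = docs.size();
        docs.add(parent);
        parents.set(parentId);   // mark the parent's position in the bitset
        docs.addAll(children);   // children take the IDs right after the parent
        return parentId;
    }

    /** Returns a copy of the parent bitset (one bit per document). */
    BitSet parents() {
        return (BitSet) parents.clone();
    }

    int maxDoc() {
        return docs.size();
    }
}
```

Adding a resume with two qualifications and then a resume with one yields parent IDs 0 and 3, and the child at ID 4 resolves to parent 3 through the nearest preceding set bit.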

Thoughts?
Feedback encouraged on dev@lucene.apache.org
