SlideShare a Scribd company logo
1 of 13
Nested Documents in Lucene High-performance support for parent/child document relations mark@searcharea.co.uk
Problem: The Lucene data model is based on Documents, Fields and Terms. However many real-world data structures cannot be properly represented when collapsed into a single Lucene document. Single Lucene document
Problem: “Cross-matching” When two or more data structures of the same type are jumbled up into a single Lucene field, matching logic becomes confused e.g. >1 qualification in a resume John Name John A1 in Maths A1, E1 Grade E1 in Science Subject Maths, Science ! False match for query: Grade:A1 AND Subject:Science
Unacceptable solution #1 One modeling approach is to store related items in the same field and use proximity operators in queries Name John A1 Maths….E1 Science GradeAndSubject John Example query:  “GradeAndSubject:”A1 Science”~2 A1 in Maths E1 in Science ! Slow ! Not scalable with number of fields  ,[object Object]
 Proximity distances must grow.
 Only one choice of Analyzer for given field ,[object Object],[object Object]
Solution: Nested Document Queries Nested documents need to be queried using new NestedDocumentQuery class which understands document relationships John Name A1 E1 Grade Grade docType resume Subject Maths Subject Science New NestedDocumentQuery ,[object Object]
 Reports any matches as a match on the parent document not the child
 Super-fast evaluation of joins between child and parent
 Requires an indexed field to identify parent documents?
Solution: Example Query Find resume of person called “John” with A1 grade in Maths John Name E1 A1 resume Grade docType Grade Subject Science Subject Maths The NestedDocumentQuery wrapper simply translates the stream of reported matches from the child-level query criteria into matches on the parent for evaluation of all the parent-level logic
Solution: Join speed Unlike a database, the cost of a join (child to parent) is blisteringly fast 3) Find first prior set bit e.g. position #356,670 100000100000000100000001000000010000001000010000000001000000100000100001 2) Index directly into cached BitSet at position #356,675 1) Match reported on document #356,675 ParentQuery 4) Attribute match to doc #356,670 NestedDocumentQuery ChildQuery The BitSet for defining parents is obtained from a Filter and can be cached aggressively with minimal memory cost (one bit per document in the index)
Other advantages Parent-child document relationships can also be used to limit child results from any one parent (e.g. efficiently control the max number of pages returned from any one website) Nesting levels can be arbitrarily deep  Very powerful multi-child queries possible e.g. find people likely to know person X using resume’s employment histories (multiple employer names/urls and related date-ranges)

More Related Content

What's hot

Solr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for YouSolr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for YouSematext Group, Inc.
 
Common issues with Apache Kafka® Producer
Common issues with Apache Kafka® ProducerCommon issues with Apache Kafka® Producer
Common issues with Apache Kafka® Producerconfluent
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryMark Kromer
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Storesconfluent
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache CalciteJulian Hyde
 
DevOps for Databricks
DevOps for DatabricksDevOps for Databricks
DevOps for DatabricksDatabricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureDatabricks
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVROairisData
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...Databricks
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Spark Summit
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrTrey Grainger
 
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...Edureka!
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 

What's hot (20)

Solr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for YouSolr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for You
 
Common issues with Apache Kafka® Producer
Common issues with Apache Kafka® ProducerCommon issues with Apache Kafka® Producer
Common issues with Apache Kafka® Producer
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data Factory
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Stores
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
DevOps for Databricks
DevOps for DatabricksDevOps for Databricks
DevOps for Databricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache Solr
 
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 

Viewers also liked

Grouping and Joining in Lucene/Solr
Grouping and Joining in Lucene/SolrGrouping and Joining in Lucene/Solr
Grouping and Joining in Lucene/Solrlucenerevolution
 
Approaching Join Index: Presented by Mikhail Khludnev, Grid Dynamics
Approaching Join Index: Presented by Mikhail Khludnev, Grid DynamicsApproaching Join Index: Presented by Mikhail Khludnev, Grid Dynamics
Approaching Join Index: Presented by Mikhail Khludnev, Grid DynamicsLucidworks
 
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...Lucidworks
 
Mark Harwood - Building Entity Centric Indexes - NoSQL matters Dublin 2015
Mark Harwood - Building Entity Centric Indexes - NoSQL matters Dublin 2015Mark Harwood - Building Entity Centric Indexes - NoSQL matters Dublin 2015
Mark Harwood - Building Entity Centric Indexes - NoSQL matters Dublin 2015NoSQLmatters
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relationJay Bharat
 
Patterns for large scale search
Patterns for large scale searchPatterns for large scale search
Patterns for large scale searchMark Harwood
 
Lucene with Bloom filtered segments
Lucene with Bloom filtered segmentsLucene with Bloom filtered segments
Lucene with Bloom filtered segmentsMark Harwood
 
Faceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid Dynamics
Faceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid DynamicsFaceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid Dynamics
Faceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid DynamicsLucidworks
 
Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?lucenerevolution
 
Understanding and visualizing solr explain information - Rafal Kuc
Understanding and visualizing solr explain information - Rafal KucUnderstanding and visualizing solr explain information - Rafal Kuc
Understanding and visualizing solr explain information - Rafal Kuclucenerevolution
 
Working with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrWorking with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrAnshum Gupta
 
An Introduction to Basics of Search and Relevancy with Apache Solr
An Introduction to Basics of Search and Relevancy with Apache SolrAn Introduction to Basics of Search and Relevancy with Apache Solr
An Introduction to Basics of Search and Relevancy with Apache SolrLucidworks (Archived)
 

Viewers also liked (14)

Grouping and Joining in Lucene/Solr
Grouping and Joining in Lucene/SolrGrouping and Joining in Lucene/Solr
Grouping and Joining in Lucene/Solr
 
Approaching Join Index: Presented by Mikhail Khludnev, Grid Dynamics
Approaching Join Index: Presented by Mikhail Khludnev, Grid DynamicsApproaching Join Index: Presented by Mikhail Khludnev, Grid Dynamics
Approaching Join Index: Presented by Mikhail Khludnev, Grid Dynamics
 
Lucene KV-Store
Lucene KV-StoreLucene KV-Store
Lucene KV-Store
 
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
 
Mark Harwood - Building Entity Centric Indexes - NoSQL matters Dublin 2015
Mark Harwood - Building Entity Centric Indexes - NoSQL matters Dublin 2015Mark Harwood - Building Entity Centric Indexes - NoSQL matters Dublin 2015
Mark Harwood - Building Entity Centric Indexes - NoSQL matters Dublin 2015
 
MaFI Meeting 2016 (slides)
MaFI Meeting 2016 (slides)MaFI Meeting 2016 (slides)
MaFI Meeting 2016 (slides)
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relation
 
Patterns for large scale search
Patterns for large scale searchPatterns for large scale search
Patterns for large scale search
 
Lucene with Bloom filtered segments
Lucene with Bloom filtered segmentsLucene with Bloom filtered segments
Lucene with Bloom filtered segments
 
Faceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid Dynamics
Faceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid DynamicsFaceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid Dynamics
Faceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid Dynamics
 
Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?
 
Understanding and visualizing solr explain information - Rafal Kuc
Understanding and visualizing solr explain information - Rafal KucUnderstanding and visualizing solr explain information - Rafal Kuc
Understanding and visualizing solr explain information - Rafal Kuc
 
Working with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrWorking with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache Solr
 
An Introduction to Basics of Search and Relevancy with Apache Solr
An Introduction to Basics of Search and Relevancy with Apache SolrAn Introduction to Basics of Search and Relevancy with Apache Solr
An Introduction to Basics of Search and Relevancy with Apache Solr
 

Similar to Proposal for nested document support in Lucene

11.0004www.iiste.org call for paper.on demand quality of web services using r...
11.0004www.iiste.org call for paper.on demand quality of web services using r...11.0004www.iiste.org call for paper.on demand quality of web services using r...
11.0004www.iiste.org call for paper.on demand quality of web services using r...Alexander Decker
 
4.on demand quality of web services using ranking by multi criteria 31-35
4.on demand quality of web services using ranking by multi criteria 31-354.on demand quality of web services using ranking by multi criteria 31-35
4.on demand quality of web services using ranking by multi criteria 31-35Alexander Decker
 
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...Editor IJCATR
 
HyperQA: A Framework for Complex Question-Answering
HyperQA: A Framework for Complex Question-AnsweringHyperQA: A Framework for Complex Question-Answering
HyperQA: A Framework for Complex Question-AnsweringJinho Choi
 
Data models and ro
Data models and roData models and ro
Data models and roDiana Diana
 
Entity linking with a knowledge base issues techniques and solutions
Entity linking with a knowledge base issues techniques and solutionsEntity linking with a knowledge base issues techniques and solutions
Entity linking with a knowledge base issues techniques and solutionsCloudTechnologies
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseEditor IJCATR
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseEditor IJCATR
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseEditor IJCATR
 
Equation 2.doc
Equation 2.docEquation 2.doc
Equation 2.docbutest
 
Contextual Ontology Alignment - ESWC 2011
Contextual Ontology Alignment - ESWC 2011Contextual Ontology Alignment - ESWC 2011
Contextual Ontology Alignment - ESWC 2011Mariana Damova, Ph.D
 
Semantic Relatedness of Web Resources by XESA - Philipp Scholl
Semantic Relatedness of Web Resources by XESA - Philipp SchollSemantic Relatedness of Web Resources by XESA - Philipp Scholl
Semantic Relatedness of Web Resources by XESA - Philipp SchollCROKODIl consortium
 
Introduction to Data Management Powerpoint
Introduction to Data Management PowerpointIntroduction to Data Management Powerpoint
Introduction to Data Management Powerpointichanismo
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationNinad Samel
 
Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...
Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...
Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...AI Publications
 
Automating Relational Database Schema Design for Very Large Semantic Datasets
Automating Relational Database Schema Design for Very Large Semantic DatasetsAutomating Relational Database Schema Design for Very Large Semantic Datasets
Automating Relational Database Schema Design for Very Large Semantic DatasetsThomas Lee
 

Similar to Proposal for nested document support in Lucene (20)

11.0004www.iiste.org call for paper.on demand quality of web services using r...
11.0004www.iiste.org call for paper.on demand quality of web services using r...11.0004www.iiste.org call for paper.on demand quality of web services using r...
11.0004www.iiste.org call for paper.on demand quality of web services using r...
 
4.on demand quality of web services using ranking by multi criteria 31-35
4.on demand quality of web services using ranking by multi criteria 31-354.on demand quality of web services using ranking by multi criteria 31-35
4.on demand quality of web services using ranking by multi criteria 31-35
 
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
 
The Duet model
The Duet modelThe Duet model
The Duet model
 
HyperQA: A Framework for Complex Question-Answering
HyperQA: A Framework for Complex Question-AnsweringHyperQA: A Framework for Complex Question-Answering
HyperQA: A Framework for Complex Question-Answering
 
EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?
EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?
EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?
 
Data models and ro
Data models and roData models and ro
Data models and ro
 
Entity linking with a knowledge base issues techniques and solutions
Entity linking with a knowledge base issues techniques and solutionsEntity linking with a knowledge base issues techniques and solutions
Entity linking with a knowledge base issues techniques and solutions
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented database
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented database
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented database
 
Equation 2.doc
Equation 2.docEquation 2.doc
Equation 2.doc
 
Contextual Ontology Alignment - ESWC 2011
Contextual Ontology Alignment - ESWC 2011Contextual Ontology Alignment - ESWC 2011
Contextual Ontology Alignment - ESWC 2011
 
Semantic Relatedness of Web Resources by XESA - Philipp Scholl
Semantic Relatedness of Web Resources by XESA - Philipp SchollSemantic Relatedness of Web Resources by XESA - Philipp Scholl
Semantic Relatedness of Web Resources by XESA - Philipp Scholl
 
B01DataMgt.ppt
B01DataMgt.pptB01DataMgt.ppt
B01DataMgt.ppt
 
Introduction to Data Management Powerpoint
Introduction to Data Management PowerpointIntroduction to Data Management Powerpoint
Introduction to Data Management Powerpoint
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorization
 
Cl4201593597
Cl4201593597Cl4201593597
Cl4201593597
 
Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...
Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...
Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...
 
Automating Relational Database Schema Design for Very Large Semantic Datasets
Automating Relational Database Schema Design for Very Large Semantic DatasetsAutomating Relational Database Schema Design for Very Large Semantic Datasets
Automating Relational Database Schema Design for Very Large Semantic Datasets
 

Recently uploaded

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Recently uploaded (20)

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

Proposal for nested document support in Lucene

  • 1. Nested Documents in Lucene High-performance support for parent/child document relations mark@searcharea.co.uk
  • 2. Problem: The Lucene data model is based on Documents, Fields and Terms. However many real-world data structures cannot be properly represented when collapsed into a single Lucene document. Single Lucene document
  • 3. Problem: “Cross-matching” When two or more data structures of the same type are jumbled up into a single Lucene field, matching logic becomes confused e.g. >1 qualification in a resume John Name John A1 in Maths A1, E1 Grade E1 in Science Subject Maths, Science ! False match for query: Grade:A1 AND Subject:Science
  • 4.
  • 6.
  • 7.
  • 8. Reports any matches as a match on the parent document not the child
  • 9. Super-fast evaluation of joins between child and parent
  • 10. Requires an indexed field to identify parent documents?
  • 11. Solution: Example Query Find resume of person called “John” with A1 grade in Maths John Name E1 A1 resume Grade docType Grade Subject Science Subject Maths The NestedDocumentQuery wrapper simply translates the stream of reported matches from the child-level query criteria into matches on the parent for evaluation of all the parent-level logic
  • 12. Solution: Join speed Unlike a database, the cost of a join (child to parent) is blisteringly fast 3) Find first prior set bit e.g. position #356,670 100000100000000100000001000000010000001000010000000001000000100000100001 2) Index directly into cached BitSet at position #356,675 1) Match reported on document #356,675 ParentQuery 4) Attribute match to doc #356,670 NestedDocumentQuery ChildQuery The BitSet for defining parents is obtained from a Filter and can be cached aggressively with minimal memory cost (one bit per document in the index)
  • 13. Other advantages Parent-child document relationships can also be used to limit child results from any one parent (e.g. efficiently control the max number of pages returned from any one website) Nesting levels can be arbitrarily deep Very powerful multi-child queries possible e.g. find people likely to know person X using resume’s employment histories (multiple employer names/urls and related date-ranges)
  • 14. “Lucene is not a database”, but….. Structure matters Many data sources are a mix of structured and unstructured content (e.g. microformats). This is unlikely to change. Lucene has historically been about unstructured text but has steadily been adding structured capability (Trie, spatial, facets) and become a great solution for hybrid data. However support for modeling and querying non-trivial data structures is missing currently. Relationships matter This proposal is not to recreate the full capabilities of a SQL database with arbitrary relationships. However we can benefit greatly from providing simple parent-child relationships We have some unique capabilities Parent-child joins are very fast Unlike SQL we can return partial, relevance-ranked matches Probably more akin to XML databases than SQL databases
  • 15. Next steps Existing code/unit tests can be released to Lucene project if there is sufficient interest. This software has been deployed in production on large datasets. The matching approach is reliant on parents and children being held in the same Lucene index segment. Additional control is needed to enforce this more rigorously - either by Adding more user-control over IndexWritersegment creation where applications understand/control parent-child dependencies OR Making Lucene aware of parent-child relationships e.g. new method Document.add(Document) Query parser support XML Query Parser support is available End-user Query parser could add new syntax e.g. +candidateLocale:UK +child(grade:A1 AND subject:music)
  • 16. Thoughts? Feedback encouraged on dev@lucene.apache.org