Lucene - Core / LUCENE-2229

SimpleSpanFragmenter fails to start a new fragment

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4.1, 5.5
    • Component/s: modules/highlighter
    • Labels: None

    Description

      SimpleSpanFragmenter fails to identify a new fragment when there is more than one stop word after a span is detected. This problem can be observed when the Query contains a PhraseQuery.

      The problem is that the fragmenter never leaves the span, so the fragment extends to the end of the TokenGroup. Once a span is detected, waitForPos = positionSpans.get(i).end + 1; is set. Because the removed stop words make the next position increment larger than 1, position += posIncAtt.getPositionIncrement(); produces a position greater than waitForPos, so (waitForPos == position) never matches and isNewFragment() keeps returning false.

      SimpleSpanFragmenter.java
        public boolean isNewFragment() {
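          // Advance by the token's position increment (greater than 1 after removed stop words).
          position += posIncAtt.getPositionIncrement();
      
          // Strict equality: if position has already jumped past waitForPos,
          // this never matches and every later call returns false.
          if (waitForPos == position) {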
            waitForPos = -1;
          } else if (waitForPos != -1) {
            return false;
          }
      
          WeightedSpanTerm wSpanTerm = queryScorer.getWeightedSpanTerm(termAtt.term());
      
          if (wSpanTerm != null) {
            List<PositionSpan> positionSpans = wSpanTerm.getPositionSpans();
      
            for (int i = 0; i < positionSpans.size(); i++) {
              if (positionSpans.get(i).start == position) {
                waitForPos = positionSpans.get(i).end + 1;
                break;
              }
            }
          }
         ...
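
The gate at the top of isNewFragment() can be modelled without Lucene to see why the fragmenter gets stuck. The following is a minimal sketch, not the library code: the class WaitForPosDemo, the method gateOpenings, the hard-coded increments, and the simplification that a span starts whenever position equals spanStart are all hypothetical. The relaxed `<=` comparison is shown only as one possible remedy; it is an assumption of this sketch and is not taken from the attached patches.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone model of the waitForPos gate; it uses no Lucene classes. */
class WaitForPosDemo {

  /**
   * Replays the gate over a stream of position increments and returns the
   * positions at which a pending wait was cleared.
   *
   * @param relaxed false uses the == check from the snippet above,
   *                true uses a hypothetical <= check instead
   */
  static List<Integer> gateOpenings(int[] increments, int spanStart, int spanEnd, boolean relaxed) {
    List<Integer> opened = new ArrayList<>();
    int position = -1;   // assumed starting value, so the first token lands on 0
    int waitForPos = -1; // -1 means "not waiting for a span to end"
    for (int inc : increments) {
      position += inc;
      boolean reached = relaxed ? waitForPos <= position : waitForPos == position;
      if (reached) {
        if (waitForPos != -1) {
          opened.add(position); // a pending wait ends at this token
        }
        waitForPos = -1;
      } else if (waitForPos != -1) {
        continue; // isNewFragment() would return false here
      }
      if (position == spanStart) {
        waitForPos = spanEnd + 1; // wait for the first position after the span
      }
    }
    return opened;
  }

  public static void main(String[] args) {
    int[] noGap = {1, 1, 1, 1, 1};   // positions 0, 1, 2, 3, 4
    int[] withGap = {1, 1, 1, 3, 1}; // positions 0, 1, 2, 5, 6 (position 3 is skipped)
    System.out.println(gateOpenings(noGap, 1, 2, false));   // [3]
    System.out.println(gateOpenings(withGap, 1, 2, false)); // []
    System.out.println(gateOpenings(withGap, 1, 2, true));  // [5]
  }
}
```

With the span at positions 1 to 2, the awaited position is 3. When the increments skip from 2 to 5, the strict check never clears the wait, while the relaxed check clears it at the first token past the span.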
      

      An example is provided in the test case below, using the following Document and the query "all tokens". In the Document, the phrase is followed by the stop words "of a".

      Document

      "Attribute instances are reused for all tokens of a document. Thus, a TokenStream/-Filter needs to update the appropriate Attribute(s) in incrementToken(). The consumer, commonly the Lucene indexer, consumes the data in the Attributes and then calls incrementToken() again until it retuns false, which indicates that the end of the stream was reached. This means that in each call of incrementToken() a TokenStream/-Filter can safely overwrite the data in the Attribute instances."
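
To make the position gap after "all tokens" concrete, the sketch below computes the position increment each surviving token would carry once stop words are dropped. It is a simplified stand-in for an analyzer, not Lucene code: the class StopWordIncrements, the method increments, the lowercase letter-only tokenization, and the four-word stop list are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy tokenizer that drops stop words and reports position increments. */
class StopWordIncrements {

  // Hypothetical stop list; the analyzer used in the test may use a different one.
  static final Set<String> STOP_WORDS = Set.of("are", "for", "of", "a");

  /** Returns "term:+increment" for every token that survives stop-word removal. */
  static List<String> increments(String text) {
    List<String> out = new ArrayList<>();
    int pending = 1; // increment the next kept token will carry
    for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
      if (term.isEmpty()) {
        continue; // split() can yield an empty leading element
      }
      if (STOP_WORDS.contains(term)) {
        pending++; // a removed token still consumes a position
      } else {
        out.add(term + ":+" + pending);
        pending = 1;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // prints [reused:+1, all:+2, tokens:+1, document:+3]
    System.out.println(increments("reused for all tokens of a document"));
  }
}
```

In this model "document" arrives with an increment of 3, so the position counter moves straight past the position that immediately follows "tokens".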

      HighlighterTest.java
        public void testSimpleSpanFragmenter() throws Exception {
      
          ...
      
          doSearching("\"all tokens\"");
      
          maxNumFragmentsRequired = 2;
          
          scorer = new QueryScorer(query, FIELD_NAME);
          highlighter = new Highlighter(this, scorer);
      
          for (int i = 0; i < hits.totalHits; i++) {
            String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
            TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));
      
            highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, 20));
      
            String result = highlighter.getBestFragments(tokenStream, text,
                maxNumFragmentsRequired, "...");
            System.out.println("\t" + result);
      
          }
        }
      
      Result

      are reused for <B>all</B> <B>tokens</B> of a document. Thus, a TokenStream/-Filter needs to update the appropriate Attribute(s) in incrementToken(). The consumer, commonly the Lucene indexer, consumes the data in the Attributes and then calls incrementToken() again until it retuns false, which indicates that the end of the stream was reached. This means that in each call of incrementToken() a TokenStream/-Filter can safely overwrite the data in the Attribute instances.

      Expected Result

      for <B>all</B> <B>tokens</B> of a document

      Attachments

        1. LUCENE-2229.patch
          3 kB
          Lukhnos Liu
        2. LUCENE-2229.patch
          4 kB
          Lukhnos Liu
        3. LUCENE-2229.patch
          3 kB
          Elmer Garduno

          People

            Assignee: David Smiley (dsmiley)
            Reporter: Elmer Garduno (elmer.garduno)
            Votes: 0
            Watchers: 3

              Time Tracking

                Original Estimate: 1h
                Remaining Estimate: 1h
                Time Spent: Not Specified