- 
                Notifications
    
You must be signed in to change notification settings  - Fork 1.2k
 
LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader #179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 
           That is really odd that the   | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is close! I left a bunch of small comments. The performance is odd but I don't think that should block this improvement.
        
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                ...e/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
          
            Show resolved
            Hide resolved
        
              
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
          
            Show resolved
            Hide resolved
        
              
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | /* | ||
| If ordinals[i] >= leafReaderMaxDoc then we find the next leaf that contains our ordinal | ||
| */ | ||
| if (values == null || ordinals[i] >= leafReaderMaxDoc) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, don't we need to compare ordinals[i] >= leafReaderMaxDoc + leafReaderDocBase instead?  I.e. the ordinal is in the global IndexReader docid space, but the leafReaderMaxDoc is just for this leaf.  How come tests are not failing due to this :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a big miss. Thanks for catching this.
Here is why the test was passing:
When the ordinals were all in the first leaf, the ordinal was < leafReaderMaxDoc and the wrong code worked correctly.
When the ordinals started spilling into the second leaf, the wrong code would recalculate the leaf each time and then  find the correct Label. Thus were not actually making use of the increasing values of the ordinals (and were recalculating the leafContext each time).
This would go on for all other leaves as well.
I'll rerun benchmarks
| 
           Hey @mikemccand , my apologies for leaving this PR hanging! The unexpected benchmark results threw me off.. I've updated the PR to fix the bug you had previously discovered and updated the PR to bring it on par with mainline. I re-ran benchmarks after fixing the  The gain in taxonomy based facets is about 2%. I ran it multiple times locally and it gave around 2% gain in those runs as well. SSDV facets still show a 3% gain which is unexplained.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the iteration @gautamworah96 -- we are getting closer!
        
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
          
            Show resolved
            Hide resolved
        
              
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
          
            Show resolved
            Hide resolved
        
              
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
          
            Show resolved
            Hide resolved
        
              
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                ...e/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
          
            Show resolved
            Hide resolved
        
              
          
                ...e/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                ...e/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
          
            Show resolved
            Hide resolved
        
      contention). Modified the test case to be super random (threads, iterations etc). Modified TaxonomyReader with a getBulkPath method that is overridden in DirTaxonomyReader. Improved documentation, improved method signatures to use the newer (...) style syntax, fixed the CHANGES entry
        
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
          
            Show resolved
            Hide resolved
        
              
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks super close! I left some small comments.
        
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyReader.java
          
            Show resolved
            Hide resolved
        
              
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                ...e/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                ...e/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                ...e/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
          
            Show resolved
            Hide resolved
        
              
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great -- I left tiny polish comments -- almost done! Thanks @gautamworah96!
        
          
                lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
          
            Show resolved
            Hide resolved
        
      | /* | ||
| If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null | ||
| */ | ||
| if (values == null) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's sort of weird to do this check per-segment, since we can know up front, based on indexed created version, whether it's stored fields or doc values? I.e. this is not a segment by segment decision. But let's do it in follow-on issue. I think this one is ready!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I'll create it after this PR is merged. Otherwise, it just feels weird to create an issue for a fix in a piece of code that has not been merged :D
        
          
                lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyCombined.java
          
            Show resolved
            Hide resolved
        
              
          
                ...ne/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gautamworah96 -- looks great! I'll try to merge soon! Then we should watch nightly Lucene benchmarks and see if facets got a bit faster.
| 
           Merged! @gautamworah96 I think we should backport this one to 8.x? It is just adding a faster API.  | 
    
Resolved Conflicts: gradle/validation/validate-source-patterns.gradle solr/core/src/java/org/apache/solr/core/CoreContainer.java
          
 This change relies on the older repo having the "LUCENE-9450 Switch taxonomy index to use BinaryDocValues" change. We had initially decided that we will not backport it to 8x because it was a major change to the index type/format. Thus this change also can't be backported :/  | 
    
          
 Ahh, yes, sorry! We cannot backport this change then. So this is 9.0 only optimization.  | 
    
Description
In LUCENE-9450 we switched the Taxonomy index from Stored Fields to BinaryDocValues. In the resulting implementation of the getPath code, we create a new BinaryDocValues's values instance for each ordinal.
It may happen that we may traverse over the same nodes over and over again if the getPath API is called multiple times for ordinals in the same segment/with the same readerIndex.
This PR takes advantage of that fact by sorting ordinals and then trying to find out if some of the ordinals are present in the same segment/have the same readerIndex (by trying to advanceExact to the correct position and not failing) thereby allowing us to reuse the previous BinaryDocValues object.
This PR was originally proposed in the older
lucene-solrrepo here. I discontinued it because we could not find a good "real" use case for it.This new PR uses the
getBulkPathAPI insideIntTaxonomyFacetCountsandFloatTaxonomyFacetCountsclasses. In thegetTopChildrenmethod within these classes, we usually collect all thetop-nordinals first and then translate them intoFacetLabelsone by one using an iterative API. This call has been replaced with thegetBulkPathmethod in this PR.Solution
Steps:
Tests
getPathresults from ordinals with the bulkFacetLabelsreturned by the getBulkPath APIBenchmarks:
The stock
luceneutiltaxonomy tasks callgetTopChildren(10, ..)for testing the performance of taxonomy facet counting classes. SinceFastTaxonomyFacetCountsinherits fromIntTaxonomyFacetCounts(and uses the samegetTopChildrenlogic) the user should be able to reap the benefits of the bulk API directly.luceneutilthat tests the top 10 children:Since, the previous results did not show any gain in the
Browse*TaxoFacetstasks, I realized that maybe it was because the ordinalCache with a stock size of 4000 had all ordinals cached which led to the bulk API behaving the same as the iterativegetPathcall.The next benchmark tested the
getTopChildrencall with 10,000 childrenluceneutilthat tests the top 10000 children:I surprisingly see some gain in
BrowseDayOfYearSSDVFacetsandBrowseMonthSSDVFacetstasks but I think that is noise (because taxonomy facets have a different impl. as compared toSortedSetDocValuesFacets).Edit: Did an A/A test and
BrowseDayOfYearSSDVFacetsdoes not show any gain. So theBrowseDayOfYearSSDVFacets4.1% perf improvement is unexplained..Browse*TaxoFacetstasks don't show gain.Checklist
Please review the following and check all that apply:
mainbranch../gradlew check.