Conversation

Contributor

@normen662 normen662 commented Oct 22, 2025

This PR implements the algorithm from the HNSW paper using the recently introduced linear package together with RaBitQ.

@normen662 normen662 force-pushed the hnsw-on-linear branch 4 times, most recently from 4cfef2a to 3a04055 on October 24, 2025 16:49
@normen662 normen662 requested a review from alecgrieser October 28, 2025 19:17
@normen662 normen662 added the enhancement label Oct 28, 2025
@normen662 normen662 force-pushed the hnsw-on-linear branch 3 times, most recently from 8fcb3a6 to fc7994f on October 29, 2025 13:15
Collaborator

@alecgrieser alecgrieser left a comment

This has obviously taken a bit of time, but this is part one of the review. It covers:

  1. The HNSW class and core algorithm
  2. The Node and NodeKind classes

Still to look at are:

  1. The StorageAdapter and implementations
  2. Change sets
  3. Any of the changes to the linear and RaBitQ packages
  4. All tests

As is hopefully clear from the review, a lot of what's in it is requests for clarification. Some of these should probably turn into comments.

I also think that it would be good to take another look at the teamscale findings. Most of those are also pretty minor, but it would be good to try to conform a bit more to them. I'm less concerned about things like method length, nesting, or number of parameters (especially for private methods), but it would be nice to take another look at them.

Overall, I think the approach makes sense, though. Nice!

result = 31 * result + Boolean.hashCode(keepPrunedConnections);
result = 31 * result + Boolean.hashCode(useRaBitQ);
result = 31 * result + raBitQNumExBits;
return result;
Collaborator

Any reason this isn't using Objects.hashCode?

Contributor Author

Integer.hashCode(x) is defined as x. When you generate hashCode() in IntelliJ, it just does it like this. I guess we could also use Objects.hashCode(...) from Guava.

Collaborator

Sorry, I meant Objects.hash, which combines the hashes of its constituents (multiplying intermediate results by 31) in a way that's just like what you did here.
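
For reference, a minimal sketch of the java.util.Objects.hash version, using the fields from the snippet above (any remaining fields would just be passed as further arguments):

@Override
public int hashCode() {
    // boxes the arguments and combines them with the same 31-multiplier scheme
    return Objects.hash(keepPrunedConnections, useRaBitQ, raBitQNumExBits);
}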

Contributor Author

They all now use Guava's hashCode()/equals(). Not sure what I did in my IntelliJ setup; older classes still seemed to have been generated Guava-style.

Collaborator

Could we use the JDK's Objects.hash instead of Guava's Objects.hashCode()? Under the covers, those are actually the same, and it's not like we don't already have Guava in the code base, but it would be nice to limit Guava a bit here.

final int efSearch,
@Nonnull final Map<Tuple, Node<N>> nodeCache,
@Nonnull final RealVector queryVector) {
final Set<Tuple> visited = Sets.newConcurrentHashSet(NodeReference.primaryKeys(nodeReferences));
Collaborator

I'm not actually sure how much is bought here by using concurrent data structures (including both the ConcurrentHashSet here as well as the PriorityBlockingQueues below). I think there are some critical sections that won't be safe if there are "real" concurrent accesses to these data structures (for example, inserting a new element into the nearest neighbors if and only if the old size is less than efSearch or the new element is closer to queryVector than the current furthest element). The current algorithm is not parallelized, so I don't think that this is a problem yet, or that using concurrent data structures is wrong, per se, though the story here is a bit incomplete.
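
To make the critical-section concern concrete, a hedged sketch of the kind of compound check-then-act that a concurrent collection by itself does not make atomic (variable and method names here are illustrative, not the PR's code):

// nearestNeighbors: a queue of the current best results, furthest element at the head
if (nearestNeighbors.size() < efSearch
        || candidate.getDistance() < nearestNeighbors.peek().getDistance()) {
    nearestNeighbors.add(candidate);   // another thread may have added an element in between
    if (nearestNeighbors.size() > efSearch) {
        nearestNeighbors.poll();       // so this trim can evict the wrong element
    }
}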

Contributor Author

@normen662 normen662 Oct 30, 2025

There are cases where e.g. the nodeCache can be updated concurrently since we use/potentially update the node cache in a forEach(...). BlockingQueues provide drainTo(...).

Collaborator

Yeah, I see now why a ConcurrentHashSet is probably necessary.

I don't think having drainTo on a BlockingQueue is quite justification enough, as it's simple enough to just poll() repeatedly from a non-blocking queue and get the same effect. If we want it for other concurrency guarantees, then fine.
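
For comparison, a minimal sketch of getting the drainTo effect from a plain (non-blocking) queue; target stands in for whatever collection the elements are drained into:

NodeReferenceWithDistance next;
while ((next = candidates.poll()) != null) {
    target.add(next);   // poll() returns null once the queue is empty
}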

Contributor Author

Changed PriorityBlockingQueue to PriorityQueue.

@SuppressWarnings("checkstyle:AbbreviationAsWordInName")
@Tag(Tags.RequiresFDB)
@Tag(Tags.Slow)
public class HNSWTest {
Collaborator

The division into tests here actually suggests how the HNSW test might be broken up to be more manageable in size, and then unit tested in parts.

@normen662 normen662 force-pushed the hnsw-on-linear branch 2 times, most recently from 267e633 to 1963a0a on October 30, 2025 15:41
Collaborator

@alecgrieser alecgrieser left a comment

Okay, this adds more to the review, in particular focusing on the storage serialization/deserialization. I still have:

  1. The changes to the other packages
  2. Tests
  3. Looking at updates since the last review

/**
* Subspace for (mostly) statistical analysis (like finding a centroid, etc.). Contains samples of vectors.
*/
byte SUBSPACE_PREFIX_SAMPLES = 0x03;
Collaborator

What happened to 0x02? I see you have 0x00, 0x01, and 0x03.

It also might be a bit misleading to make these bytes, as I would assume that with a byte, we'd store things under specific byte prefixes. Something like:

public Subspace getDataSubspace() {
    byte[] baseSubspace = getSubspace().pack();
    byte[] dataSubspace = ByteArrayUtil.join(baseSubspace, new byte[]{SUBSPACE_PREFIX_DATA});
    return new Subspace(dataSubspace);
}

That's not what happens, and I think it's not what we'd want to happen either. Instead, we do something like:

public Subspace getDataSubspace() {
    return getSubspace().subspace(Tuple.from(SUBSPACE_PREFIX_DATA));
}

That would implicitly cast SUBSPACE_PREFIX_DATA to an integer and then use the tuple encoding for the integer value, which is actually probably better, as it gives us a bit more flexibility and enforces the invariant that all of our keys are tuple parseable (see: #3566 (comment)). But if we do that, we might as well make these constants longs or ints.

Contributor Author

Some people have trouble counting. :-)

Contributor Author

Do you want me to make them int or long? I kind of like these things as tight as possible, type-wise. So unless we want to use a 0xDEADBEEF or something at some point, I kind of like it better this way.

Collaborator

I was thinking longs. That is what we use in FDBRecordStoreKeyspace, for example, as well as what the KeySpacePath expects for integral tuple fields.

I'm not sure I buy your argument here that it's better if the Java objects here have fewer bits. In particular, I think byte is more appropriate if we're somehow leveraging the fact that it's already a byte in the encoding (e.g., appending it to some byte[]), not if it's just some integral value, like it is here. But I guess it doesn't really matter at the end of the day.

* @return the subspace containing the data, which is guaranteed to be non-null
*/
@Nonnull
Subspace getDataSubspace();
Collaborator

Is this not a default method because the AbstractStorageAdapter caches the result of constructing the data subspace from the subspace? We'd still expect all of the implementations to consistently return the same value for the data subspace, right?

I assume the data subspace is the most important one to cache because it is accessed the most. Do you think we'd want to cache the other two as well?
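
For illustration, a hedged sketch of what a default-method variant might look like, assuming the interface also exposes getSubspace() and the SUBSPACE_PREFIX_DATA constant:

@Nonnull
default Subspace getDataSubspace() {
    // recomputed on every call; AbstractStorageAdapter caches the result instead
    return getSubspace().subspace(Tuple.from(SUBSPACE_PREFIX_DATA));
}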

Contributor Author

@normen662 normen662 Nov 3, 2025

I meant to add those as well, but the other subspaces are only used in a static context, as they are accessed outside the concept of a layer.

/**
* Subspace for data.
*/
byte SUBSPACE_PREFIX_DATA = 0x01;
Collaborator

I'm not quite sure how much this matters, but the encoding for 1 is two bytes long (\x15\x01), whereas the encoding for 0 is only one byte (\x14). We may be able to keep the keys slightly shorter if we make SUBSPACE_PREFIX_DATA be zero instead of SUBSPACE_PREFIX_INDEX_ACCESS_INFO. The value null (and false and true) also all have one byte tuple codes, if we like those better for some reason.

All of this is a 1 byte (per key) thing, and I'm not sure it matters as much on RocksDB as it did on sqlite, as RocksDB compression should be able to compress away common prefixes anyway.
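
Concretely, a quick sketch of the tuple encodings in question (not code from the PR):

Tuple.from(0L).pack();   // -> 0x14        single type code for integer zero
Tuple.from(1L).pack();   // -> 0x15 0x01   type code for a one-byte positive integer, plus the byte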

Contributor Author

I reversed DATA and ACCESS_INFO; DATA is now 0x00.

* @param maxNumRead the maximum number of nodes to return in this scan
* @return an {@link Iterable} that provides the nodes found in the specified layer range
*/
Iterable<Node<N>> scanLayer(@Nonnull ReadTransaction readTransaction, int layer, @Nullable Tuple lastPrimaryKey,
Collaborator

Should this return an AsyncIterable? Or should it return a CompletableFuture<Collection<Node<N>>> or something like that, and then it always reads (up to) maxNumRead and returns them in the future?

Contributor Author

I think I can make that an AsyncIterable. While more idiomatic, it doesn't buy us anything as the caller is just going to iterate over this in a non-asynchronous way anyway. But it's definitely an improvement.

Contributor Author

I made it a regular Iterable, as it is fine for debugging, and the code in InliningStorageAdapter needs to reduce the neighbor entries back into nodes, which requires some more complicated logic there.

neighborsBuilder.add(neighbor);
}

// there may be a rest; throw it away
Collaborator

Does this not have the problem that the last node in the layer will never get added to the list? The idea being that because it always ignores the final contents of neighborsBuilder (because it doesn't know if there are more neighbors or not), it will never return the final node in the layer

Contributor Author

The code driving this will keep executing it until there is no new node to be read anymore. So if you throw away some entries here, the next invocation will pick them up. The assumption, which I think is a fair one (also given that this is debug code), is that a batch size can at least accommodate one node's data.

Collaborator

Yes, but what about the very last node in the layer? As far as I can tell, the only way it knows to return a node is by finding a key that is not part of that node, but for the very last node in the layer, no such key exists.
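
A minimal sketch of the flush-on-key-change pattern under discussion (hypothetical helper names, not the PR's actual code), to make the edge case explicit:

Tuple currentPrimaryKey = null;
List<Neighbor> neighborsBuilder = new ArrayList<>();
for (KeyValue kv : rangeRead) {                          // keys sorted by (primaryKey, neighborKey)
    Tuple primaryKey = primaryKeyFromKey(kv.getKey());   // hypothetical helper
    if (currentPrimaryKey != null && !primaryKey.equals(currentPrimaryKey)) {
        emit(nodeFrom(currentPrimaryKey, neighborsBuilder));  // a node is only flushed on a key change
        neighborsBuilder = new ArrayList<>();
    }
    currentPrimaryKey = primaryKey;
    neighborsBuilder.add(neighborFrom(kv));               // hypothetical helper
}
// without a final flush here, the last node of a batch is deferred to the next scan;
// for the very last node of the layer there is no further key to trigger the flush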

@Nonnull final NeighborsChangeSet<NodeReference> neighborsChangeSet) {
final byte[] key = getDataSubspace().pack(Tuple.from(layer, node.getPrimaryKey()));

final List<Object> nodeItems = Lists.newArrayListWithExpectedSize(3);
Collaborator

This is another serialization form that we'd probably use protobuf for if protobuf was available to us. We don't really gain anything by using tuples here, as we don't need the values to be sorted. There's some amount of overhead to tuples that comes from the fact that their serialization needs to preserve order, but not that much. It's probably fine to continue using tuples unless we wanted to use something like protobuf for some other kind of benefit (like a clearer evolution path)
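
For context, a hedged sketch of the tuple-as-value serialization being referred to (the item accessors are elided; transaction is assumed to be in scope):

// values are written as packed tuples even though ordering is never needed for values
final List<Object> nodeItems = Lists.newArrayListWithExpectedSize(3);
// ... add the node's serialized fields (vector, neighbors, ...) to nodeItems ...
transaction.set(key, Tuple.fromList(nodeItems).pack());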

@Nonnull final RealVector queryVector) {
final Set<Tuple> visited = Sets.newConcurrentHashSet(NodeReference.primaryKeys(entryNeighbors));
final Queue<NodeReferenceWithDistance> candidates =
new PriorityBlockingQueue<>(config.getM(),
Collaborator

I suppose it's true that candidates is not bound by efSearch because we don't trim the candidates when we add a new one. But we only add a new one if it's in the historical top efSearch, so...I'm not convinced that M is a better guess here for the initial capacity. But as you say, it is only a guess for the initial capacity, so it may not matter

* @param maxNumRead the maximum number of nodes to return in this scan
* @return an {@link Iterable} that provides the nodes found in the specified layer range
*/
Iterable<AbstractNode<N>> scanLayer(@Nonnull ReadTransaction readTransaction, int layer,
Collaborator

Should this return an AsyncIterable? The implementation actually seems to.

Contributor Author

The implementation in InliningStorageAdapter actually produces a regular Iterable; making it an AsyncIterable is tricky without creating an asynchronous reduce operator on AsyncIterable.

@normen662 normen662 force-pushed the hnsw-on-linear branch 3 times, most recently from 4bd5e3e to 3bd17bc on November 2, 2025 20:44