Conversation

Contributor

@normen662 normen662 commented Oct 22, 2025

This PR implements the algorithm from the HNSW paper using the recently introduced linear package together with RaBitQ.

@normen662 normen662 force-pushed the hnsw-on-linear branch 4 times, most recently from 4cfef2a to 3a04055 on October 24, 2025 16:49
@normen662 normen662 requested a review from alecgrieser October 28, 2025 19:17
@normen662 normen662 added the enhancement label Oct 28, 2025
@normen662 normen662 force-pushed the hnsw-on-linear branch 3 times, most recently from 8fcb3a6 to fc7994f on October 29, 2025 13:15
Collaborator

@alecgrieser alecgrieser left a comment

This has obviously taken a bit of time, but this is part one of the review. It covers:

  1. The HNSW class and core algorithm
  2. The Node and NodeKind classes

Still to look at are:

  1. The StorageAdapter and implementations
  2. Change sets
  3. Any of the changes to the linear and RaBitQ packages
  4. All tests

As is hopefully clear from the review, a lot of what's in it is requests for clarification. Some of these should probably turn into comments.

I also think that it would be good to take another look at the teamscale findings. Most of those are also pretty minor, but it would be good to try to conform a bit more to them. I'm less concerned about things like method length, nesting, or number of parameters (especially for private methods), but it would be nice to take another look at them.

Overall, I think the approach makes sense, though. Nice!

result = 31 * result + Boolean.hashCode(keepPrunedConnections);
result = 31 * result + Boolean.hashCode(useRaBitQ);
result = 31 * result + raBitQNumExBits;
return result;
Collaborator

Any reason this isn't using Objects.hashCode?

Contributor Author

Integer.hashCode(x) is defined as x. When you generate hashCode() in IntelliJ, it just does it like this. I guess we could also use Objects.hashCode(...) from Guava.

Collaborator

Sorry, I meant Objects.hash, which combines the hashes of its constituents (multiplying intermediate results by 31) in a way that's just like what you did here.
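
For reference, a minimal sketch of the java.util.Objects.hash version, using the fields from the snippet above (any remaining fields would just be passed as further arguments):

@Override
public int hashCode() {
    // boxes the arguments and combines them with the same 31-multiplier scheme
    return Objects.hash(keepPrunedConnections, useRaBitQ, raBitQNumExBits);
}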

Contributor Author

They all now use Guava's hashCode()/equals(). Not sure what I did in my IntelliJ setup; older classes still seemed to have been generated Guava-style.

Collaborator

Could we use the JDK's Objects.hash instead of Guava's Objects.hashCode()? Under the covers, those are actually the same, and it's not like we don't already have Guava in the code base, but it would be nice to limit Guava a bit here.

final int efSearch,
@Nonnull final Map<Tuple, Node<N>> nodeCache,
@Nonnull final RealVector queryVector) {
final Set<Tuple> visited = Sets.newConcurrentHashSet(NodeReference.primaryKeys(nodeReferences));
Collaborator

I'm not actually sure how much is bought here by using concurrent data structures (including both the ConcurrentHashSet here as well as the PriorityBlockingQueues below). I think there are some critical sections that won't be safe if there are "real" concurrent accesses to these data structures (for example, inserting a new element into the nearest neighbors if and only if the old size is less than efSearch or the new element is closer to queryVector than the current furthest element). The current algorithm is not parallelized, so I don't think that this is a problem yet, or that using concurrent data structures is wrong, per se, though the story here is a bit incomplete.
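
To make the critical-section concern concrete, a hedged sketch of the kind of compound check-then-act that a concurrent collection by itself does not make atomic (variable and method names here are illustrative, not the PR's code):

// nearestNeighbors: a queue of the current best results, furthest element at the head
if (nearestNeighbors.size() < efSearch
        || candidate.getDistance() < nearestNeighbors.peek().getDistance()) {
    nearestNeighbors.add(candidate);   // another thread may have added an element in between
    if (nearestNeighbors.size() > efSearch) {
        nearestNeighbors.poll();       // so this trim can evict the wrong element
    }
}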

Contributor Author

@normen662 normen662 Oct 30, 2025

There are cases where e.g. the nodeCache can be updated concurrently since we use/potentially update the node cache in a forEach(...). BlockingQueues provide drainTo(...).

Collaborator

Yeah, I see now why a ConcurrentHashSet is probably necessary.

I don't think having drainTo on a BlockingQueue is quite justification enough, as it's simple enough to just poll() repeatedly from a non-blocking queue and get the same effect. If we want it for other concurrency guarantees, then fine.
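
For comparison, a minimal sketch of getting the drainTo effect from a plain (non-blocking) queue; target stands in for whatever collection the elements are drained into:

NodeReferenceWithDistance next;
while ((next = candidates.poll()) != null) {
    target.add(next);   // poll() returns null once the queue is empty
}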

Contributor Author

Changed PriorityBlockingQueue to PriorityQueue.

@SuppressWarnings("checkstyle:AbbreviationAsWordInName")
@Tag(Tags.RequiresFDB)
@Tag(Tags.Slow)
public class HNSWTest {
Collaborator

The division into tests here actually suggests how the HNSW test might be broken up to be more manageable in size, and then unit tested in parts.

@normen662 normen662 force-pushed the hnsw-on-linear branch 2 times, most recently from 267e633 to 1963a0a on October 30, 2025 15:41
Collaborator

@alecgrieser alecgrieser left a comment

Okay, this adds more to the review, in particular focusing on the storage serialization/deserialization. I still have:

  1. The changes to the other packages
  2. Tests
  3. Looking at updates since the last review

/**
* Subspace for (mostly) statistical analysis (like finding a centroid, etc.). Contains samples of vectors.
*/
byte SUBSPACE_PREFIX_SAMPLES = 0x03;
Collaborator

What happened to 0x02? I see you have 0x00, 0x01, and 0x03.

It also might be a bit misleading to make these bytes, as I would assume that with a byte, we'd store things under specific byte prefixes. Something like:

public Subspace getDataSubspace() {
    byte[] baseSubspace = getSubspace().pack();
    byte[] dataSubspace = ByteArrayUtil.join(baseSubspace, new byte[]{SUBSPACE_PREFIX_DATA});
    return new Subspace(dataSubspace);
}

That's not what happens, and I think it's not what we'd want to happen either. Instead, we do something like:

public Subspace getDataSubspace() {
    return getSubspace().subspace(Tuple.from(SUBSPACE_PREFIX_DATA));
}

That would implicitly cast SUBSPACE_PREFIX_DATA to an integer and then use the tuple encoding for the integer value, which is actually probably better, as it gives us a bit more flexibility and enforces the invariant that all of our keys are tuple parseable (see: #3566 (comment)). But if we do that, we might as well make these constants longs or ints.

Contributor Author

Some people have trouble counting. :-)

Contributor Author

Do you want me to make them int or long? I kind of like these things as tight as possible, type-wise. So unless we want to use a 0xDEADBEEF or something at some point, I kind of like it better this way.

Collaborator

I was thinking longs. That is what we use in FDBRecordStoreKeyspace, for example, as well as what the KeySpacePath expects for integral tuple fields.

I'm not sure I buy your argument here that it's better if the Java objects here have fewer bits. In particular, I think byte is more appropriate if we're somehow leveraging the fact that it's already a byte in the encoding (e.g., appending it to some byte[]), not if it's just some integral value, like it is here. But I guess it doesn't really matter at the end of the day.

* @return the subspace containing the data, which is guaranteed to be non-null
*/
@Nonnull
Subspace getDataSubspace();
Collaborator

Is this not a default method because the AbstractStorageAdapter caches the result of constructing the data subspace from the subspace? We'd still expect all of the implementations to consistently return the same value for the data subspace, right?

I assume the data subspace is the most important one to cache because it is accessed the most. Do you think we'd want to cache the other two as well?
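
For illustration, a hedged sketch of what a default-method variant might look like, assuming the interface also exposes getSubspace() and the SUBSPACE_PREFIX_DATA constant:

@Nonnull
default Subspace getDataSubspace() {
    // recomputed on every call; AbstractStorageAdapter caches the result instead
    return getSubspace().subspace(Tuple.from(SUBSPACE_PREFIX_DATA));
}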

Contributor Author

@normen662 normen662 Nov 3, 2025

I meant to add those as well, but the other subspaces are only used in a static context, as they are accessed outside the concept of a layer.

/**
* Subspace for data.
*/
byte SUBSPACE_PREFIX_DATA = 0x01;
Collaborator

I'm not quite sure how much this matters, but the encoding for 1 is two bytes long (\x15\x01), whereas the encoding for 0 is only one byte (\x14). We may be able to keep the keys slightly shorter if we make SUBSPACE_PREFIX_DATA be zero instead of SUBSPACE_PREFIX_INDEX_ACCESS_INFO. The value null (and false and true) also all have one byte tuple codes, if we like those better for some reason.

All of this is a 1 byte (per key) thing, and I'm not sure it matters as much on RocksDB as it did on sqlite, as RocksDB compression should be able to compress away common prefixes anyway.
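
Concretely, a quick sketch of the tuple encodings in question (not code from the PR):

Tuple.from(0L).pack();   // -> 0x14        single type code for integer zero
Tuple.from(1L).pack();   // -> 0x15 0x01   type code for a one-byte positive integer, plus the byte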

Contributor Author

I reversed DATA and ACCESS_INFO; DATA is now 0x00.

* @param maxNumRead the maximum number of nodes to return in this scan
* @return an {@link Iterable} that provides the nodes found in the specified layer range
*/
Iterable<Node<N>> scanLayer(@Nonnull ReadTransaction readTransaction, int layer, @Nullable Tuple lastPrimaryKey,
Collaborator

Should this return an AsyncIterable? Or should it return a CompletableFuture<Collection<Node<N>>> or something like that, and then it always reads (up to) maxNumRead and returns them in the future?

Contributor Author

I think I can make that an AsyncIterable. While more idiomatic, it doesn't buy us anything as the caller is just going to iterate over this in a non-asynchronous way anyway. But it's definitely an improvement.

Contributor Author

I made it a regular Iterable, as it is fine for debugging, and the code in InliningStorageAdapter needs to reduce the neighbor entries back into nodes, which requires some more complicated logic there.

neighborsBuilder.add(neighbor);
}

// there may be a rest; throw it away
Collaborator

Does this not have the problem that the last node in the layer will never get added to the list? The idea being that because it always ignores the final contents of neighborsBuilder (because it doesn't know if there are more neighbors or not), it will never return the final node in the layer

Contributor Author

The code driving this will keep executing it until there is no new node to be read anymore. So if you throw away some entries here, the next invocation will pick them up. The assumption, which I think is a fair one (also given that this is debug code), is that a batch size can at least accommodate one node's data.

Collaborator

Yes, but what about the very last node in the layer? As far as I can tell, the only way it knows to return a node is by finding a key that is not part of that node, but for the very last node in the layer, no such key exists.
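
A minimal sketch of the flush-on-key-change pattern under discussion (hypothetical helper names, not the PR's actual code), to make the edge case explicit:

Tuple currentPrimaryKey = null;
List<Neighbor> neighborsBuilder = new ArrayList<>();
for (KeyValue kv : rangeRead) {                          // keys sorted by (primaryKey, neighborKey)
    Tuple primaryKey = primaryKeyFromKey(kv.getKey());   // hypothetical helper
    if (currentPrimaryKey != null && !primaryKey.equals(currentPrimaryKey)) {
        emit(nodeFrom(currentPrimaryKey, neighborsBuilder));  // a node is only flushed on a key change
        neighborsBuilder = new ArrayList<>();
    }
    currentPrimaryKey = primaryKey;
    neighborsBuilder.add(neighborFrom(kv));               // hypothetical helper
}
// without a final flush here, the last node of a batch is deferred to the next scan;
// for the very last node of the layer there is no further key to trigger the flush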

@Nonnull final NeighborsChangeSet<NodeReference> neighborsChangeSet) {
final byte[] key = getDataSubspace().pack(Tuple.from(layer, node.getPrimaryKey()));

final List<Object> nodeItems = Lists.newArrayListWithExpectedSize(3);
Collaborator

This is another serialization form that we'd probably use protobuf for if protobuf was available to us. We don't really gain anything by using tuples here, as we don't need the values to be sorted. There's some amount of overhead to tuples that comes from the fact that their serialization needs to preserve order, but not that much. It's probably fine to continue using tuples unless we wanted to use something like protobuf for some other kind of benefit (like a clearer evolution path)
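
For context, a hedged sketch of the tuple-as-value serialization being referred to (the item accessors are elided; transaction is assumed to be in scope):

// values are written as packed tuples even though ordering is never needed for values
final List<Object> nodeItems = Lists.newArrayListWithExpectedSize(3);
// ... add the node's serialized fields (vector, neighbors, ...) to nodeItems ...
transaction.set(key, Tuple.fromList(nodeItems).pack());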

@Nonnull final RealVector queryVector) {
final Set<Tuple> visited = Sets.newConcurrentHashSet(NodeReference.primaryKeys(entryNeighbors));
final Queue<NodeReferenceWithDistance> candidates =
new PriorityBlockingQueue<>(config.getM(),
Collaborator

I suppose it's true that candidates is not bound by efSearch because we don't trim the candidates when we add a new one. But we only add a new one if it's in the historical top efSearch, so...I'm not convinced that M is a better guess here for the initial capacity. But as you say, it is only a guess for the initial capacity, so it may not matter

* @param maxNumRead the maximum number of nodes to return in this scan
* @return an {@link Iterable} that provides the nodes found in the specified layer range
*/
Iterable<AbstractNode<N>> scanLayer(@Nonnull ReadTransaction readTransaction, int layer,
Collaborator

Should this return an AsyncIterable? The implementation actually seems to.

Contributor Author

The implementation in InliningStorageAdapter actually produces a regular Iterable; making it an AsyncIterable is tricky without creating an asynchronous reduce operator on AsyncIterable.

@normen662 normen662 force-pushed the hnsw-on-linear branch 3 times, most recently from 4bd5e3e to 3bd17bc on November 2, 2025 20:44