Conversation

@Acly (Collaborator) commented Oct 29, 2025:

This moves the buffer/uma/offset/size computations into a function, which reduces the amount of repetitive code in ggml_vk_op_f32 considerably. There are more places outside ggml_vk_op_f32 where it could be reused; I haven't adapted those yet.
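For illustration, a standalone sketch of what such a helper amounts to (the names `vk_subbuffer_sketch` and `make_subbuffer` are made up for this example, not the actual PR code): Vulkan requires descriptor-range offsets to be aligned to `minStorageBufferOffsetAlignment`, so the tensor's offset is rounded down and the shader skips the first `misalign_bytes` of the range.

```cpp
#include <cstddef>

// Hypothetical sketch: given a tensor starting at `tensor_offset` inside a
// backing buffer, compute the descriptor sub-range. The offset is aligned
// down (offset_alignment must be a power of two), and the misalignment
// prefix is added to the size so the full tensor stays inside the range.
struct vk_subbuffer_sketch {
    size_t offset;  // aligned-down start of the descriptor range
    size_t size;    // range length, including the misalignment prefix
};

inline vk_subbuffer_sketch make_subbuffer(size_t tensor_offset,
                                          size_t tensor_nbytes,
                                          size_t offset_alignment) {
    const size_t aligned_offset = tensor_offset & ~(offset_alignment - 1);
    const size_t misalign_bytes = tensor_offset - aligned_offset;
    return {aligned_offset, tensor_nbytes + misalign_bytes};
}
```

With an alignment of 64, a tensor at byte offset 100 yields a range starting at 64 with 36 bytes of misalignment padding in front.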

I replicated the existing logic, but I don't really understand why the branching is needed for the sub-buffer size computation. From what I can see and test, it doesn't change the result; maybe this can be simplified? Or is there a corner case I'm not thinking of? (see questions in the comments)

```cpp
size_t size;
if (support_incontiguous) {
    size = ggml_nbytes(tensor) + misalign_bytes;
    if (offset + size >= buffer->size) {
```

@Acly (Collaborator, Author) commented:
This replicates the branch here https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-vulkan/ggml-vulkan.cpp#L8620-L8631

Why is this needed? The case offset + size > buffer->size sounds like it should never happen, and in the case where they're equal, the branch computes the same value that's already there.

It might clamp the size to maxStorageBufferRange, but that seems like it would likely produce wrong results and should rather assert/abort?

Collaborator commented:

I think some of this logic is only here for historical reasons at this point. Early on, some shaders didn't have thorough bounds-checking logic and relied on robustBufferAccess plus these tight bounds. But robustBufferAccess isn't as reliable as it sounds, and we've improved the bounds-checking logic over time.

To my knowledge, the only remaining shaders that intentionally rely on robustBufferAccess bounds checking are the scalar/coopmat1 mul_mm shaders (not sure about mmq, but probably that too). These need to do the A/B loads in a way that allows the compiler to batch them, so manual bounds checking would probably interfere and cause a slowdown.

So, I think all of the current users of ggml_vk_op_f32 (and everything else except these mul_mm shaders) could just use ggml_vk_get_max_buffer_range.
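For reference, the clamping such a function performs can be sketched like this (a hypothetical standalone version, not the actual `ggml_vk_get_max_buffer_range` code): bind everything from the offset to the end of the buffer, capped at the device's maxStorageBufferRange. Shaders with their own bounds checking never read past the tensor, so the loose upper bound is harmless; shaders relying on robustBufferAccess are the ones that need the tight bound instead.

```cpp
#include <cstddef>

// Hypothetical sketch: the widest descriptor range usable from `offset`,
// limited by both the buffer's end and maxStorageBufferRange. Assumes
// offset <= buffer_size.
inline size_t max_buffer_range_sketch(size_t offset, size_t buffer_size,
                                      size_t max_storage_buffer_range) {
    const size_t remaining = buffer_size - offset;
    return remaining < max_storage_buffer_range ? remaining
                                                : max_storage_buffer_range;
}
```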

Collaborator commented:

Yes, I created a lot of that code for a very different version of the backend, and also when I knew a lot less about Vulkan than I do now. It's very possible there are checks and branches that are no longer needed.

@Acly (Collaborator, Author) commented:

Thanks for the feedback. I simplified the size to just ggml_nbytes(tensor) + misalign_bytes; this should be the actual bound, and it's what the other branches ended up computing anyway.

Using ggml_vk_get_max_buffer_range also works, but it feels less straightforward, and it's still unclear to me whether there's a reason to prefer it.

I have now adapted all ops except for the mul_mat variants; they use a more complex size computation, and I also want to avoid conflicts with #16868.

Collaborator commented:

ggml_vk_get_max_buffer_range is necessary for some operations where the size of the tensor is greater than 4GB (or greater than maxStorageBufferRange), which would otherwise lead to an invalid descriptor range being set. Please try enabling these ifdef'd-out tests:

```cpp
#if 0
    // >4GB im2col destination. Too slow to run by default.
    // Test cases taken from Wan2.1 T2V 1.3B.
...
#if 0
    // > 4GB A matrix. Too slow to be enabled by default.
...
```
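To illustrate the failure mode with hypothetical numbers: maxStorageBufferRange is often capped at 0xFFFFFFFF (4 GiB - 1), so a tensor larger than that cannot be covered by a single descriptor range, and using the tensor's byte size directly would request an invalid range. The naive-vs-clamped choice can be sketched as:

```cpp
#include <cstdint>

// Hypothetical illustration: a storage-buffer descriptor range must not
// exceed maxStorageBufferRange. For a larger tensor the range has to be
// clamped; the shader then reaches the rest via buffer device addresses
// or its own offset arithmetic.
inline uint64_t descriptor_range_for(uint64_t tensor_bytes,
                                     uint64_t max_storage_buffer_range) {
    return tensor_bytes < max_storage_buffer_range ? tensor_bytes
                                                   : max_storage_buffer_range;
}
```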

@Acly (Collaborator, Author) commented:

I thought that's what this line was for:

```cpp
    } else if (op == GGML_OP_IM2COL || op == GGML_OP_IM2COL_3D) {
        if (ctx->device->shader_int64 && ctx->device->buffer_device_address) {
            // buffer device address path doesn't use dst buffer
            d_sz = 1;
        }
```

(and I haven't changed mul_mat so far)

IMO it would be even better to not have the binding in the shader at all (put it in a #else): why is it there if it cannot be bound properly and isn't used? Instead of setting d_sz = 1, it could just dispatch right there without the dst buffer.

Good hint about having to enable those tests, though; I hadn't done that. Unfortunately they just OOM on my 12GB GPU, so I will find some other way to check them.

github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 29, 2025
@Acly force-pushed the vulkan-op-subbuffer branch from 09475f8 to edacb52 on October 31, 2025 14:23
@Acly Acly marked this pull request as ready for review October 31, 2025 14:58