vulkan : refactor buffer handling in vk_op_f32 #16840
base: master
Conversation
ggml/src/ggml-vulkan/ggml-vulkan.cpp (Outdated)
```cpp
size_t size;
if (support_incontiguous) {
    size = ggml_nbytes(tensor) + misalign_bytes;
    if (offset + size >= buffer->size) {
```
This replicates the branch here https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-vulkan/ggml-vulkan.cpp#L8620-L8631
Why is this needed? The case offset+size > buffer->size sounds like it should never happen. In the case where they're equal, the branch computes the same value that's already there.
It might clamp the size to maxStorageBufferRange but that seems like it would likely produce wrong results and should rather assert/abort?
I think some of this logic is only here for historical reasons at this point. Early on, some shaders didn't have thorough bounds checking logic and relied on robustBufferAccess and these tight bounds to accomplish bounds checking. But robustBufferAccess isn't as reliable as it sounds, and we've improved the bounds checking logic over time.
To my knowledge, the only remaining shader that intentionally relies on robustBufferAccess bounds checking is the scalar/coopmat1 mul_mm shaders (not sure about mmq, but probably that too). These need to do the A/B loads in a way that allows the compiler to batch them, so manual bounds checking would probably interfere and cause a slowdown.
So, I think all of the current users of ggml_vk_op_f32 (and everything else except these mul_mm shaders) could just use ggml_vk_get_max_buffer_range.
Yes, I created a lot of that code for a very different version of the backend, and also when I knew a lot less about Vulkan than I do now. It's very possible there are checks and branches that are no longer needed.
Thanks for the feedback. I simplified the size to just ggml_nbytes(tensor) + misalign_bytes - this should be the actual bound, and it is what the other branches ended up computing anyway.
Using ggml_vk_get_max_buffer_range also works, but feels less straightforward, and it's still unclear to me whether there's a reason to prefer it.
I have now adapted all ops except for the mul_mat variants, which use more complex size computations. I also want to avoid conflicts with #16868.
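The simplified computation described above can be sketched roughly like this. Note this is a hedged illustration, not the actual patch: the struct and function names (`vk_subbuffer_info`, `get_subbuffer`) and the explicit alignment handling are invented for clarity.

```cpp
#include <cassert>
#include <cstddef>

// Hedged sketch of the simplified sub-buffer computation: the bound
// range is just the tensor's byte size plus the misalignment slack,
// with no branching on the parent buffer's size.
struct vk_subbuffer_info {
    size_t offset; // aligned-down start of the bound range
    size_t size;   // tensor bytes plus the misalignment slack
};

static vk_subbuffer_info get_subbuffer(size_t tensor_offset, size_t tensor_nbytes,
                                       size_t alignment) {
    // Align the offset down to the device's required alignment and
    // grow the size by the bytes that were skipped over, so the
    // descriptor range still covers the whole tensor.
    size_t misalign_bytes = tensor_offset % alignment;
    return { tensor_offset - misalign_bytes, tensor_nbytes + misalign_bytes };
}
```

For example, a tensor at byte offset 100 with a 32-byte alignment requirement binds at offset 96 with 4 extra bytes added to the range.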
ggml_vk_get_max_buffer_range is necessary for some operations when the size of the tensor is greater than 4GB (or greater than the maxBufferRange) and would lead to an invalid descriptor range being set. Please try enabling these ifdefed out tests:
```cpp
#if 0
    // >4GB im2col destination. Too slow to run by default.
    // Test cases taken from Wan2.1 T2V 1.3B.
...
#if 0
    // > 4GB A matrix. Too slow to be enabled by default.
...
```
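The clamping behavior being pointed out can be thought of roughly as follows. This is a sketch of the idea only; `clamp_buffer_range` is an invented name, and the real `ggml_vk_get_max_buffer_range` helper presumably also accounts for the buffer's own size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Why >4GB tensors need a clamp: VkDescriptorBufferInfo::range must
// not exceed the device's maxStorageBufferRange, which is a 32-bit
// quantity on many implementations, so a raw ggml_nbytes() of a >4GB
// tensor would produce an invalid descriptor range.
static size_t clamp_buffer_range(size_t wanted, size_t max_storage_buffer_range) {
    return std::min(wanted, max_storage_buffer_range);
}
```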
I thought that's what this line was for:
```cpp
    } else if (op == GGML_OP_IM2COL || op == GGML_OP_IM2COL_3D) {
        if (ctx->device->shader_int64 && ctx->device->buffer_device_address) {
            // buffer device address path doesn't use dst buffer
            d_sz = 1;
        }
```
(and mul_mat, which I haven't changed so far)
IMO it would be even better to not have the binding in the shader at all (put it in a #else). Why is it there if it cannot be bound properly and isn't used? Instead of setting d_sz = 1, the code could do the dispatch right there without the dst buffer.
Good hint about having to enable those tests, though - I hadn't done that. Unfortunately they just OOM on my 12GB GPU; I will find some way to check them.
Force-pushed 09475f8 to edacb52
This moves the buffer/uma/offset/size computations into a function, which reduces the amount of repetitive code in ggml_vk_op_f32 a lot. There are more places outside of vk_op_f32 where it can be reused; I didn't adapt those yet.
I replicated the existing logic, but don't really understand why the branching is needed for the sub-buffer size computation. From what I can see and test, it doesn't change the result - maybe this can be simplified? Or is there some corner case I can't think of? (see questions in the comments)