improve CUDA cpy memory bandwidth when copying transposed tensor #16841
base: master
Conversation
…avoid uncoalesced access; test cases also added showing improved memory bandwidth
Any reason for not including BF16?

Just added support for BF16. Other quantized types may also have this problem, and the fix is not as straightforward as for FP32 etc.
Probably only relevant for quantized KV cache (perhaps not even then, unsure), so not a big issue.
@bssrdf this is failing the CI, do you mind taking a look?
Yeah, something is wrong. Using the OP equality to make a decision is likely not robust. Will look into it. Thanks.

Edit: I ran it and I am not sure what's going on. The 3 scores are the same, but it passed on my machine and failed in CI.
Try running it through compute-sanitizer.
In terms of memory bandwidth for FP16 and BF16, 32*2=64 bytes is still suboptimal vs. the 64*2=128 bytes that you could be transferring. The shared memory banks similarly have the issue where they are best suited for transfers of 4, 8, or 16 bytes. But then the handling of the 2 byte datatypes becomes more tricky. In mma.cuh there is a function ggml_cuda_movmatrix that you can use to transpose a matrix. To be clear, what I'm suggesting is optional and we can also move towards merging the kernel as-is.
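For illustration of the point about wider per-thread transfers, here is a minimal sketch (not the kernel in this PR, and not using ggml_cuda_movmatrix) of a 16-bit transpose where each thread moves two consecutive elements as one half2, so every global load and store is 4 bytes wide per thread instead of 2. It assumes a plain dense rows x cols matrix, dimensions that are multiples of the tile size, and 4-byte-aligned base pointers:

```cuda
// Illustration only: transpose a dense rows x cols half matrix while keeping
// every global load/store 4 bytes wide per thread (one half2 instead of one half).
// Assumes rows and cols are multiples of TILE and 4-byte-aligned pointers.
// Launch: grid(cols / TILE, rows / TILE), block(TILE / 2, TILE).
#include <cuda_fp16.h>

#define TILE 32

__global__ void transpose_f16_half2(const half * __restrict__ src,
                                    half * __restrict__ dst,
                                    int rows, int cols) {
    // pad each row by one 4-byte bank to reduce bank conflicts
    // on the column-wise reads in the second phase
    __shared__ half tile[TILE][TILE + 2];

    const int x = blockIdx.x * TILE + threadIdx.x * 2; // src column (contiguous dim)
    const int y = blockIdx.y * TILE + threadIdx.y;     // src row

    // coalesced 4-byte load of two consecutive src elements
    const half2 v = *reinterpret_cast<const half2 *>(src + (size_t) y * cols + x);
    tile[threadIdx.y][threadIdx.x * 2 + 0] = __low2half(v);
    tile[threadIdx.y][threadIdx.x * 2 + 1] = __high2half(v);

    __syncthreads();

    const int xt = blockIdx.y * TILE + threadIdx.x * 2; // dst column = src row index
    const int yt = blockIdx.x * TILE + threadIdx.y;     // dst row    = src column index

    // coalesced 4-byte store of two consecutive dst elements
    const half2 o = __halves2half2(tile[threadIdx.x * 2 + 0][threadIdx.y],
                                   tile[threadIdx.x * 2 + 1][threadIdx.y]);
    *reinterpret_cast<half2 *>(dst + (size_t) yt * rows + xt) = o;
}
```

With this simple block shape a warp's global accesses still split into two 64-byte segments, and the shared tile is still touched with 2-byte accesses; getting single 128-byte transactions on both sides is exactly where the 16-bit case becomes tricky and where a warp-level transpose via ggml_cuda_movmatrix would be the cleaner route.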
@JohannesGaessler, thanks for the comments. I realized the logic of triggering the transposed copy is not right. I am working on a more general way of copying. The bottom line is to avoid uncoalesced access as much as possible, since it really reduces the bandwidth.
While working on #15805, I noticed the current cpy_flt kernel has significant uncoalesced global memory access. This is particularly bad if one tries to make a transposed tensor contiguous via cur = ggml_cont(ctx, ggml_transpose(ctx, cur));. Some simple benchmarks were run with test-backend-ops perf -o CPY on a 4090 (master, using permute to simulate the transpose).
Using shared memory is a common way to get coalesced global memory access. I implemented another copy kernel and got a 3x-4x boost with this PR when the src tensor is transposed.
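As a rough illustration of the shared-memory approach described above, reduced to a dense row-major 2D float case with dimensions that are multiples of the tile size (this sketch is not the kernel added by the PR, which also has to respect ggml's generic strides):

```cuda
// Illustration only: copy a transposed src into a contiguous dst by staging
// TILE x TILE blocks in shared memory, so both the global reads and the
// global writes are coalesced. Assumes ne0 and ne1 are multiples of TILE.
// Launch: grid(ne0 / TILE, ne1 / TILE), block(TILE, TILE).
#define TILE 32

__global__ void cpy_f32_transpose_tile(const float * __restrict__ src,
                                       float * __restrict__ dst,
                                       int ne0, int ne1) { // src: ne1 rows x ne0 cols
    // +1 padding avoids shared-memory bank conflicts on the transposed read
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x; // src column (contiguous dim)
    int y = blockIdx.y * TILE + threadIdx.y; // src row

    // coalesced read: adjacent threads read adjacent src elements
    tile[threadIdx.y][threadIdx.x] = src[(size_t) y * ne0 + x];

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;     // dst column (former src row index)
    y = blockIdx.x * TILE + threadIdx.y;     // dst row    (former src column index)

    // coalesced write: adjacent threads write adjacent dst elements
    dst[(size_t) y * ne1 + x] = tile[threadIdx.x][threadIdx.y];
}
```

Without the shared tile, either the reads or the writes are strided by a whole row, which is the uncoalesced pattern the plain copy path hits for this case.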
Currently I use the src tensor's OP == TRANSPOSE to let ggml_cpy_flt_cuda pick the customized transposed copy. I am not sure if there is a better way to make this choice. Your suggestions are welcome.
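One possible alternative to the OP comparison is to detect the transposed layout from the strides of the view itself; a minimal sketch for the plain 2D FP32 case (the helper name is illustrative, not code from this PR):

```cuda
#include "ggml.h"

// Illustrative only. After ggml_transpose of a contiguous 2D f32 tensor the
// view's strides are swapped: nb[1] == sizeof(float) (dim 1 is contiguous)
// while nb[0] == ne[1]*sizeof(float). Checking the layout directly would also
// catch equivalent views produced by other ops (e.g. permute, which the
// benchmark above uses), which an op == GGML_OP_TRANSPOSE check misses.
static bool is_transposed_2d_f32(const ggml_tensor * t) {
    return t->type == GGML_TYPE_F32 &&
           t->ne[2] == 1 && t->ne[3] == 1 &&               // plain 2D case only
           t->nb[1] == sizeof(float) &&                     // dim 1 contiguous
           t->nb[0] == (size_t) t->ne[1] * sizeof(float);   // dim 0 strided by a column
}
```

With a check like this, ggml_cpy_flt_cuda could fall back to the existing path whenever the layout test fails or dst is not contiguous.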