
Conversation

@bssrdf (Contributor) commented Oct 29, 2025

While working on #15805, I noticed the current cpy_flt kernel has significant uncoalesced global memory access.
[screenshot: cpy_op]

This is particularly bad if one tries to make a transposed tensor contiguous via cur = ggml_cont(ctx, ggml_transpose(ctx, cur));. Some simple benchmarks with test-backend-ops perf -o CPY on a 4090:

Master, using permute to simulate the transpose:

 CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0]):                132 runs -  7642.58 us/run -  1572864 kB/run -  200.73 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0]):                172 runs -  6662.13 us/run -   786432 kB/run -  113.89 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0]):                344 runs -  2966.67 us/run -   786432 kB/run -  255.75 GB/s

Using shared memory is a common way to achieve coalesced global memory access. I implemented another copy kernel and got a 3x-4x boost.
This PR, with the src tensor transposed:

 CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                484 runs -  2119.14 us/run -  1572864 kB/run -  723.92 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                817 runs -  1276.55 us/run -   786432 kB/run -  594.35 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):               1118 runs -   917.03 us/run -   786432 kB/run -  827.37 GB/s
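
For reference, here is a minimal sketch of the shared-memory tiling idea (illustrative only, not the exact kernel in this PR; the kernel name, tile sizes, and the rows/cols layout are assumptions): each block stages a 32x32 tile in shared memory so that both the global load and the transposed global store are coalesced, and the extra padding column keeps the transposed shared-memory reads free of bank conflicts.

```cuda
#include <cstdint>

// Illustrative shared-memory transposed copy for FP32.
// Assumes src is a rows x cols row-major matrix and dst is its cols x rows transpose.
#define TILE       32
#define BLOCK_ROWS  8   // launch with dim3 block(TILE, BLOCK_ROWS)

__global__ void cpy_transpose_f32_tiled(const float * __restrict__ src,
                                        float * __restrict__ dst,
                                        int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];   // +1 column of padding avoids bank conflicts

    const int x = blockIdx.x * TILE + threadIdx.x;        // source column
    for (int r = threadIdx.y; r < TILE; r += BLOCK_ROWS) {
        const int y = blockIdx.y * TILE + r;              // source row
        if (x < cols && y < rows) {
            tile[r][threadIdx.x] = src[(int64_t) y * cols + x];       // coalesced load
        }
    }
    __syncthreads();

    const int tx = blockIdx.y * TILE + threadIdx.x;       // dst column (= source row)
    for (int r = threadIdx.y; r < TILE; r += BLOCK_ROWS) {
        const int ty = blockIdx.x * TILE + r;             // dst row (= source column)
        if (tx < rows && ty < cols) {
            dst[(int64_t) ty * rows + tx] = tile[threadIdx.x][r];     // coalesced store
        }
    }
}
```

A grid of (ceil(cols/TILE), ceil(rows/TILE)) blocks covers the whole matrix; without the staging tile, either the load or the store would be strided by the row length and each warp would touch 32 separate cache lines.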

Currently I use the src tensor's op == GGML_OP_TRANSPOSE to let ggml_cpy_flt_cuda pick the customized transposed copy. I am not sure if there is a better way to make this choice; your suggestions are welcome.
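
To make that concrete, here is a hypothetical sketch of the dispatch choice (not the PR's actual code; the helper name is made up). The first variant is the op-based check described above; the second is one possible stride-based alternative, since a transposed 2D view of a contiguous tensor has its dim-0 stride larger than its dim-1 stride:

```cpp
#include "ggml.h"

// Hypothetical helper, not part of this PR: decide whether ggml_cpy_flt_cuda
// should take the transposed-copy path for a given source tensor.
static bool cpy_src_is_transposed(const ggml_tensor * src) {
    // Variant 1 (as described above): key off the source tensor's op.
    // return src->op == GGML_OP_TRANSPOSE;

    // Variant 2 (one possible alternative): look at the strides instead;
    // a transposed view has nb[0] (innermost stride) larger than nb[1].
    return src->nb[0] > src->nb[1];
}
```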

…void uncoalesced access; test cases also added showing improved memory bandwidth
@bssrdf requested a review from slaren as a code owner on October 29, 2025 12:54
@CISC (Collaborator) commented Oct 29, 2025

Any reason for not including BF16 as well?

@bssrdf (Contributor, Author) commented Oct 29, 2025

Any reason for not including BF16 as well?

Just added support for BF16. The quantized types may also have this problem, and the fix there is not as straightforward as for FP32 etc.

@CISC (Collaborator) commented Oct 29, 2025

The quantized types may also have this problem, and the fix there is not as straightforward as for FP32 etc.

Probably only relevant for quantized KV cache (perhaps not even then, unsure), so not a big issue.

@github-actions bot added the testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 29, 2025
@am17an (Collaborator) commented Oct 29, 2025

@bssrdf this is failing the CI, do you mind taking a look?

@bssrdf (Contributor, Author) commented Oct 29, 2025

@bssrdf this is failing the CI, do you mind taking a look?

Yeah, something is wrong. Using the OP equality to make a decision is likely not robust. Will look into it. Thanks.

Edit: I ran ci/run.sh locally on my machine (WSL2 Ubuntu). The rerank test passed.

rerank score 0:    0.171
rerank score 1:    0.169
rerank score 2:    0.189

0.00.608.024 I llama_perf_context_print:        load time =     321.37 ms
0.00.608.025 I llama_perf_context_print: prompt eval time =     105.71 ms /    62 tokens (    1.70 ms per token,   586.52 tokens per second)
0.00.608.026 I llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
0.00.608.027 I llama_perf_context_print:       total time =     106.26 ms /    63 tokens
0.00.608.027 I llama_perf_context_print:    graphs reused =          0

real    0m0.864s
user    0m0.399s
sys     0m0.355s
  - rerank score 0 @ 0.171 OK
  - rerank score 1 @ 0.169 OK
  - rerank score 2 @ 0.189 OK

I am not sure what's going on. The 3 scores are the same, but the test passed on my machine and failed in CI.

@am17an (Collaborator) commented Oct 30, 2025

Try running it through compute-sanitizer

@JohannesGaessler (Collaborator) left a comment

In terms of memory bandwidth for FP16 and BF16, 32*2=64 bytes is still suboptimal vs. the 64*2=128 bytes that you could be transferring. The shared memory banks similarly have the issue where they are best suited for transfers of 4, 8, or 16 bytes. But then the handling of the 2 byte datatypes becomes more tricky. In mma.cuh there is a function ggml_cuda_movmatrix that you can use to transpose a matrix. To be clear, what I'm suggesting is optional and we can also move towards merging the kernel as-is.
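
For illustration only, here is one way the wider transfers could look for 2-byte types (an assumed sketch, not code from this PR and not using ggml_cuda_movmatrix): each thread moves one half2 (4 bytes), so a warp issues 128-byte global transactions and every shared-memory access is 4 bytes wide; the per-row padding mitigates bank conflicts on the transposed reads. For simplicity it assumes rows and cols are multiples of the tile size.

```cuda
#include <cstdint>
#include <cuda_fp16.h>

#define TILE       64
#define BLOCK_ROWS  8   // launch with dim3 block(TILE/2, BLOCK_ROWS) = (32, 8)

__global__ void cpy_transpose_f16_half2(const __half * __restrict__ src,
                                        __half * __restrict__ dst,
                                        int rows, int cols) {  // src: rows x cols, row-major
    // Tile stored as half2 so every shared-memory access is 4 bytes wide;
    // +1 half2 of padding per row mitigates bank conflicts on the transposed reads.
    __shared__ half2 tile[TILE][TILE/2 + 1];

    const int x = blockIdx.x*TILE + threadIdx.x*2;            // source column of the low half
    for (int r = threadIdx.y; r < TILE; r += BLOCK_ROWS) {
        const int y = blockIdx.y*TILE + r;                    // source row
        tile[r][threadIdx.x] = *(const half2 *)(src + (int64_t) y*cols + x);  // 128 B per warp
    }
    __syncthreads();

    const int tx = blockIdx.y*TILE + threadIdx.x*2;           // dst column (= source row)
    for (int r = threadIdx.y; r < TILE; r += BLOCK_ROWS) {
        const int ty = blockIdx.x*TILE + r;                   // dst row (= source column)
        // The two output halves come from the same source column (r) but
        // adjacent source rows (tx, tx+1); pick them out of the packed tile.
        const half2  a  = tile[threadIdx.x*2 + 0][r/2];
        const half2  b  = tile[threadIdx.x*2 + 1][r/2];
        const __half lo = (r & 1) ? __high2half(a) : __low2half(a);
        const __half hi = (r & 1) ? __high2half(b) : __low2half(b);
        *(half2 *)(dst + (int64_t) ty*rows + tx) = __halves2half2(lo, hi);    // 128 B per warp
    }
}
```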

@bssrdf (Contributor, Author) commented Oct 30, 2025

In terms of memory bandwidth for FP16 and BF16, 32*2=64 bytes is still suboptimal vs. the 64*2=128 bytes that you could be transferring. The shared memory banks similarly have the issue where they are best suited for transfers of 4, 8, or 16 bytes. But then the handling of the 2 byte datatypes becomes more tricky. In mma.cuh there is a function ggml_cuda_movmatrix that you can use to transpose a matrix. To be clear, what I'm suggesting is optional and we can also move towards merging the kernel as-is.

@JohannesGaessler, thanks for the comments. I realized the logic for triggering the transposed copy is not right; I am working on a more general way of copying. The bottom line is to avoid uncoalesced access as much as possible, since it really reduces the bandwidth.
