
Conversation

@bssrdf (Contributor) commented Oct 29, 2025

While working on #15805, I noticed the current cpy_flt kernel has significant uncoalesced global memory access.
[screenshot: cpy_op]

This is particularly bad if one tries to make a transposed tensor contiguous via cur = ggml_cont(ctx, ggml_transpose(ctx, cur));. Some simple benchmarks with test-backend-ops perf -o CPY on a 4090:

Master, using permute to simulate the transpose:

 CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0]):                132 runs -  7642.58 us/run -  1572864 kB/run -  200.73 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0]):                172 runs -  6662.13 us/run -   786432 kB/run -  113.89 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0]):                344 runs -  2966.67 us/run -   786432 kB/run -  255.75 GB/s

Using shared memory is a common way to achieve coalesced global memory access. I implemented another copy kernel and got a 3x-4x boost.
This PR, with the src tensor transposed:

 CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                484 runs -  2119.14 us/run -  1572864 kB/run -  723.92 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                817 runs -  1276.55 us/run -   786432 kB/run -  594.35 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):               1118 runs -   917.03 us/run -   786432 kB/run -  827.37 GB/s
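
For reference, here is a minimal sketch of the shared-memory tiling idea (illustrative only, not the exact kernel in this PR; the kernel name, tile sizes, and the rows/cols layout are assumptions): each block stages a 32x32 tile in shared memory so that both the global load and the transposed global store are coalesced, and the extra padding column keeps the transposed shared-memory reads free of bank conflicts.

```cuda
#include <cstdint>

// Illustrative shared-memory transposed copy for FP32.
// Assumes src is a rows x cols row-major matrix and dst is its cols x rows transpose.
#define TILE       32
#define BLOCK_ROWS  8   // launch with dim3 block(TILE, BLOCK_ROWS)

__global__ void cpy_transpose_f32_tiled(const float * __restrict__ src,
                                        float * __restrict__ dst,
                                        int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];   // +1 column of padding avoids bank conflicts

    const int x = blockIdx.x * TILE + threadIdx.x;        // source column
    for (int r = threadIdx.y; r < TILE; r += BLOCK_ROWS) {
        const int y = blockIdx.y * TILE + r;              // source row
        if (x < cols && y < rows) {
            tile[r][threadIdx.x] = src[(int64_t) y * cols + x];       // coalesced load
        }
    }
    __syncthreads();

    const int tx = blockIdx.y * TILE + threadIdx.x;       // dst column (= source row)
    for (int r = threadIdx.y; r < TILE; r += BLOCK_ROWS) {
        const int ty = blockIdx.x * TILE + r;             // dst row (= source column)
        if (tx < rows && ty < cols) {
            dst[(int64_t) ty * rows + tx] = tile[threadIdx.x][r];     // coalesced store
        }
    }
}
```

A grid of (ceil(cols/TILE), ceil(rows/TILE)) blocks covers the whole matrix; without the staging tile, either the load or the store would be strided by the row length and each warp would touch 32 separate cache lines.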

Currently I use the src tensor's op == GGML_OP_TRANSPOSE to let ggml_cpy_flt_cuda pick the customized transposed copy. I am not sure if there is a better way to make this choice; your suggestions are welcome.
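
To make that concrete, here is a hypothetical sketch of the dispatch choice (not the PR's actual code; the helper name is made up). The first variant is the op-based check described above; the second is one possible stride-based alternative, since a transposed 2D view of a contiguous tensor has its dim-0 stride larger than its dim-1 stride:

```cpp
#include "ggml.h"

// Hypothetical helper, not part of this PR: decide whether ggml_cpy_flt_cuda
// should take the transposed-copy path for a given source tensor.
static bool cpy_src_is_transposed(const ggml_tensor * src) {
    // Variant 1 (as described above): key off the source tensor's op.
    // return src->op == GGML_OP_TRANSPOSE;

    // Variant 2 (one possible alternative): look at the strides instead;
    // a transposed view has nb[0] (innermost stride) larger than nb[1].
    return src->nb[0] > src->nb[1];
}
```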

…void uncoalesced access; test cases also added showing improved memory bandwidth
@bssrdf requested a review from slaren as a code owner on October 29, 2025 12:54
@CISC (Collaborator) commented Oct 29, 2025

Any reason for not including BF16 as well?

@bssrdf (Contributor, Author) commented Oct 29, 2025

Any reason for not including BF16 as well?

Just added support for BF16. The quantized types may also have this problem, and the fix there is not as straightforward as for FP32 etc.

@CISC (Collaborator) commented Oct 29, 2025

The quantized types may also have this problem, and the fix there is not as straightforward as for FP32 etc.

Probably only relevant for quantized KV cache (perhaps not even then, unsure), so not a big issue.

@github-actions bot added the testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 29, 2025
@am17an (Collaborator) commented Oct 29, 2025

@bssrdf this is failing the CI, do you mind taking a look?

@bssrdf (Contributor, Author) commented Oct 29, 2025

@bssrdf this is failing the CI, do you mind taking a look?

Yeah, something is wrong. Using the OP equality to make a decision is likely not robust. Will look into it. Thanks.

Edit: I ran ci/run.sh locally on my machine (WSL2 Ubuntu). The rerank test passed.

rerank score 0:    0.171
rerank score 1:    0.169
rerank score 2:    0.189

0.00.608.024 I llama_perf_context_print:        load time =     321.37 ms
0.00.608.025 I llama_perf_context_print: prompt eval time =     105.71 ms /    62 tokens (    1.70 ms per token,   586.52 tokens per second)
0.00.608.026 I llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
0.00.608.027 I llama_perf_context_print:       total time =     106.26 ms /    63 tokens
0.00.608.027 I llama_perf_context_print:    graphs reused =          0

real    0m0.864s
user    0m0.399s
sys     0m0.355s
  - rerank score 0 @ 0.171 OK
  - rerank score 1 @ 0.169 OK
  - rerank score 2 @ 0.189 OK

I am not sure what's going on. The 3 scores are the same, but the test passed on my machine and failed in CI.

@am17an (Collaborator) commented Oct 30, 2025

Try running it through compute-sanitizer

@JohannesGaessler (Collaborator) left a comment

In terms of memory bandwidth for FP16 and BF16, 32*2=64 bytes is still suboptimal vs. the 64*2=128 bytes that you could be transferring. The shared memory banks similarly have the issue where they are best suited for transfers of 4, 8, or 16 bytes. But then the handling of the 2 byte datatypes becomes more tricky. In mma.cuh there is a function ggml_cuda_movmatrix that you can use to transpose a matrix. To be clear, what I'm suggesting is optional and we can also move towards merging the kernel as-is.
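
For illustration only, here is one way the wider transfers could look for 2-byte types (an assumed sketch, not code from this PR and not using ggml_cuda_movmatrix): each thread moves one half2 (4 bytes), so a warp issues 128-byte global transactions and every shared-memory access is 4 bytes wide; the per-row padding mitigates bank conflicts on the transposed reads. For simplicity it assumes rows and cols are multiples of the tile size.

```cuda
#include <cstdint>
#include <cuda_fp16.h>

#define TILE       64
#define BLOCK_ROWS  8   // launch with dim3 block(TILE/2, BLOCK_ROWS) = (32, 8)

__global__ void cpy_transpose_f16_half2(const __half * __restrict__ src,
                                        __half * __restrict__ dst,
                                        int rows, int cols) {  // src: rows x cols, row-major
    // Tile stored as half2 so every shared-memory access is 4 bytes wide;
    // +1 half2 of padding per row mitigates bank conflicts on the transposed reads.
    __shared__ half2 tile[TILE][TILE/2 + 1];

    const int x = blockIdx.x*TILE + threadIdx.x*2;            // source column of the low half
    for (int r = threadIdx.y; r < TILE; r += BLOCK_ROWS) {
        const int y = blockIdx.y*TILE + r;                    // source row
        tile[r][threadIdx.x] = *(const half2 *)(src + (int64_t) y*cols + x);  // 128 B per warp
    }
    __syncthreads();

    const int tx = blockIdx.y*TILE + threadIdx.x*2;           // dst column (= source row)
    for (int r = threadIdx.y; r < TILE; r += BLOCK_ROWS) {
        const int ty = blockIdx.x*TILE + r;                   // dst row (= source column)
        // The two output halves come from the same source column (r) but
        // adjacent source rows (tx, tx+1); pick them out of the packed tile.
        const half2  a  = tile[threadIdx.x*2 + 0][r/2];
        const half2  b  = tile[threadIdx.x*2 + 1][r/2];
        const __half lo = (r & 1) ? __high2half(a) : __low2half(a);
        const __half hi = (r & 1) ? __high2half(b) : __low2half(b);
        *(half2 *)(dst + (int64_t) ty*rows + tx) = __halves2half2(lo, hi);    // 128 B per warp
    }
}
```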

@bssrdf (Contributor, Author) commented Oct 30, 2025

In terms of memory bandwidth for FP16 and BF16, 32*2=64 bytes is still suboptimal vs. the 64*2=128 bytes that you could be transferring. The shared memory banks similarly have the issue where they are best suited for transfers of 4, 8, or 16 bytes. But then the handling of the 2 byte datatypes becomes more tricky. In mma.cuh there is a function ggml_cuda_movmatrix that you can use to transpose a matrix. To be clear, what I'm suggesting is optional and we can also move towards merging the kernel as-is.

@JohannesGaessler, thanks for the comments. I realized the logic for triggering the transposed copy is not right; I am working on a more general way of copying. The bottom line is to avoid uncoalesced access as much as possible, since it really reduces the bandwidth.
