
@am17an (Collaborator) commented Oct 28, 2025

This is a draft PR for comments about QKV fusion. Performance-wise, I checked CUDA and it seems to give roughly a 4-5% performance gain (in PP and TG) for Qwen3 models.

This tries to merge the weights together so the three projections become a single GEMM rather than three separate ones. The main drawback of this approach is that there might be other tensors in between the Q, K, V weights (like LoRAs), which would break mmap (as @slaren pointed out). However, doing this in the backend is a much more involved change which might not be worth the benefit. For now I'm thinking the best way forward would be:

  1. Add the wqkv, bqkv tensors to future GGUFs using an option in convert_hf_to_gguf.py
  2. In case the user wants this and knows the weights are stored together, a flag --fuse-qkv-weights can be provided to take this path.

IMO the QKV weights should always live together in the GGUF file, without exceptions, as they are naturally used together.
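
To make the idea concrete, the fused path boils down to roughly this (simplified sketch, not the exact code in this PR; variable names and shapes are illustrative):

```cpp
// Simplified sketch of the fused path. Assumes ctx, cur, n_tokens, n_embd_q
// (query rows) and n_embd_gqa (key/value rows) are in scope, and that wqkv holds
// the Q, K and V weight rows stacked into a single tensor.
//
// cur  : [n_embd, n_tokens]
// wqkv : [n_embd, n_embd_q + 2*n_embd_gqa]
struct ggml_tensor * qkv = ggml_mul_mat(ctx, wqkv, cur); // [n_embd_q + 2*n_embd_gqa, n_tokens]

// slice the fused result back into Q, K and V (views share the parent's row stride)
struct ggml_tensor * Qcur = ggml_view_2d(ctx, qkv, n_embd_q,   n_tokens, qkv->nb[1],
        0);
struct ggml_tensor * Kcur = ggml_view_2d(ctx, qkv, n_embd_gqa, n_tokens, qkv->nb[1],
        n_embd_q * ggml_element_size(qkv));
struct ggml_tensor * Vcur = ggml_view_2d(ctx, qkv, n_embd_gqa, n_tokens, qkv->nb[1],
        (n_embd_q + n_embd_gqa) * ggml_element_size(qkv));
```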

Some performance numbers on a 4090:

Without fusion

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CUDA | 99 | pp512 | 3846.64 ± 15.30 |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CUDA | 99 | pp1024 | 5651.13 ± 21.24 |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CUDA | 99 | pp2048 | 7024.90 ± 14.43 |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CUDA | 99 | pp4096 | 7504.26 ± 9.66 |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CUDA | 99 | tg128 | 151.89 ± 0.10 |

With fusion

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CUDA | 99 | pp512 | 3933.52 ± 20.74 |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CUDA | 99 | pp1024 | 5768.19 ± 19.00 |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CUDA | 99 | pp2048 | 7430.25 ± 14.64 |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CUDA | 99 | pp4096 | 8022.90 ± 13.50 |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CUDA | 99 | tg128 | 159.59 ± 0.20 |

This PR adds fused QKV multiplication
@jeffbolznv (Collaborator)

I pulled this change but I don't think I'm seeing any change in the output from ggml_backend_sched_print_assignments, so I don't think it's kicking in. I'm using Qwen_Qwen3-30B-A3B-Q2_K.gguf. Is there anything I need to do to enable it?

@am17an (Collaborator, Author) commented Oct 28, 2025

@jeffbolznv I only tested qwen3-30ba3b Q4_0 and qwen3-8b Q4_0; let me try with a Q2_K quant.

@jeffbolznv (Collaborator)

I grabbed the Q4_0 model and I do see the combined weights with it. I think the Q2_K model may not have had matching types for the weights.

For the Vulkan backend, I see maybe a couple percent gain for pp512, maybe less than a percent gain for tg128 (not sure if it's just noise). I think Vulkan is already getting most of the benefits of this from reordering and scheduling optimizations in the backend, and those optimizations work even with mismatching types. I haven't enabled all of those optimizations for pp (I initially saw some slowdowns and didn't pursue it further).

@am17an (Collaborator, Author) commented Oct 28, 2025

@jeffbolznv currently the CUDA backend does not do scheduling the way Vulkan does. From what I understand, we would need to create separate streams in CUDA graphs, and we currently only use one stream (unless running with -sm row). Long term that should be the goal for the CUDA backend as well; I think @slaren is working on that general refactor.

Nevertheless, this PR would be useful everywhere we don't have graph optimizations yet, and a couple of percent in PP is also not undesirable :)

@am17an changed the title from "Fused QKV multiplication" to "llama: Fused QKV multiplication" on Oct 29, 2025
@ggerganov (Member)

On M2 Ultra I don't see much difference from this change (if anything, it's not better):

./scripts/compare-commits.sh master pr/16813 llama-bench -m ./models/qwen3-30b-a3b-coder/ggml-model-q4_0.gguf -m ./models/qwen3-8b-base/ggml-model-f16.gguf -t 1 -fa 1 -b 16384 -ub 2048 -p 512,2048 -n 64 -mmp 0
| Model | Test | t/s master | t/s pr/16813 | Speedup |
| --- | --- | --- | --- | --- |
| qwen3 8B F16 | pp512 | 1521.59 | 1522.98 | 1.00 |
| qwen3 8B F16 | pp2048 | 1557.07 | 1556.93 | 1.00 |
| qwen3 8B F16 | tg64 | 43.05 | 42.90 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | pp512 | 2194.78 | 2215.86 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | pp2048 | 2529.42 | 2524.54 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | tg64 | 101.95 | 100.06 | 0.98 |

The Metal backend does have the graph optimization for making the Q, K and V multiplications run concurrently, so it's expected not to see gains here.

> 1. Add the wqkv, bqkv tensors to future GGUFs using an option in convert_hf_to_gguf.py

This seems like the better approach.

@am17an (Collaborator, Author) commented Oct 29, 2025

Actually, from what I see, a lot of quants use different quantization types for the Q, K and V weights, so it might not be super beneficial to do this anyway.
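
In ggml terms, the precondition boils down to roughly this (simplified sketch, not the exact check in this PR):

```cpp
// Simplified sketch: fusing only makes sense when the three projection weights
// share the same quantization type (and the same input dimension), which mixed
// quants such as Q2_K models often don't.
const bool can_fuse_qkv =
    wq->type  == wk->type  && wq->type  == wv->type  &&
    wq->ne[0] == wk->ne[0] && wq->ne[0] == wv->ne[0];
```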
