llama: Fused QKV multiplication #16813
Conversation
This PR adds fused QKV multiplication
| I pulled this change but I don't think I'm seeing any change in the output from ggml_backend_sched_print_assignments, so I don't think it's kicking in. I'm using Qwen_Qwen3-30B-A3B-Q2_K.gguf. Is there anything I need to do to enable it? | 
| @jeffbolznv I only tested  | 
| I grabbed the Q4_0 model and I do see the combined weights with it. I think the Q2_K model may not have had matching types for the weights. For the Vulkan backend, I see maybe a couple percent gain for pp512, maybe less than a percent gain for tg128 (not sure if it's just noise). I think Vulkan is already getting most of the benefits of this from reordering and scheduling optimizations in the backend, and those optimizations work even with mismatching types. I haven't enabled all of those optimizations for pp (I initially saw some slowdowns and didn't pursue it further). | 
| @jeffbolznv currently the CUDA backend does not do scheduling the way Vulkan does. From what I understand, we would need to create separate streams in CUDA graphs, and we currently only do 1 stream (unless doing …). Nevertheless, this PR would be useful everywhere we don't have graph optimizations yet, and a couple percent of PP is also not undesirable :) |
| On M2 Ultra I don't see much difference from this change (if anything, it's not better): ./scripts/compare-commits.sh master pr/16813 llama-bench -m ./models/qwen3-30b-a3b-coder/ggml-model-q4_0.gguf -m ./models/qwen3-8b-base/ggml-model-f16.gguf -t 1 -fa 1 -b 16384 -ub 2048 -p 512,2048 -n 64 -mmp 0
 The Metal backend does have the graph optimization for making the Q, K and V multiplications run concurrently, so it's likely expected to not see gains here.
 This seems like the better approach. |
| Actually, from what I see, a lot of quants use different types for the Q, K and V weights. So it might not be super beneficial to do this anyway |
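For reference, the precondition discussed above (matching weight types) could be sketched roughly as follows. This is a hypothetical illustration, not code from this PR; numpy dtypes stand in for the ggml tensor types, and the shapes are made up:

```python
import numpy as np

def can_fuse_qkv(w_q, w_k, w_v):
    # Fusion concatenates the three weights into one matrix, so they must
    # share the same storage type and the same input dimension.
    same_type  = w_q.dtype == w_k.dtype == w_v.dtype
    same_input = w_q.shape[1] == w_k.shape[1] == w_v.shape[1]
    return same_type and same_input

w_q = np.zeros((4096, 4096), dtype=np.float16)
w_k = np.zeros((1024, 4096), dtype=np.float16)
w_v = np.zeros((1024, 4096), dtype=np.float32)   # mismatched type
print(can_fuse_qkv(w_q, w_k, w_v))               # False -> keep 3 separate GEMMs
```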
This is a draft PR for comments about QKV fusion. Performance-wise, I checked CUDA and it seems to give roughly a ~4-5% gain (in both PP and TG) for Qwen3 models.
This tries to merge the weights together to form a single GEMM, rather than 3 separate ones. The main drawback of this approach is that there might be other tensors in between the Q, K and V weights (like LoRAs), which would break mmap (as @slaren pointed out). However, doing this on the backend is a much more involved change which might not be worth the benefit. For now I'm thinking the best way forward is a flag for convert_hf_to_gguf.py: --fuse-qkv-weights can be provided to take this path. IMO the QKV weights should always live together in the GGUF file without exceptions, as they're naturally used together.
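To make the idea concrete, here is a minimal numpy sketch of the fusion (illustrative only, not the actual llama.cpp implementation; all names and sizes are made up): the Q, K and V weights are concatenated along the output dimension, the projection becomes one GEMM, and the result is split back into the three tensors.

```python
import numpy as np

d_model, n_q, n_kv = 64, 64, 16
rng = np.random.default_rng(0)
x   = rng.standard_normal((8, d_model), dtype=np.float32)   # 8 tokens
w_q = rng.standard_normal((n_q,  d_model), dtype=np.float32)
w_k = rng.standard_normal((n_kv, d_model), dtype=np.float32)
w_v = rng.standard_normal((n_kv, d_model), dtype=np.float32)

# Three separate GEMMs (the current path).
q, k, v = x @ w_q.T, x @ w_k.T, x @ w_v.T

# Fused path: one GEMM over the concatenated weight, then split the output.
w_qkv = np.concatenate([w_q, w_k, w_v], axis=0)
qkv   = x @ w_qkv.T
q2, k2, v2 = np.split(qkv, [n_q, n_q + n_kv], axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

The split can be implemented as views into the fused output, so the fused path trades three small GEMM launches for a single larger one.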
Some performance numbers on a 4090:
Without fusion
With fusion