llama: Fused QKV multiplication #16813
Conversation
This PR adds fused QKV multiplication
| I pulled this change but I don't think I'm seeing any change in the output from ggml_backend_sched_print_assignments, so I don't think it's kicking in. I'm using Qwen_Qwen3-30B-A3B-Q2_K.gguf. Is there anything I need to do to enable it? | 
| @jeffbolznv I only tested  | 
| I grabbed the Q4_0 model and I do see the combined weights with it. I think the Q2_K model may not have had matching types for the weights. For the Vulkan backend, I see maybe a couple percent gain for pp512, maybe less than a percent gain for tg128 (not sure if it's just noise). I think Vulkan is already getting most of the benefits of this from reordering and scheduling optimizations in the backend, and those optimizations work even with mismatching types. I haven't enabled all of those optimizations for pp (I initially saw some slowdowns and didn't pursue it further). | 
| @jeffbolznv currently the CUDA backend does not do scheduling the way Vulkan does. From what I understand, we would need to create separate streams in CUDA graphs, and we currently only do 1 stream (unless doing …). Nevertheless, this PR would be useful everywhere we don't have graph optimizations yet, and a couple percent of PP is also not undesirable :) |
| On M2 Ultra I don't see much difference from this change (if anything, it's not better): ./scripts/compare-commits.sh master pr/16813 llama-bench -m ./models/qwen3-30b-a3b-coder/ggml-model-q4_0.gguf -m ./models/qwen3-8b-base/ggml-model-f16.gguf -t 1 -fa 1 -b 16384 -ub 2048 -p 512,2048 -n 64 -mmp 0
 The Metal backend does have the graph optimization for making the Q, K and V multiplications run concurrently, so it's likely expected to not see gains here.
 This seems like the better approach. |
| Actually, from what I see, a lot of quants use different types for the Q, K and V weights. So it might not be super beneficial to do this anyway |
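For reference, the precondition discussed above (matching weight types) could be sketched roughly as follows. This is a hypothetical illustration, not code from this PR; numpy dtypes stand in for the ggml tensor types, and the shapes are made up:

```python
import numpy as np

def can_fuse_qkv(w_q, w_k, w_v):
    # Fusion concatenates the three weights into one matrix, so they must
    # share the same storage type and the same input dimension.
    same_type  = w_q.dtype == w_k.dtype == w_v.dtype
    same_input = w_q.shape[1] == w_k.shape[1] == w_v.shape[1]
    return same_type and same_input

w_q = np.zeros((4096, 4096), dtype=np.float16)
w_k = np.zeros((1024, 4096), dtype=np.float16)
w_v = np.zeros((1024, 4096), dtype=np.float32)   # mismatched type
print(can_fuse_qkv(w_q, w_k, w_v))               # False -> keep 3 separate GEMMs
```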
This is a draft PR for comments about QKV fusion. Performance-wise, I checked CUDA and it seems to give roughly a ~4-5% gain (in both PP and TG) for Qwen3 models.
This tries to merge the weights together to form a single GEMM, rather than 3 separate ones. The main drawback of this approach is that there might be other tensors in between the Q, K and V weights (like LoRAs), which would break mmap (as @slaren pointed out). However, doing this on the backend is a much more involved change which might not be worth the benefit. For now I'm thinking the best way forward is a flag for convert_hf_to_gguf.py: --fuse-qkv-weights can be provided to take this path. IMO the QKV weights should always live together in the GGUF file without exceptions, as they're naturally used together.
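To make the idea concrete, here is a minimal numpy sketch of the fusion (illustrative only, not the actual llama.cpp implementation; all names and sizes are made up): the Q, K and V weights are concatenated along the output dimension, the projection becomes one GEMM, and the result is split back into the three tensors.

```python
import numpy as np

d_model, n_q, n_kv = 64, 64, 16
rng = np.random.default_rng(0)
x   = rng.standard_normal((8, d_model), dtype=np.float32)   # 8 tokens
w_q = rng.standard_normal((n_q,  d_model), dtype=np.float32)
w_k = rng.standard_normal((n_kv, d_model), dtype=np.float32)
w_v = rng.standard_normal((n_kv, d_model), dtype=np.float32)

# Three separate GEMMs (the current path).
q, k, v = x @ w_q.T, x @ w_k.T, x @ w_v.T

# Fused path: one GEMM over the concatenated weight, then split the output.
w_qkv = np.concatenate([w_q, w_k, w_v], axis=0)
qkv   = x @ w_qkv.T
q2, k2, v2 = np.split(qkv, [n_q, n_q + n_kv], axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

The split can be implemented as views into the fused output, so the fused path trades three small GEMM launches for a single larger one.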
Some performance numbers on a 4090:
Without fusion
With fusion