【Hackathon 9th No.93】Add Minimax-m1 for FastDeploy (not yet precision-aligned) #4629

ZhijunLStudio · 2025-10-28T12:31:32Z

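Background

This pull request adds initial support for the MiniMax-M1 model to FastDeploy. The model uses a hybrid architecture that combines linear attention layers with standard grouped-query attention (GQA) layers. The submission includes the model definition, custom Triton kernels for linear attention, and the integration into the model runner.

The main work so far is completing the functional implementation of the model and aligning its numerical accuracy with the vLLM implementation.

Changes

Model definition: Added fastdeploy/model_executor/models/minimax_m1.py, which defines the model structure and forward pass.
Custom kernels: Added minimax_mamba_ops.py and minimax_mamba_kernels.py under ops/triton_ops/ to support the Mamba-like linear attention mechanism.
RoPE integration: Modified rotary_embedding.py to apply GLM-style rotary position embedding (RoPE) to the GQA layers in MiniMax-M1.
State cache: Updated gpu_model_runner.py and forward_meta.py to support and manage the state cache required by the linear attention layers (linear_attn_caches).
Config update: Modified config.py to add the model-related configuration options.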
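
The reason linear attention layers need a per-sequence state cache (the linear_attn_caches item above) can be illustrated with a small recurrence. The following is a minimal pure-Python sketch, assuming a single head and a scalar decay factor; the function names are hypothetical and the Triton kernels in this PR may use a different formulation.

```python
# Sketch only: single-head causal linear attention in recurrent form.
# Assumptions (not from the PR): scalar decay, no normalization, list-based math.

def outer(k, v):
    """Outer product of k (length d_k) and v (length d_v) as a nested list."""
    return [[ki * vj for vj in v] for ki in k]


def linear_attention_step(state, q, k, v, decay=1.0):
    """Consume one token: S_t = decay * S_{t-1} + k_t^T v_t, o_t = q_t S_t."""
    kv = outer(k, v)
    new_state = [
        [decay * s + u for s, u in zip(s_row, u_row)]
        for s_row, u_row in zip(state, kv)
    ]
    out = [
        sum(q[i] * new_state[i][j] for i in range(len(q)))
        for j in range(len(v))
    ]
    return new_state, out


def run_with_cache(qs, ks, vs, decay=1.0):
    """Process a sequence token by token, carrying only the d_k x d_v state."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outs = []
    for q, k, v in zip(qs, ks, vs):
        state, out = linear_attention_step(state, q, k, v, decay)
        outs.append(out)
    return outs


def linear_attention_full(qs, ks, vs, decay=1.0):
    """Reference without a cache: revisit every earlier token for each output."""
    outs = []
    for t, q in enumerate(qs):
        out = [0.0] * len(vs[0])
        for s in range(t + 1):
            w = (decay ** (t - s)) * sum(qi * ki for qi, ki in zip(q, ks[s]))
            out = [o + w * vj for o, vj in zip(out, vs[s])]
        outs.append(out)
    return outs
```

Both functions compute the same outputs, but the cached version only needs the fixed-size state from the previous step, which is the kind of tensor a model runner has to keep between decoding steps.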

Usage

The model was tested on a single machine with 8 GPUs.

from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

model_name_or_path = "/home/aistudio/config_folder"

# Sampling hyperparameters
sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=8, load_choices="default_v1")
output = llm.generate(prompts="who are you？", use_tqdm=True, sampling_params=sampling_params)

print(output)

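Accuracy testing

Accuracy is aligned with vLLM layer by layer. The model has 80 layers; this round of debugging focuses on the first 8 (7 linear attention layers + 1 GQA layer).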
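
To make the layer layout concrete, here is a small sketch of a layer-type schedule. It assumes that the "7 linear attention + 1 GQA" pattern of the first 8 layers repeats across all 80 layers and that layers are indexed from 0; the PR description only states the pattern for the first 8 layers, so treat the repetition as an assumption and check it against the model config.

```python
NUM_LAYERS = 80  # stated in the PR description
PERIOD = 8       # assumption: the 7 + 1 pattern repeats every 8 layers


def layer_type(layer_idx: int) -> str:
    """Return the assumed attention type for a 0-based layer index."""
    return "gqa" if (layer_idx + 1) % PERIOD == 0 else "linear_attention"


layer_types = [layer_type(i) for i in range(NUM_LAYERS)]

# Print the first block of layers, the part this debugging round focuses on.
for idx, kind in enumerate(layer_types[:PERIOD]):
    print(idx, kind)
```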

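Part 1: Linear attention layers (layers 0-7) accuracy alignment

The outputs of the first 7 linear attention layers closely match vLLM. The log below is from layer 7 (the last one, the GQA layer) and shows that the Q, K, and V tensors largely agree with vLLM before the attention computation.

Layer 7: QKV tensors before RoPE/attention

Framework	Tensor	Mean	Std	Shape	Note
FastDeploy	`After_QKV_Proj_Combined`	`-0.092525`	`1.627345`	`[4, 2560]`	✅ Aligned
vLLM	`After_QKV_Proj_Combined`	`-0.092362`	`1.624597`	`[4, 2560]`	(baseline)
FastDeploy	`Q_BeforeRoPE`	`-0.096660`	`1.136759`	`[4, 2048]`	✅ Aligned
vLLM	`Q_BeforeRoPE`	`-0.096524`	`1.135294`	`[4, 2048]`	(baseline)
FastDeploy	`K_BeforeRoPE`	`-0.151537`	`4.017039`	`[4, 256]`	✅ Aligned
vLLM	`K_BeforeRoPE`	`-0.151061`	`4.009206`	`[4, 256]`	(baseline)

This confirms that the linear attention implementation and the computation in all preceding layers are correct.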
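
As a rough way to quantify "aligned", the reported statistics can be compared by relative difference. The sketch below uses only the mean and standard deviation values from the table above; the helper names and the 0.5% threshold are illustrative assumptions, not part of the PR.

```python
def rel_diff(a: float, b: float) -> float:
    """Relative difference of a with respect to the baseline b."""
    return abs(a - b) / max(abs(b), 1e-12)


# (mean, std) pairs copied from the table: (FastDeploy, vLLM baseline)
reported = {
    "After_QKV_Proj_Combined": ((-0.092525, 1.627345), (-0.092362, 1.624597)),
    "Q_BeforeRoPE": ((-0.096660, 1.136759), (-0.096524, 1.135294)),
    "K_BeforeRoPE": ((-0.151537, 4.017039), (-0.151061, 4.009206)),
}


def max_rel_diff(fd, ref) -> float:
    """Largest relative difference across the (mean, std) pair."""
    return max(rel_diff(x, y) for x, y in zip(fd, ref))


THRESHOLD = 0.005  # hypothetical tolerance of 0.5%

for name, (fd, ref) in reported.items():
    worst = max_rel_diff(fd, ref)
    status = "ok" if worst < THRESHOLD else "mismatch"
    print(f"{name}: max relative diff = {worst:.4%} ({status})")
```

Matching summary statistics is a necessary check rather than a sufficient one; an element-wise comparison of the dumped tensors would be a stronger test if both frameworks can export them.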

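Part 2: GQA layer (the 8th layer) accuracy mismatch

The problem appears at the 8th layer, which is the first GQA layer in the model.

Problem description: After the QKV projection in the 8th layer, the tensor values are correct and non-zero. However, after GLM-style rotary position embedding (RoPE) is applied to the Q and K tensors, both tensors become all zeros. This directly causes the subsequent attention computation to produce incorrect output.
Initial hypothesis: The problem is most likely in the implementation of GlmRotaryEmbedding, which misbehaves under the specific conditions of the GQA layers in MiniMax-M1. The input tensors entering the RoPE function are correct, but the output is not.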
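
One property that may help narrow this down: RoPE rotates each pair of channels by an angle, and a rotation preserves the norm of the pair because cos² + sin² = 1. A non-zero input therefore cannot become all zeros through a correct rotation. The sketch below is a pure-Python reference using the rotate-half pairing, which is an assumption and may differ from the channel layout used by GlmRotaryEmbedding; the norm-preservation check holds for any pairing.

```python
import math


def rope_rotate_half(x, position, base=10000.0):
    """Reference RoPE for one head vector x (even length) at one position.

    Assumption: rotate-half pairing (channel i with channel i + d/2).
    """
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = x[i], x[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


def check_rope_output(x_in, x_out, tol=1e-3):
    """Return True if the output norm matches the input norm within tol."""
    n_in, n_out = l2_norm(x_in), l2_norm(x_out)
    return abs(n_in - n_out) <= tol * max(n_in, 1e-12)


q = [0.5, -1.0, 2.0, 0.25]
for pos in range(4):
    rotated = rope_rotate_half(q, pos)
    print(pos, check_rope_output(q, rotated))
```

If the same check fails on the real Q and K tensors, possible places to look (as hypotheses, not a diagnosis) are the cos/sin tables passed to the kernel, for example whether they are populated for the positions being used, and whether the kernel actually writes its output buffer.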

paddle-bot · 2025-10-28T12:31:40Z

Thanks for your contribution!

paddle-bot bot added the contributor External developers label Oct 28, 2025

luotao1 mentioned this pull request Oct 29, 2025

【Hackathon 9th】Open Source Contribution Individual Challenge PaddlePaddle/Paddle#74773

Open

ZhijunLStudio added 10 commits October 31, 2025 16:21

first commit

40268db

remove large file

bec4fbe

remove some print

3b64896

remove rotary print

b6516b6

add 2_After_Attention print

e7ec2ab

Delete Chinese comments

a2598e9

modify model_name_or_path

6997389

Delete a space

92c71d4

Executable code

891544c

Temporary code storage

e612bd6

ZhijunLStudio force-pushed the minimax-1023 branch from f41fd23 to e612bd6 Compare October 31, 2025 08:23
