Add NPU (Ascend) backend support for INT4 weight-only quantization workflow #3172
base: main
Conversation
🔗 Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3172
Note: Links to docs will display an error until the docs builds have been completed.
❗ 2 Active SEVs: There are 2 currently active SEVs. If your PR is affected, please view them below.
❌ 5 New Failures: As of commit ca8f056 with merge base ca99c1c, the following jobs have failed.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
try:
    import torch_npu
except ImportError:
    torch_npu = None
```
PyTorch provides an autoload mechanism, so we do not need to import it explicitly.
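For illustration, a minimal sketch of what the test guard could look like when relying on the autoload mechanism instead of importing torch_npu explicitly; the helper name `_npu_available` is hypothetical, and `torch.accelerator` requires a recent PyTorch:

```python
import unittest

import torch


def _npu_available() -> bool:
    # Hypothetical helper: detect the NPU through PyTorch's accelerator API,
    # counting on autoload to register the torch_npu backend.
    return (
        torch.accelerator.is_available()
        and torch.accelerator.current_accelerator().type == "npu"
    )


@unittest.skipIf(not _npu_available(), "NPU not available")
class TestNPUGuardExample(unittest.TestCase):
    def test_device_visible(self):
        self.assertEqual(torch.accelerator.current_accelerator().type, "npu")
```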
```python
@unittest.skipIf(torch_npu is None, "torch_npu is not available")
@unittest.skipIf(not torch_npu.npu.is_available(), "NPU not available")
```
Suggested change:
```diff
- @unittest.skipIf(torch_npu is None, "torch_npu is not available")
- @unittest.skipIf(not torch_npu.npu.is_available(), "NPU not available")
+ @unittest.skipIf(torch.accelerator.current_accelerator(True).type == "npu" and torch.accelerator.is_available(), "NPU not available")
```
```python
@unittest.skipIf(
    version.parse(torch_npu.__version__) < version.parse("2.7.1rc1"),
    "Need torch_npu 2.7.1rc1+",
)
```
We can remove this because there is a strict version mapping between PyTorch and torch_npu.
```python
assert int_data.dtype == torch.int32, (
    f"torch_npu.npu_convert_weight_to_int4pack expects `int32` dtype"
)
```
Suggested change:
```diff
- f"torch_npu.npu_convert_weight_to_int4pack expects `int32` dtype"
+ f"torch.ops.npu.npu_convert_weight_to_int4pack expects `int32` dtype"
```
```python
assert int_data.shape[-1] % 8 == 0, (
    f"torch_npu.npu_convert_weight_to_int4pack expects last dim must be aligned to 8, but got {int_data.shape[-1]}"
)
```
Suggested change:
```diff
- f"torch_npu.npu_convert_weight_to_int4pack expects last dim must be aligned to 8,but got {int_data.shape[-1]}"
+ f"torch.ops.npu.npu_convert_weight_to_int4pack expects last dim must be aligned to 8, but got {int_data.shape[-1]}"
```
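For context, a rough sketch of the packing this alignment check exists for: eight 4-bit values fit into one int32 word, so the last dimension must be a multiple of 8. This is only an illustration, not the layout used by the torch_npu kernel:

```python
import torch


def pack_int4_to_int32(int_data: torch.Tensor) -> torch.Tensor:
    # Illustrative packing of eight int4 values per int32 word.
    assert int_data.dtype == torch.int32 and int_data.shape[-1] % 8 == 0
    nibbles = (int_data.to(torch.int64) & 0xF).reshape(*int_data.shape[:-1], -1, 8)
    packed = torch.zeros(nibbles.shape[:-1], dtype=torch.int64, device=int_data.device)
    for i in range(8):
        packed |= nibbles[..., i] << (4 * i)
    # Reinterpret the assembled 32 bits as a signed int32.
    packed = torch.where(packed >= 2**31, packed - 2**32, packed)
    return packed.to(torch.int32)  # shape (..., K // 8)
```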
|
Hi @jcaip @jerryzh168, please help to review it, thanks!
```python
    and torch.accelerator.is_available(),
    "NPU not available",
)
class Int4PlainInt32TensorNPU(TestCase):
```
Just curious, do we need NPUs to test this? I don't think we have any in CI.
Thanks for the PR @orangeH25 @fffrog!
The code looks good to me, but I'm curious how to best test this. It looks like we skip tests in CI because we don't have NPU devices. I believe NPU support was added to TorchTune as well; do you know how they test device-specific functionality there?
Also, just a heads up, most of the team is at PTC / Open Source AI Week in SF this week, so we might be a little slow in responding :)
please don't include device
```
int4 weight-only quantization on Ascend NPU backend (groupwise quantization only)

Tensor Attributes:
    qdata: (N, K/8), packed int4 weight, the data type is int32 here with 8*int4,
        the original dtype can be float16 or bfloat16
```
does this exactly align with Int4PlainInt32Tensor? if so, please merge with that tensor subclass
Hi @jcaip @jerryzh168, thanks for the review!
Yes, this case is actually pretty common in open-source projects. A typical approach is to set up downstream testing.
You mean that we should keep the entry logic in quant_api.py:

```python
elif int4_packing_format == Int4PackingFormat.PLAIN_INT32:
    new_weight = Int4PlainInt32Tensor.from_hp(
        weight,
        block_size,
    )
    return new_weight
```

and then handle the different backend implementations inside the tensor subclass:

```python
class Int4PlainInt32Tensor(TorchAOBaseTensor):
    ...

    @classmethod
    def from_hp(
        cls,
        w: torch.Tensor,
        block_size: List[int],
    ):
        if w.device.type == "xpu":
            return from_hp_xpu(cls, w, block_size)
        elif w.device.type == "npu":
            return from_hp_npu(cls, w, block_size)


implements = Int4PlainInt32Tensor.implements
implements_torch_function = Int4PlainInt32Tensor.implements_torch_function


@implements(aten.linear.default)
@implements_torch_function(torch.nn.functional.linear)
def _(func, types, args, kwargs):
    input_tensor, weight_tensor, bias = (
        args[0],
        args[1],
        args[2] if len(args) > 2 else None,
    )
    if input_tensor.device.type == "xpu":
        return linear_xpu(input_tensor, weight_tensor, bias)
    elif input_tensor.device.type == "npu":
        return linear_npu(input_tensor, weight_tensor, bias)
```

Did I get that right? Happy to hear any thoughts or suggestions you might have!
Yes that's correct
Got it, I will follow this approach, thanks!
Force-pushed 7808297 to ea2aa7a
Hi @jerryzh168 @jcaip, I’ve made those changes, please take a look, really appreciate it!
Hi @jerryzh168 @jcaip, could you please take another look when you have a moment? Thanks a lot!
A couple nits but looks good to me @orangeH25 @fffrog
Can we set up the downstream testing you mentioned before we merge this?
```python
y = torch.ops.npu.npu_weight_quant_batchmatmul(
    x=act_mat,
    weight=packed_weight.contiguous().transpose(-1, -2),
```
do we want to call contiguous() every time we do matmul? should we save the packed_weight in contiguous format instead to only do this once?
Thanks! Addressed: packed_weight is now made contiguous once when constructing the Int4PlainInt32Tensor.
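A minimal sketch of the change described here, assuming the packed weight is made contiguous once when the tensor subclass is constructed; the helper name `_prepack_npu_weight` is hypothetical:

```python
import torch


def _prepack_npu_weight(packed_weight: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper, run once at Int4PlainInt32Tensor construction time:
    # store the packed int4 weight contiguously so the linear path does not
    # pay for .contiguous() on every call.
    return packed_weight.contiguous()


# The linear path can then pass a cheap transposed view of the stored weight,
# e.g. torch.ops.npu.npu_weight_quant_batchmatmul(x=act_mat, weight=qdata.transpose(-1, -2), ...)
# without any further layout work.
```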
```python
elif input_tensor.device.type == "npu":
    return _linear_npu(input_tensor, weight_tensor, bias)
else:
    raise AssertionError(f"Int4PlainInt32Tensor does not support device '{input_tensor.device.type}' yet.")
```
nit: NotImplementedError or ValueError is better here
Fixed
```python
elif w.device.type == "npu":
    return _from_hp_npu(cls, w, block_size)
else:
    raise AssertionError(f"Int4PlainInt32Tensor does not support device '{w.device.type}' yet.")
```
nit: ValueError or NotImplementedError here.
Fixed
```python
assert int_data.shape[-1] % 8 == 0, (
    f"torch.ops.npu.npu_convert_weight_to_int4pack expects last dim must be aligned to 8, but got {int_data.shape[-1]}"
)
```
Will we ever run into a case where we have NPU support but this op is missing? Maybe in an earlier version of torch_npu? Should we throw a cleaner error message in that case?
It'd be good to add a comment here on where this op is defined and what version of torch_npu is needed.
Thanks for the reminder! Since torch and torch_npu versions are tightly coupled, I added the following at the beginning:

```python
# Require PyTorch 2.7.1+ for NPU backend ops and backward compatibility.
assert torch_version_at_least("2.7.1"), (
    "Need PyTorch 2.7.1+ for NPU backend op support."
)
```

Does this make the version requirement clear enough?
Let's be a tad more explicit; I want to make it clear it's PyTorch NPU >= 2.7.1 and not regular torch:

```python
assert (
    torch.accelerator.is_available()
    and torch.accelerator.current_accelerator().type == "npu"
    and torch_version_at_least("2.7.1")
), f"PyTorch NPU 2.7.1+ needed for int4 packing and matmul ops, {torch.__version__} found"
```
Sure! We’ll complete the downstream testing setup.
Awesome, thank you! If you have any benchmarking numbers you can share as well, that would be great :)
Thanks! We’ll add some benchmarking results soon.
Related to #3044
Summary
This PR adds NPU (Ascend) backend support for the INT4 weight-only quantization workflow.
It introduces a new tensor subclass, Int4PlainInt32TensorNPU, aligned with the existing Int4PlainInt32Tensor for the plain_int32 packing format.

Environment
Files changed

Modified
- torchao/quantization/__init__.py
- torchao/quantization/quant_api.py
- torchao/quantization/quantize_/workflows/__init__.py

Added
- torchao/quantization/quantize_/workflows/int4/int4_plain_int32_tensor_npu.py
- test/quantization/quantize_/workflows/int4/test_int4_plain_int32_tensor_npu.py

Implementation Overview
- Int4PlainInt32TensorNPU: enables NPU backend support for INT4 weight-only quantization.
- quant_api.py: extended for dispatch.
- __init__.py files: updated to ensure proper import and exposure.

Test Case
test/quantization/quantize_/workflows/int4/test_int4_plain_int32_tensor_npu.py
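For reference, a hedged usage sketch of the workflow this PR enables on an Ascend NPU. The config arguments shown (group_size, int4_packing_format="plain_int32") are assumptions based on the existing Int4WeightOnlyConfig API and may not match the final code exactly:

```python
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# Assumed flow: a bf16 linear layer on an Ascend NPU is converted to int4
# weight-only quantization using the plain_int32 packing format from this PR.
model = torch.nn.Sequential(torch.nn.Linear(256, 512, dtype=torch.bfloat16)).to("npu")

quantize_(
    model,
    Int4WeightOnlyConfig(
        group_size=128,                     # groupwise quantization only
        int4_packing_format="plain_int32",  # dispatches to the NPU path on "npu" devices
    ),
)

x = torch.randn(4, 256, dtype=torch.bfloat16, device="npu")
y = model(x)  # linear now runs torch.ops.npu.npu_weight_quant_batchmatmul under the hood
```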