Add Int8Tensor for clearer interface #3038
Conversation
Introduce a new tensor subclass API for int8 quantization with a clearer interface. The main change can be summarized as follows:
- Old: complex affine transform (AffineQuantizedTensor) with separate layout handling
- New: direct int8 tensor with qdata, scale, and zero_point attributes

Test plan: test/quantization/quantize_/workflows/int8/test_int8_tensor.py

Future plan: implement block-wise quantization using the `block_size` parameter
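To give a feel for the new interface described above, here is a minimal sketch; the `from_hp` constructor and `block_size` argument are referenced elsewhere in this thread, but the exact signature and import path shown here are assumptions:

```python
import torch
from torchao.quantization import Int8Tensor  # import path is an assumption

w = torch.randn(256, 512, dtype=torch.bfloat16)
# Per-row granularity: one scale per output row.
w_int8 = Int8Tensor.from_hp(w, block_size=[1, 512])

# The quantized state lives directly on the subclass:
w_int8.qdata       # int8 values
w_int8.scale       # per-row scales, e.g. shape (256, 1)
w_int8.zero_point  # zeros under symmetric quantization
w_hp = w_int8.dequantize()  # back to high precision
```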
          
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3038.
Note: links to docs will display an error until the docs builds have completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
can you add a version 2 and expose this tensor through
ao/torchao/quantization/quant_api.py, line 1497 in 8525185
ao/torchao/quantization/quant_api.py, line 1752 in 8525185
        result = result.to(scale.dtype) * scale
        result = result.view(*input_tensor.shape[:-1], -1)
    else:
        # FP × INT8 (static)
also this is the code for weight only quant I think:
ao/torchao/dtypes/uintx/plain_layout.py, line 250 in 122b307
def _linear_fp_act_int8_weight_impl(input_tensor, weight_tensor, bias):
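For context, that fp-activation × int8-weight path boils down to a matmul against the int8 values followed by a per-row rescale. A rough self-contained sketch of the pattern (hypothetical helper, not the torchao implementation):

```python
import torch

def linear_fp_act_int8_weight(x, w_q, w_scale, bias=None):
    # x: (..., in_features) float activations
    # w_q: (out_features, in_features) int8 weights
    # w_scale: (out_features,) per-row weight scales
    res = torch.mm(x.reshape(-1, x.shape[-1]), w_q.t().to(x.dtype))
    res = res * w_scale.to(x.dtype).reshape(1, -1)  # rescale after the matmul
    res = res.reshape(*x.shape[:-1], w_q.shape[0])
    return res + bias if bias is not None else res
```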
Done at 9383550, thanks for pointing it out.
raise ValueError("Expected 2D tensor and block_size length 2")

# Rounding function from high precision dtype
scale = w.abs().max(dim=-1, keepdim=True)[0] / 127.0
looks like block_size is not used? why is that?
you can check out
ao/torchao/dtypes/uintx/plain_layout.py, line 232 in 8c5c33e
def _linear_fp_act_int8_weight_check(input_tensor, weight_tensor, bias):
also this should be using these quant primitive ops:
ao/torchao/quantization/quantize_/workflows/int4/int4_marlin_sparse_tensor.py, lines 79 to 97 in 8c5c33e
scale, zero_point = choose_qparams_affine(
    input=preprocessed_w,
    mapping_type=MappingType.SYMMETRIC,
    block_size=block_size,
    target_dtype=target_dtype,
    quant_min=quant_min,
    quant_max=quant_max,
    eps=1e-6,
)
wq = quantize_affine(
    input=preprocessed_w,
    block_size=block_size,
    scale=scale,
    zero_point=zero_point,
    output_dtype=target_dtype,
    quant_min=quant_min,
    quant_max=quant_max,
)
ao/torchao/quantization/quant_api.py, line 1566 in 8c5c33e
new_weight = to_affine_quantized_intx(
ao/torchao/dtypes/affine_quantized_tensor.py, line 325 in 8c5c33e
scale, zero_point = choose_qparams_affine(
this might require a bit too much context, let me know if you would like us to take over
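For illustration, the int8 path could call those primitives roughly like this; a sketch assuming the standard int8 range and per-row granularity (import paths and defaults may differ from the PR):

```python
import torch
from torchao.quantization.quant_primitives import (
    MappingType,
    choose_qparams_affine,
    quantize_affine,
)

w = torch.randn(256, 512)
block_size = (1, 512)  # per-row granularity

scale, zero_point = choose_qparams_affine(
    input=w,
    mapping_type=MappingType.SYMMETRIC,
    block_size=block_size,
    target_dtype=torch.int8,
    quant_min=-128,
    quant_max=127,
    eps=1e-6,
)
w_q = quantize_affine(
    input=w,
    block_size=block_size,
    scale=scale,
    zero_point=zero_point,
    output_dtype=torch.int8,
    quant_min=-128,
    quant_max=127,
)
```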
Thanks, I surely want to take it over! I drafted this PR for those updates, but will look into it today (in about 6 hours).
btw, version 2 is added at c53dad0 (version 1 is the default)
please rebase, and let me know when this is ready for review again @namgyu-youn
if not isinstance(activation_tensor, Int8Tensor):
    if weight_tensor.act_quant_kwargs.static_scale is not None:
        # INT8 × INT8 (static): symmetric quantization only
        static_scale = weight_tensor.act_quant_kwargs.static_scale
OK, if this is needed then I think it should be included in _choose_quant_func_and_quantize_tensor as well?
implements_torch_function = Int8Tensor.implements_torch_function

@implements([aten.dequantize.self])
is this needed? if not we should remove for now
if scale.numel() > 1 and scale.shape != qdata_fp.shape:
    scale = scale.view(*scale.shape, *[1] * (qdata_fp.ndim - scale.ndim))
is this needed?
It is needed for block-level granularity. For example:
- Row-wise: if the scale shape is (256, 1) and w_q (the quantized weight) is (256, 512), they broadcast naturally
- Channel-wise: if the scale shape is (512,) and w_q is (256, 512), they broadcast naturally
- Block-wise granularity: if the scale shape is (32, 64) and w_q is (256, 512), the scale has to be reshaped before it can broadcast
But we can also reuse _maybe_expand_scale_to_tensor_shape, similar to:
ao/torchao/quantization/quantize_/workflows/float8/float8_tensor.py, lines 149 to 154 in 4b79f9e
def dequantize(self, output_dtype: Optional[torch.dtype] = None) -> torch.Tensor:
    if output_dtype is None:
        output_dtype = self.dtype
    qdata, scale = self.qdata, self.scale
    return _dequantize_affine_float8(qdata, scale, output_dtype)
and
ao/torchao/quantization/quant_primitives.py, lines 2407 to 2421 in f3fc5e7
def _dequantize_affine_float8(
    tensor: torch.Tensor,
    scale: torch.Tensor,
    output_dtype: torch.dtype = torch.float32,
) -> torch.Tensor:
    """
    Dequantizes the float8 tensor to high precision tensor.
    """
    fp8_tensor = tensor.to(torch.float32)
    # Expand scale to match tensor dimensions for block-wise quantization
    scale_expanded = _maybe_expand_scale_to_tensor_shape(scale, tensor.shape)
    hp_tensor = fp8_tensor * scale_expanded
    return hp_tensor.to(output_dtype)
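The block-wise case is the one that needs explicit expansion. A minimal sketch of the idea behind such an expansion helper (hypothetical, not the torchao function itself):

```python
import torch

def expand_block_scale(scale: torch.Tensor, shape: torch.Size) -> torch.Tensor:
    # Repeat each scale entry across its block so it matches the tensor shape,
    # e.g. a (32, 64) scale for a (256, 512) tensor covers 8x8 blocks.
    for dim, (s, full) in enumerate(zip(scale.shape, shape)):
        if s != full:
            scale = scale.repeat_interleave(full // s, dim=dim)
    return scale

scale = torch.rand(32, 64)
w_q = torch.randint(-128, 128, (256, 512), dtype=torch.int8)
w_hp = w_q.to(torch.float32) * expand_block_scale(scale, w_q.shape)  # broadcasts cleanly
```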
cls: type,
qdata: torch.Tensor,
scale: torch.Tensor,
block_size: list[int],
nit: I remember the built-in list annotation has a higher Python version requirement, so it's probably better to change this to List from typing, I think
Thanks, does that apply only to List, and not to Dict, Tuple, etc.?
probably also for Dict and Tuple, I have only tried list before
Is there a use case on an old (< 3.9) Python version? I remember list, dict, and tuple have been natively supported in annotations since 3.9: https://docs.astral.sh/ruff/rules/non-pep585-annotation/.
Because PEP 585 (type hints) is tied to a pre-commit issue, I'd prefer to target newer versions (no need for from typing import List, Dict, Tuple). How about using list, dict, and tuple, focusing on newer Python versions? I feel there might be a new issue from PEP 585 if we go with Dict, List, Tuple.
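For reference, the difference under discussion (built-in generics per PEP 585 require Python >= 3.9 at runtime; the typing aliases work on older interpreters too):

```python
# Python >= 3.9 (PEP 585): built-in types can be used as generic annotations.
def as_tuple(block_size: list[int]) -> tuple[int, ...]:
    return tuple(block_size)

# On Python < 3.9 the typing aliases are required instead:
# from typing import List, Tuple
# def as_tuple(block_size: List[int]) -> Tuple[int, ...]: ...
```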
return module

def _unwrap_float8_linear(module: Float8Linear) -> nn.Linear:
some rebase issue?
Sorry, I assumed 0a45f90 was the wrong move, and that it was the start of the rebase issue. The solution looks like dropping the relevant commits using an interactive rebase.
But rebasing away all those commits is overwhelming to me. I really don't want to open a duplicate PR, but may I reopen this as a new PR and link it back to this one? I just want to remove the unrelated changes from the log.
@namgyu-youn please feel free to close this and open a new one if it's hard to fix the rebase issue; it seems like it's still not fully fixed
Thanks, I just want to remove the unrelated change logs, as you mentioned.
Updated log:
        
          
torchao/utils.py (outdated)

from importlib.metadata import version
from math import gcd
-from typing import Any, Callable, Optional, Type
+from typing import Any, Callable, Optional
please fix rebase to not have these changes, or open a new PR if you don't know how to fix rebase
please fix rebase, otherwise seems mostly OK I think
@common_utils.parametrize(
    "sizes",
    [
        ((128,), 256, 128),
do 3D inputs work? e.g. ((32, 128), 256, 128)
No, a 3D input raises a ValueError from from_hp(): #3038 (comment)
assert error > 20, f"Quantization error is too high got a SQNR of {error}"

@common_utils.parametrize("dtype", [torch.bfloat16, torch.float16])
def test_static_quantization(self, dtype):
can you add the static quant config Int8StaticActivationInt8WeightConfig and test that as well?
or maybe you can remove this for now and coordinate with @Xia-Weiwen (https://github.com/pytorch/ao/pull/3089/files#diff-bf4d50867e3d649de2d89146592bf47d2f258c4c19126c8acf0e120ee904b726) to add the static quant support separately?
Both are OK with me. Thanks.
@common_utils.parametrize(
    "config",
    [
        Int8DynamicActivationInt8WeightConfig(version=2),
will need to test the static quant as well, if that is added
changes in this PR should be reverted as well
)

@implements(aten.select.int)
is this tested? if not, please remove for now
if tensor.scale.numel() == 1:
    # Per-tensor quantization - scale doesn't change
    sliced_scale = tensor.scale
elif dim < tensor.scale.ndim and tensor.scale.shape[dim] > 1:
    # Block-wise quantization - need to slice the scale appropriately
    sliced_scale = func(tensor.scale, dim, start, end, step)
else:
    sliced_scale = tensor.scale
can you match the implementation with Float8Tensor?
ao/torchao/quantization/quantize_/workflows/float8/float8_tensor.py, lines 449 to 456 in 53b5efd
if self.scale.numel() == 1:
    # Per-tensor quantization - scale doesn't change
    sliced_scale = self.scale
else:
    # Block-wise quantization - need to slice the scale appropriately
    sliced_scale = _slice_scale_for_dimension(
        self.scale, self.qdata.shape, dim, start, end, step
    )
Updated logs: in 680cec9
@jerryzh168 sorry for the multiple PRs again; the reopened PR (#3241) is copy-pasted from the last commit in this PR and resolves the rebase errors. Please check the comment above (change log; #3038 (comment)) first, and then take a look at #3241, thanks.
Summary:
Introduce a new tensor subclass API for int8 quantization with a clearer interface. The main change can be summarized as follows:
- Old: complex affine transform (AffineQuantizedTensor) with separate layout handling
- New: direct int8 tensor with qdata, scale, and zero_point attributes

Related Issue/PR: #3012 (comment), #2752

Test plan: test/quantization/quantize_/workflows/int8/test_int8_tensor.py