INT8 support in dynamo workflows
TL;DR
In line with the TorchScript frontend, we should support INT8 precision in the dynamo workflow by setting enabled_precisions={torch.int8} and passing a DataLoaderCalibrator or CacheCalibrator in the calibrator argument.
Goals
Support for INT8 precision has three phases:
Phase 1: Full graph compilation
If the precision is set to INT8, this should imply require_full_compilation=True. We can unify the TorchScript and Dynamo workflows a bit here. py/torch_tensorrt/ptq.py is the main file that holds the DataLoaderCalibrator and CacheCalibrator classes. A prototype implementation can be seen here: https://github.com/pytorch/TensorRT/blob/int8_ptq/py/torch_tensorrt/ptq.py#L74-L104
Once the required algo_info and cache file are available, we build the derived classes of the INT8Calibrator class within the dynamo (dynamo/utils.py) and ts (ts/_compile_spec.py) workflows.
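The Phase 1 rule (INT8 implies full-graph compilation, and PTQ needs a calibrator) can be sketched as a small settings-resolution step. This is only an illustration: Int8Settings and resolve_int8_settings are hypothetical names, not existing torch_tensorrt APIs, and precisions are written as plain strings so the sketch runs without torch installed.

```python
from dataclasses import dataclass
from typing import Any, Optional, Set


@dataclass
class Int8Settings:
    """Hypothetical container for this sketch; not a torch_tensorrt type."""

    enabled_precisions: Set[str]
    calibrator: Optional[Any]
    require_full_compilation: bool


def resolve_int8_settings(
    enabled_precisions: Set[str],
    calibrator: Optional[Any] = None,
    require_full_compilation: bool = False,
) -> Int8Settings:
    """Apply the Phase 1 rule: INT8 implies require_full_compilation=True."""
    if "int8" in enabled_precisions:
        # Assumption: this sketch covers the PTQ path only, where a
        # DataLoaderCalibrator or CacheCalibrator must be supplied.
        if calibrator is None:
            raise ValueError("INT8 PTQ requires a calibrator")
        # Phase 1: force full-graph compilation whenever INT8 is requested.
        require_full_compilation = True
    return Int8Settings(set(enabled_precisions), calibrator, require_full_compilation)


if __name__ == "__main__":
    # object() stands in for a real calibrator instance.
    settings = resolve_int8_settings({"int8"}, calibrator=object())
    print(settings.require_full_compilation)
```

With the real library, the user-facing call would presumably be torch_tensorrt.compile with ir="dynamo", enabled_precisions={torch.int8} and a calibrator argument, but the calibrator argument for the dynamo path is what this RFC proposes and should not be read as an existing interface.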
Phase 2: QAT
We use the pytorch_quantization toolkit to produce QAT graphs. For dynamo, we can take these graphs and apply torch.export/torch.compile to them.
a) First, we need to verify whether the fake-quantize ops inserted by the toolkit actually appear in the graphs produced by dynamo.
b) The remaining work would be to add converter support for the torch.fake_quantize_per_tensor_affine and torch.fake_quantize_per_channel_affine ops.
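For reference when writing those converters, the arithmetic of the two fake-quantize ops can be sketched in plain Python. This is a dependency-free reference under stated assumptions, not the PyTorch implementation: it works on Python lists rather than tensors, and the per-channel variant assumes the channel axis is the outer list (axis 0). The function names are illustrative.

```python
from typing import List, Sequence


def fake_quantize_per_tensor(
    xs: Sequence[float],
    scale: float,
    zero_point: int,
    quant_min: int,
    quant_max: int,
) -> List[float]:
    """Quantize then dequantize each value with one scale and zero point."""
    out = []
    for x in xs:
        q = round(x / scale) + zero_point       # round() rounds halves to even
        q = min(quant_max, max(quant_min, q))   # clamp to the integer range
        out.append((q - zero_point) * scale)    # dequantize back to float
    return out


def fake_quantize_per_channel(
    rows: Sequence[Sequence[float]],
    scales: Sequence[float],
    zero_points: Sequence[int],
    quant_min: int,
    quant_max: int,
) -> List[List[float]]:
    """Same arithmetic, with one (scale, zero_point) pair per outer row."""
    return [
        fake_quantize_per_tensor(row, s, z, quant_min, quant_max)
        for row, s, z in zip(rows, scales, zero_points)
    ]


if __name__ == "__main__":
    print(fake_quantize_per_tensor([0.3, 100.0, -0.2], 0.5, 0, -128, 127))
```

Since each op is a quantize step followed by a dequantize step, a converter would presumably lower it to a quantize/dequantize layer pair in the TensorRT network; the exact lowering is not specified here.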
Phase 3: INT8 + Fallback
In the case of PTQ fallback:
a) If a particular op is unsupported, the graph is split into multiple TRT subgraphs. Each subgraph would require its own calibrator, and the dataset to calibrate on would be different for each, since a subgraph consumes intermediate activations rather than the original model inputs.
b) If the particular op is supported but is forced to fall back via torch_executed_ops, an alternative approach is to explicitly set this op's precision to FP32 using TRT APIs (implying we run this graph with require_full_compilation=True) and proceed with the INT8 calibration (via the PTQ API). We rely on TensorRT to handle the mixed-precision inference of this graph.
In the case of QAT fallback, no explicit handling is needed. The regions that are convertible to INT8 will be converted and the rest will run in FP32.
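The two PTQ fallback cases can be summarized as a small decision routine. This is a sketch under assumptions: plan_ptq_fallback and the keys of the dictionary it returns are hypothetical, ops are represented by their names as strings, and a graph containing any unsupported op is treated as case (a) regardless of torch_executed_ops.

```python
from typing import Dict, Iterable, Set


def plan_ptq_fallback(
    graph_ops: Iterable[str],
    supported_ops: Set[str],
    torch_executed_ops: Set[str],
) -> Dict[str, object]:
    """Pick a PTQ fallback strategy for a graph (hypothetical helper)."""
    ops = list(graph_ops)
    unsupported = [op for op in ops if op not in supported_ops]
    if unsupported:
        # Case (a): the graph must be partitioned, so every TRT subgraph
        # needs its own calibrator and its own calibration data.
        return {
            "strategy": "partition",
            "require_full_compilation": False,
            "fp32_ops": [],
            "unsupported_ops": unsupported,
        }
    # Case (b): everything is convertible; ops the user forced to fall back
    # are kept inside the single TRT graph but pinned to FP32.
    forced = [op for op in ops if op in torch_executed_ops]
    return {
        "strategy": "full_graph",
        "require_full_compilation": True,
        "fp32_ops": forced,
        "unsupported_ops": [],
    }


if __name__ == "__main__":
    print(plan_ptq_fallback(["conv", "relu"], {"conv", "relu"}, {"relu"}))
```

In case (b) the ops listed under fp32_ops are the ones whose layer precision would be set to FP32 through the TRT APIs before calibration; how the dynamo converters would expose that is left open by this proposal.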
Prototype - Phase 1: Medium, Phase 2: Medium
MVP 2.2 - Phase 1: Medium, Phase 2: Medium
Future work / Extensions