TorchTRT Compilation Memory Consumption Management #3839
---
In Qwen, the TRT builder uses 1x memory to build the live engine.
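For context, a rough sketch of how such a multiplier can be measured: sample the process's resident set size around the build call. `measure_build_memory` and `build_fn` are illustrative stand-ins for whatever triggers the engine build, not part of any existing API.

```python
import os
import resource

import psutil


def measure_build_memory(build_fn):
    """Report host RSS growth and peak RSS around a build (Linux: ru_maxrss is KiB)."""
    proc = psutil.Process(os.getpid())
    rss_before = proc.memory_info().rss
    engine = build_fn()  # stand-in for the TRT engine build
    rss_after = proc.memory_info().rss
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024
    print(f"RSS delta: {(rss_after - rss_before) / 1e9:.2f} GB, "
          f"peak RSS: {peak / 1e9:.2f} GB")
    return engine
```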
---
INetworkDefinition does not actually take any memory; it is the lowered graph and constant folding that take up to 1x (0-1x) memory. Code here: INetworkDefinition just holds references to the weights in the lowered graph.
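A minimal sketch with the TensorRT Python API illustrating the point: `add_constant` stores a reference to the caller's buffer rather than copying it, so the network definition itself stays small. (The explicit-batch flag is deprecated and a no-op on recent TensorRT versions.)

```python
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# The network holds a pointer to `w`, not a copy; `w` must stay alive
# until the engine is built. The real memory cost sits in the lowered
# graph and constant folding, not in INetworkDefinition.
w = np.ones((64, 64), dtype=np.float32)
const_layer = network.add_constant(w.shape, trt.Weights(w))
```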
---
# Resource Aware Graph Sharding

## TL;DR

We found that if we split graphs into small, distinct engines we get roughly the same performance but reduce peak memory consumption. So we want to design a compiler phase that cuts the graph up.

## Goal(s)

## Usecases

## Proposed APIs / UX

### Phase 1 (Experimental / Beta Stability) [2.10]

- `torch_tensorrt.compile(module, resource_aware_sharding=True)`: default behavior is that we estimate the maximum graph size and shard accordingly.
- `torch_tensorrt.compile(module, compile_peak_host_memory_consumption=1e10)`: we take this as the maximum CPU memory we can use and shard to stay under it.

### Phase 1 (Stable) [2.11+]

- `torch_tensorrt.compile(module, disable_resource_aware_sharding=False)`: default behavior is that we estimate the maximum graph size and shard accordingly, without user intervention.
- `torch_tensorrt.compile(module, compile_peak_host_memory_consumption=1e10)`: we take this as the maximum CPU memory we can use and shard to stay under it.

## Example Workflow
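A minimal sketch of the proposed UX. Note that `resource_aware_sharding` and `compile_peak_host_memory_consumption` are the flags proposed above and do not exist yet; `MyModel` is a placeholder.

```python
import torch
import torch_tensorrt

model = MyModel().eval().cuda()  # placeholder user module
inputs = [torch.randn(1, 3, 224, 224, device="cuda")]

# Phase 1 (Experimental): let the compiler estimate a safe shard size.
trt_model = torch_tensorrt.compile(model, inputs=inputs, resource_aware_sharding=True)

# Alternatively, cap peak host memory during compilation (bytes, here ~10 GB).
trt_model = torch_tensorrt.compile(
    model, inputs=inputs, compile_peak_host_memory_consumption=1e10
)
```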
## Limitations

## Internal Implementation

### Design

Concept 1: Assumptions

### Extensions Required to Core API implementations

### Data Structures
Candidate shard subgraphs in the lowered graph, used for subgraph matching:

```
graph(x, w, b, scale, ...):
    aten::conv
    aten::batch_norm
    aten::relu
    return

graph(x, w, b):
    aten::mm
    aten::add
    return
```

### Details specific for FX support

## Implementation Phases

### Prototype - MVP
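As a starting point for the MVP, here is a minimal sketch of what a resource-aware sharding pass over an FX graph could look like. The greedy byte-budget cost model (`param_bytes`, `shard_by_budget`) is an illustrative assumption, not the actual design; `split_module` is the existing `torch.fx` partitioning utility.

```python
from typing import Dict

import torch
import torch.fx
from torch.fx.passes.split_module import split_module


def param_bytes(gm: torch.fx.GraphModule, node: torch.fx.Node) -> int:
    """Hypothetical cost model: bytes of parameters consumed by `node`."""
    total = 0
    for inp in node.all_input_nodes:
        if inp.op == "get_attr":
            try:
                t = gm.get_parameter(inp.target)
            except AttributeError:
                continue  # get_attr may reference a buffer, not a parameter
            total += t.numel() * t.element_size()
    return total


def shard_by_budget(gm: torch.fx.GraphModule, budget_bytes: int) -> torch.fx.GraphModule:
    """Greedily assign nodes to partitions so each stays under budget_bytes."""
    assignment: Dict[torch.fx.Node, int] = {}
    part, used = 0, 0
    for node in gm.graph.nodes:  # graph.nodes is topologically ordered
        cost = param_bytes(gm, node)
        if used and used + cost > budget_bytes:
            part, used = part + 1, 0
        used += cost
        assignment[node] = part
    # split_module turns each partition into its own submodule, which
    # could then be compiled into its own TRT engine.
    return split_module(gm, gm, lambda n: assignment[n])
```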
---
Ensure that the outputs of each shard are compatible with the inputs of the next.
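For example, a simple illustrative check at a shard boundary (all names hypothetical), comparing one shard's outputs against example inputs for the next shard:

```python
import torch


def check_boundary(prev_outputs, next_example_inputs):
    """Assert shard N's outputs line up with shard N+1's expected inputs (illustrative)."""
    assert len(prev_outputs) == len(next_example_inputs), "arity mismatch between shards"
    for out, example in zip(prev_outputs, next_example_inputs):
        assert out.dtype == example.dtype, f"dtype mismatch: {out.dtype} vs {example.dtype}"
        assert out.shape == example.shape, f"shape mismatch: {out.shape} vs {example.shape}"
```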