👋 Hi, I'm ThomasVonWu. I'd like to introduce you to a simple and practical TensorRT-based deployment repository that uses an end-to-end, sparse-transformer perception paradigm to detect 3D obstacles. The repository has no complex dependencies for training, inference, or deployment (that is, no need to install MMDetection3D, mmcv, mmcv-full, mmdeploy, etc.), so it is easy to set up on a local workstation or on GPU clusters. It also provides deployment solutions for x86 (NVIDIA RTX series GPUs) and ARM (NVIDIA Orin). Finally, you can happily deploy your end-to-end model onboard through this repository.
🌟 You may be interested in:
- how to define a PyTorch custom operator (DeformableAttentionAggr) and register the corresponding ONNX node;
- how to build the DeformableAttentionAggr custom-operator plugin for the TensorRT engine with Makefile or CMake;
- how to convert an ONNX file containing a custom operator into a TensorRT engine, making the plugin part of the whole engine;
- how to validate inference-result consistency: PyTorch results vs. ONNX Runtime results vs. TensorRT results;
- how to export a PyTorch model with a temporal-fusion transformer head to ONNX;
- how to accurately locate the TensorRT layer where overflow occurs when quantizing model parameters to FP16.
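As a taste of the first point, here is a minimal, hypothetical sketch of a custom `autograd.Function` whose `symbolic` method emits a custom ONNX node. The forward body is placeholder math (the real DeformableAttentionAggr kernel is CUDA), and the `custom::` domain name is illustrative, not the repo's actual registration:

```python
import torch
from torch.autograd import Function

class DeformableAttentionAggr(Function):
    """Toy stand-in for the custom op (the real kernel is CUDA)."""

    @staticmethod
    def forward(ctx, value, sampling_locations, attention_weights):
        # Placeholder math so the op runs on CPU; the real plugin
        # samples `value` at `sampling_locations` and aggregates
        # the samples with `attention_weights`.
        return value * attention_weights.sum(dim=-1, keepdim=True)

    @staticmethod
    def symbolic(g, value, sampling_locations, attention_weights):
        # Emit a custom ONNX node in a private domain; at engine-build
        # time TensorRT resolves it to the plugin of the same name.
        return g.op("custom::DeformableAttentionAggr",
                    value, sampling_locations, attention_weights)
```

Wrapping `DeformableAttentionAggr.apply` inside an `nn.Module` then lets `torch.onnx.export` record the custom node instead of tracing through the Python placeholder math.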
 
These training-reproduction experiments were conducted on 4 NVIDIA H20 GPUs (96 GB memory each).
| Model | Repository | Backbone | Pretrain | Img Size | Epochs | Training | FPS | NDS | mAP | AMOTA | AMOTP | IDS | config | ckpt | log | GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sparse4Dv3 | HorizonRobotics/Sparse4D | ResNet50 | ImageNet | 256x704 | 100 | 22 h | 19.8 | 0.5637 | 0.4646 | 0.477 | 1.167 | 456 | - | - | - | RTX 3090 |
| Sparse4Dv3 | ThomasVonWu/SparseEnd2End | ResNet50 | ImageNet | 256x704 | 150 | 77.5 h | - | 0.5623 | 0.4645 | 0.457 | 1.196 | 541 | cfg | ckpt | log | H20 |
| Model | Img Size | Backbone | Framework | Precision | mAP | NDS | FPS | GPU | config | ckpt | onnx | engine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sparse4Dv3 | 256x704 | ResNet50 | PyTorch | FP32 | 0.4645 | 0.5623 | 15.8 | RTX 3090 | config | ckpt | - | - |
| Sparse4Dv3 | 256x704 | ResNet50 | TensorRT | FP32 | TBD | TBD | TBD | RTX 3090 | config | ckpt | onnx | engine |
| Sparse4Dv3 | 256x704 | ResNet50 | TensorRT | FP16 | TBD | TBD | TBD | RTX 3090 | config | ckpt | TBD | TBD |
| Sparse4Dv3 | 256x704 | ResNet50 | TensorRT | INT8+FP16 | TBD | TBD | TBD | RTX 3090 | config | ckpt | TBD | TBD |
| Sparse4Dv3 | 256x704 | ResNet50 | TensorRT | FP32 | TBD | TBD | TBD | NVIDIA Orin | config | ckpt | TBD | TBD |
| Sparse4Dv3 | 256x704 | ResNet50 | TensorRT | FP16 | TBD | TBD | TBD | NVIDIA Orin | config | ckpt | TBD | TBD |
| Sparse4Dv3 | 256x704 | ResNet50 | TensorRT | INT8+FP16 | TBD | TBD | TBD | NVIDIA Orin | config | ckpt | TBD | TBD |
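For the FP16 engines above, one practical way to localize an overflowing layer is to scan FP32 reference activations for magnitudes outside the representable float16 range. This sketch assumes a hypothetical per-layer dump (e.g. collected by marking every layer output in the TensorRT network, or via PyTorch forward hooks); the function name and `margin` parameter are illustrative:

```python
import numpy as np

FP16_MAX = 65504.0  # largest finite float16 value

def find_overflow_layers(layer_outputs, margin=1.0):
    """Given FP32 per-layer activations (dict: layer name -> ndarray),
    return (name, peak) pairs whose magnitude would overflow in FP16."""
    suspects = []
    for name, arr in layer_outputs.items():
        peak = float(np.max(np.abs(arr)))
        if peak * margin > FP16_MAX:
            suspects.append((name, peak))
    return suspects
```

The first layer reported is usually the one to keep in FP32 (or rescale) when building the mixed-precision engine.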
24 Sep, 2024: The complete deployment solution was released.
25 Aug, 2024: I released the repository SparseEnd2End. The complete deployment solution will be released as soon as possible. Please stay tuned!
- Register the custom operator DeformableAttentionAggr and export ONNX and TensorRT engines. (25 Aug, 2024)
- Verify inference-result consistency: DeformableAttentionAggr PyTorch implementation vs. TensorRT plugin implementation. (25 Aug, 2024)
- Export the SparseTransFormer backbone to ONNX and a TensorRT engine. (8 Sep, 2024)
- Verify inference-result consistency: SparseTransFormer backbone PyTorch implementation vs. ONNX Runtime vs. TensorRT engine. (8 Sep, 2024)
- Export the SparseTransFormer head to ONNX and a TensorRT engine. (24 Sep, 2024)
- Verify inference-result consistency: SparseTransFormer head PyTorch implementation vs. TensorRT engine. (24 Sep, 2024)
- Inference acceleration using CUDA shared memory and CUDA FP16 in the DeformableAttentionAggr plugin implementation.
- Inference acceleration using FlashAttention in place of MultiheadAttention.
- Inference acceleration using FP16/INT8 in place of FP32 for the TensorRT engine.
- Inference acceleration: image pre-processing, InstanceBank caching, and model post-processing implemented with CUDA.
- Image pre-processing, InstanceBank caching, and model post-processing implemented with C++.
- Onboard: full-link inference using CUDA, TensorRT, and C++.
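The consistency checks in the list above boil down to comparing flattened outputs of two runtimes. A minimal sketch (the function name is illustrative, not the repo's API):

```python
import numpy as np

def consistency_report(ref, test):
    """Compare two inference results, e.g. PyTorch vs. TensorRT,
    via max absolute error and cosine similarity."""
    ref = np.asarray(ref, dtype=np.float64).ravel()
    test = np.asarray(test, dtype=np.float64).ravel()
    max_abs_err = float(np.max(np.abs(ref - test)))
    cosine_sim = float(np.dot(ref, test)
                       / (np.linalg.norm(ref) * np.linalg.norm(test) + 1e-12))
    return max_abs_err, cosine_sim
```

A cosine similarity near 1.0 together with a small max absolute error indicates the two implementations agree; the exact tolerances (especially for FP16) are up to the project.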
 
SparseEnd2End is a Sparse-Centric paradigm for end-to-end autonomous driving perception.
If you find SparseEnd2End useful in your research or applications, please consider giving it a star ⭐
08/25/2024: [v1.0.0] This repository now supports training and inference on the nuScenes dataset, including data dumping in JSON, training/inference log caching, TensorBoard hooks, and so on.
11/14/2024: [v2.0.0] Reproduced the FP32 training results of HorizonRobotics/Sparse4D.

