InternVLA-A1 is an end-to-end vision–language–action (VLA) framework that unifies understanding, generation, and action for robotic manipulation. It leverages predictive imagination of task evolution to guide execution, enabling more robust manipulation in highly dynamic environments.
- Novel Model Architecture: A Mixture-of-Transformers architecture for unified understanding, generation, and action (see the illustrative sketch after the resource list below).
- Hybrid Synthetic-Real Data Corpus: InternData-A1, a hybrid synthetic-real manipulation dataset integrating 5 heterogeneous robots, 15 skills, and 200+ scenes, with an emphasis on multi-robot collaboration in dynamic scenarios.
- Impressive Real-World Performance: InternVLA-A1 demonstrates strong effectiveness and generalization in highly dynamic scenarios, including dynamic grasping from conveyor belts and multi-robot collaboration.
- F1-VLA (F1 is the predecessor of InternVLA-A1): Paper | Code | Model
- InternVLA-A1: Code | Paper/Model (Scheduled for late September release)
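The exact architecture is described in the paper; the snippet below is only a minimal, hedged sketch of the Mixture-of-Transformers idea, assuming per-modality (understanding / generation / action) projections and feed-forward networks with self-attention shared over the concatenated token sequence. The class and modality names (`MoTBlock`, `und`/`gen`/`act`) are hypothetical and are not the repository's API.

```python
# Illustrative Mixture-of-Transformers block (sketch only, not the released code).
# Assumption: each modality keeps its own QKV/output/FFN parameters ("experts"),
# while attention is computed jointly over all modalities' tokens.
import torch
import torch.nn as nn


class MoTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int, modalities=("und", "gen", "act")):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Per-modality parameters.
        self.qkv = nn.ModuleDict({m: nn.Linear(dim, 3 * dim) for m in modalities})
        self.proj = nn.ModuleDict({m: nn.Linear(dim, dim) for m in modalities})
        self.ffn = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for m in modalities}
        )
        self.norm1 = nn.ModuleDict({m: nn.LayerNorm(dim) for m in modalities})
        self.norm2 = nn.ModuleDict({m: nn.LayerNorm(dim) for m in modalities})

    def forward(self, tokens: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # tokens: {modality: (B, T_m, dim)}; attention is shared across modalities.
        qkv, lengths = [], {}
        for m, x in tokens.items():
            lengths[m] = x.shape[1]
            qkv.append(self.qkv[m](self.norm1[m](x)))
        q, k, v = torch.cat(qkv, dim=1).chunk(3, dim=-1)

        def split_heads(t):
            b, t_len, _ = t.shape
            return t.view(b, t_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Joint attention over the concatenated (understanding, generation, action) tokens.
        attn = torch.nn.functional.scaled_dot_product_attention(
            split_heads(q), split_heads(k), split_heads(v)
        )
        attn = attn.transpose(1, 2).reshape(q.shape[0], q.shape[1], -1)

        out, offset = {}, 0
        for m, x in tokens.items():
            a = attn[:, offset: offset + lengths[m]]
            offset += lengths[m]
            h = x + self.proj[m](a)                      # modality-specific output projection
            out[m] = h + self.ffn[m](self.norm2[m](h))   # modality-specific FFN
        return out


# Example (shapes are arbitrary): block = MoTBlock(1024, 16)
# out = block({"und": u_tokens, "gen": g_tokens, "act": a_tokens})
```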
default.mp4
The model handles dynamically shaped packages on conveyor belts, tracking and predicting their trajectories in real time to achieve high-speed, stable grasping, while adaptively flipping packages and reading express delivery information from the waybills.
multi-robot-long-horizon.mp4
The model swiftly identifies, localizes, and grasps fast-moving ingredients according to task demands, showcasing its adaptability in complex environments.
- Python ≥ 3.10
- torch ≥ 2.6.0
- CUDA ≥ 12.4
# Clone repository
git clone https://github.com/InternRobotics/InternVLA-A1.git
# Create environment
conda create -n internvla_a1 python=3.10
conda activate internvla_a1
# Install dependencies
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 torchcodec==0.2.1 --index-url https://download.pytorch.org/whl/cu124
# Install other requirements
pip install -r requirements.txt
pip install numpy==1.26.4
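After installation, a quick check like the one below can confirm that the pinned versions were resolved and that CUDA is visible; this is only an illustrative snippet, not a script shipped with the repository.

```python
# Environment sanity check (illustrative).
import numpy
import torch
import torchvision

print("torch:", torch.__version__)               # expect 2.6.0 (+cu124)
print("torchvision:", torchvision.__version__)   # expect 0.21.0
print("numpy:", numpy.__version__)               # expect 1.26.4
print("CUDA build:", torch.version.cuda)         # expect 12.4
print("CUDA available:", torch.cuda.is_available())
```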
This project is licensed under the MIT License.