Computer Vision ML inference in C++
- Self-contained C++ library
- Efficient inference on consumer CPUs and GPUs (NVIDIA, AMD, Intel)
- Lightweight deployment on many platforms (Windows, Linux, macOS)
- Growing number of supported models behind a simple API
- Modular design for full control and for implementing your own models
Based on ggml, similar to the llama.cpp project.
| Model | Task | Backends |
|---|---|---|
| MobileSAM | Promptable segmentation | CPU, Vulkan |
| BiRefNet | Dichotomous segmentation | CPU, Vulkan |
| Depth-Anything | Depth estimation | CPU, Vulkan |
| MI-GAN | Inpainting | CPU, Vulkan |
| ESRGAN | Super-resolution | CPU, Vulkan |
| Implement a model [Guide] | | |
Backbones: SWIN (v1), DINO (v2), TinyViT
Get the library and executables:
- Download a release package and extract it,
- or build from source.
Let's use MobileSAM to generate a segmentation mask of the plushy on the right by passing in a box describing its approximate location.
You can download the model and input image here: MobileSAM-F16.gguf | input.jpg
Find the vision-cli executable in the bin folder and run it to generate the mask:
```sh
vision-cli -m MobileSAM-F16.gguf -i input.jpg -p 420 120 650 430 -o mask.png
```
Pass `--composite output.png` to composite input and mask. Use `--help` for more options.
```cpp
#include <visp/vision.h>

using namespace visp;

int main() {
    backend_device cpu = backend_init(backend_type::cpu);
    sam_model sam = sam_load_model("MobileSAM-F16.gguf", cpu);

    image_data input_image = image_load("input.jpg");
    sam_encode(sam, input_image);

    image_data object_mask = sam_compute(sam, box_2d{{420, 120}, {650, 430}});
    image_save(object_mask, "mask.png");
}
```
This shows the high-level API. Internally it is composed of multiple smaller functions that handle model loading, pre-processing inputs, transferring data to backend devices, post-processing outputs, etc. These can be used as building blocks for flexible functions which integrate with your existing data sources and infrastructure.
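For example, the image encoding computed by `sam_encode` can be queried with several prompts. A minimal sketch using only the calls shown above; the second box and the assumption that `sam_compute` reuses the most recent encoding are illustrative, not taken from the library documentation:

```cpp
// Sketch: encode the image once, then compute masks for several prompts.
// Assumption: sam_compute() reuses the encoding from the last sam_encode() call.
#include <visp/vision.h>

using namespace visp;

int main() {
    backend_device cpu = backend_init(backend_type::cpu);
    sam_model sam = sam_load_model("MobileSAM-F16.gguf", cpu);

    image_data input_image = image_load("input.jpg");
    sam_encode(sam, input_image); // run the image encoder once

    image_data mask_a = sam_compute(sam, box_2d{{420, 120}, {650, 430}});
    image_save(mask_a, "mask_a.png");

    // The second box is a made-up example coordinate.
    image_data mask_b = sam_compute(sam, box_2d{{60, 150}, {300, 420}});
    image_save(mask_b, "mask_b.png");
}
```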
Model download | Paper (arXiv) | Repository (GitHub) | Segment-Anything-Model | License: Apache-2
```sh
vision-cli sam -m MobileSAM-F16.gguf -i input.png -p 300 200 -o mask.png --composite comp.png
```
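Judging by the examples in this README, `-p` accepts either two values for a point prompt or four values for a box prompt (two corner points, matching `box_2d` in the C++ example). Both forms side by side, reusing the file names from the commands above:

```sh
# Point prompt: x y
vision-cli sam -m MobileSAM-F16.gguf -i input.png -p 300 200 -o mask.png

# Box prompt: x0 y0 x1 y1 (top-left and bottom-right corners)
vision-cli sam -m MobileSAM-F16.gguf -i input.png -p 420 120 650 430 -o mask.png
```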
Model download | Paper (arXiv) | Repository (GitHub) | License: MIT
```sh
vision-cli birefnet -m BiRefNet-lite-F16.gguf -i input.png -o mask.png --composite comp.png
```
Model download | Paper (arXiv) | Repository (GitHub) | License: Apache-2 / CC-BY-NC-4
```sh
vision-cli depth-anything -m Depth-Anything-V2-Small-F16.gguf -i input.png -o depth.png
```
Model download | Paper (thecvf.com) | Repository (GitHub) | License: MIT
```sh
vision-cli migan -m MIGAN-512-places2-F16.gguf -i image.png mask.png -o output.png
```
Model download | Paper (arXiv) | Repository (GitHub) | License: BSD-3-Clause
```sh
vision-cli esrgan -m ESRGAN-4x-foolhardy_Remacri-F16.gguf -i input.png -o output.png
```
Models need to be converted to GGUF before they can be used. Conversion also rearranges or precomputes tensors for more efficient inference.
To convert a model, install uv and run:
```sh
uv run scripts/convert.py <arch> MyModel.pth
```
where `<arch>` is one of sam, birefnet, esrgan, ...
This will create models/MyModel.gguf. See convert.py --help for more options.
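For example, converting an ESRGAN checkpoint (the `.pth` file name here is just a placeholder) would look like this:

```sh
# Convert a checkpoint; the result is written to models/ESRGAN-4x.gguf
uv run scripts/convert.py esrgan ESRGAN-4x.pth

# Show more options
uv run scripts/convert.py --help
```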
Building requires CMake and a compiler with C++20 support.
Get the sources
```sh
git clone https://github.com/Acly/vision.cpp.git --recursive
cd vision.cpp
```
Configure and build
```sh
cmake . -B build -D CMAKE_BUILD_TYPE=Release
cmake --build build --config Release
```
Building with Vulkan GPU support requires the Vulkan SDK to be installed.
```sh
cmake . -B build -D CMAKE_BUILD_TYPE=Release -D VISP_VULKAN=ON
```
To enable tests, build with `-DVISP_TESTS=ON`. Run all C++ tests with the following command:
```sh
cd build
ctest -C Release
```
Some tests require a Python environment. It can be set up with uv:
```sh
# Setup venv and install dependencies (once only)
uv sync

# Run python tests
uv run pytest
```
Performance optimization is an ongoing process. The aim is to be in the same ballpark as other frameworks for inference speed, but with:
- much faster initialization and model loading time (<100 ms)
- lower memory overhead
- tiny deployment size (<5 MB for CPU, +30 MB for GPU)
Benchmark hardware:
- CPU: AMD Ryzen 5 5600X (6 cores)
- GPU: NVIDIA GeForce RTX 4070
MobileSAM:

| Device | Precision | vision.cpp | PyTorch | ONNX Runtime |
|---|---|---|---|---|
| cpu | f32 | 669 ms | 601 ms | 805 ms |
| gpu | f16 | 19 ms | 16 ms | |
BiRefNet:

| Model | Device | Precision | vision.cpp | PyTorch | ONNX Runtime |
|---|---|---|---|---|---|
| Full | cpu | f32 | 16333 ms | 18290 ms | |
| Full | gpu | f16 | 208 ms | 190 ms | |
| Lite | cpu | f32 | 4505 ms | 10900 ms | 6978 ms |
| Lite | gpu | f16 | 85 ms | 84 ms | |
Depth-Anything:

| Model | Device | Precision | vision.cpp | PyTorch |
|---|---|---|---|---|
| Small | gpu | f16 | 11 ms | 10 ms |
| Base | gpu | f16 | 24 ms | 22 ms |
MI-GAN:

| Model | Device | Precision | vision.cpp | PyTorch |
|---|---|---|---|---|
| 512-places2 | cpu | f32 | 523 ms | 637 ms |
| 512-places2 | gpu | f16 | 21 ms | 17 ms |
- vision.cpp: using vision-bench, GPU via Vulkan, e.g. `vision-bench -m sam`
- PyTorch: v2.7.1+cu128, eager eval, GPU via CUDA, average of n iterations after warm-up