vision.cpp

Computer Vision ML inference in C++

  • Self-contained C++ library
  • Efficient inference on consumer CPUs and GPUs (NVIDIA, AMD, Intel)
  • Lightweight deployment on many platforms (Windows, Linux, macOS)
  • Growing number of supported models behind a simple API
  • Modular design for full control, including implementing your own models

Based on ggml, similar to the llama.cpp project.

Features

Model            Task                       Backends
MobileSAM        Promptable segmentation    CPU, Vulkan
BiRefNet         Dichotomous segmentation   CPU, Vulkan
Depth-Anything   Depth estimation           CPU, Vulkan
MI-GAN           Inpainting                 CPU, Vulkan
ESRGAN           Super-resolution           CPU, Vulkan

To implement your own model, see the [Guide].

Backbones: SWIN (v1), DINO (v2), TinyViT

Get Started

Get the library and executables:

Example: Select an object in an image

Let's use MobileSAM to generate a segmentation mask of the plushy on the right by passing in a box describing its approximate location.

[Example image: box prompt at pixel location (420, 120) - (650, 430), and the resulting output mask]

You can download the model and input image here: MobileSAM-F16.gguf | input.jpg

CLI

Find the vision-cli executable in the bin folder and run it to generate the mask:

vision-cli -m MobileSAM-F16.gguf -i input.jpg -p 420 120 650 430 -o mask.png

Pass --composite output.png to composite input and mask. Use --help for more options.
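
For example, to produce the mask and a composited preview in a single run:

vision-cli -m MobileSAM-F16.gguf -i input.jpg -p 420 120 650 430 -o mask.png --composite output.png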

API

#include <visp/vision.h>
using namespace visp;

int main() {
  // Initialize a compute backend and load the model weights
  backend_device cpu = backend_init(backend_type::cpu);
  sam_model sam = sam_load_model("MobileSAM-F16.gguf", cpu);

  // Encode the input image (the expensive step)
  image_data input_image = image_load("input.jpg");
  sam_encode(sam, input_image);

  // Compute a mask for the object inside the box prompt and save it
  image_data object_mask = sam_compute(sam, box_2d{{420, 120}, {650, 430}});
  image_save(object_mask, "mask.png");
}

This shows the high-level API. Internally it is composed of multiple smaller functions that handle model loading, input pre-processing, data transfer to backend devices, output post-processing, and so on. These can be used as building blocks for flexible functions that integrate with your existing data sources and infrastructure.
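
For example, because image encoding is exposed as its own step, one encoded image can presumably serve several prompts. A minimal sketch using only the calls shown above (the second box prompt is invented for illustration):

sam_encode(sam, input_image); // run the heavy image encoder once

// each prompt reuses the encoded image, so additional masks are cheap
image_data mask_a = sam_compute(sam, box_2d{{420, 120}, {650, 430}});
image_data mask_b = sam_compute(sam, box_2d{{40, 80}, {300, 400}});
image_save(mask_a, "mask_a.png");
image_save(mask_b, "mask_b.png");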

Models

MobileSAM


Model download | Paper (arXiv) | Repository (GitHub) | Segment-Anything-Model | License: Apache-2

vision-cli sam -m MobileSAM-F16.gguf -i input.png -p 300 200 -o mask.png --composite comp.png

BiRefNet


Model download | Paper (arXiv) | Repository (GitHub) | License: MIT

vision-cli birefnet -m BiRefNet-lite-F16.gguf -i input.png -o mask.png --composite comp.png

Depth-Anything V2


Model download | Paper (arXiv) | Repository (GitHub) | License: Apache-2 / CC-BY-NC-4

vision-cli depth-anything -m Depth-Anything-V2-Small-F16.gguf -i input.png -o depth.png

MI-GAN


Model download | Paper (thecvf.com) | Repository (GitHub) | License: MIT

vision-cli migan -m MIGAN-512-places2-F16.gguf -i image.png mask.png -o output.png

Real-ESRGAN


Model download | Paper (arXiv) | Repository (GitHub) | License: BSD-3-Clause

vision-cli esrgan -m ESRGAN-4x-foolhardy_Remacri-F16.gguf -i input.png -o output.png

Converting models

Models must be converted to the GGUF format before they can be used. Conversion also rearranges or precomputes tensors for more efficient inference.

To convert a model, install uv and run:

uv run scripts/convert.py <arch> MyModel.pth

where <arch> is one of sam, birefnet, esrgan, …

This will create models/MyModel.gguf. See convert.py --help for more options.
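
For example, a hypothetical ESRGAN checkpoint would be converted with:

uv run scripts/convert.py esrgan MyESRGAN.pth

which writes models/MyESRGAN.gguf.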

Building

Building requires CMake and a compiler with C++20 support.

Get the sources

git clone https://github.com/Acly/vision.cpp.git --recursive
cd vision.cpp

Configure and build

cmake . -B build -D CMAKE_BUILD_TYPE=Release
cmake --build build --config Release
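
To use the library from your own CMake project, a sketch along these lines should work. Note that the exported target name below is an assumption; check vision.cpp's CMakeLists.txt for the actual one:

cmake_minimum_required(VERSION 3.20)
project(my_app CXX)
set(CMAKE_CXX_STANDARD 20)

# Add the cloned vision.cpp sources to the build
add_subdirectory(vision.cpp)

add_executable(my_app main.cpp)
# 'visioncpp' is a placeholder target name, not confirmed by the source
target_link_libraries(my_app PRIVATE visioncpp)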

Vulkan (Optional)

Building with Vulkan GPU support requires the Vulkan SDK to be installed.

cmake . -B build -D CMAKE_BUILD_TYPE=Release -D VISP_VULKAN=ON
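
At runtime, selecting the GPU should then mirror the CPU path from the API example above. A minimal sketch, assuming a backend_type::vulkan enumerator exists alongside backend_type::cpu (not confirmed by the source):

#include <visp/vision.h>
using namespace visp;

int main() {
  // backend_type::vulkan is an assumed enumerator, by analogy with backend_type::cpu
  backend_device gpu = backend_init(backend_type::vulkan);
  sam_model sam = sam_load_model("MobileSAM-F16.gguf", gpu);
}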

Tests (Optional)

Build with -DVISP_TESTS=ON. Run all C++ tests with the following commands:

cd build
ctest -C Release

Some tests require a Python environment. It can be set up with uv:

# Setup venv and install dependencies (once only)
uv sync

# Run python tests
uv run pytest

Performance

Performance optimization is an ongoing process. The aim is to be in the same ballpark as other frameworks for inference speed, but with:

  • much faster initialization and model loading time (<100 ms)
  • lower memory overhead
  • tiny deployment size (<5 MB for CPU, +30 MB for GPU)

Inference speed

  • CPU: AMD Ryzen 5 5600X (6 cores)
  • GPU: NVIDIA GeForce RTX 4070

MobileSAM, 1024x1024

Backend   vision.cpp   PyTorch   ONNX Runtime
cpu f32   669 ms       601 ms    805 ms
gpu f16   19 ms        16 ms     –

BiRefNet, 1024x1024

Model   Backend   vision.cpp   PyTorch    ONNX Runtime
Full    cpu f32   16333 ms     18290 ms   –
Full    gpu f16   208 ms       190 ms     –
Lite    cpu f32   4505 ms      10900 ms   6978 ms
Lite    gpu f16   85 ms        84 ms      –

Depth-Anything, 518x714

Model   Backend   vision.cpp   PyTorch
Small   gpu f16   11 ms        10 ms
Base    gpu f16   24 ms        22 ms

MI-GAN, 512x512

Model         Backend   vision.cpp   PyTorch
512-places2   cpu f32   523 ms       637 ms
512-places2   gpu f16   21 ms        17 ms

Setup

  • vision.cpp: using vision-bench, GPU via Vulkan, e.g. vision-bench -m sam
  • PyTorch: v2.7.1+cu128, eager evaluation, GPU via CUDA, average of n iterations after warm-up

Dependencies (integrated)

  • ggml - ML tensor library | MIT
  • stb-image - Image load/save/resize | Public Domain
  • fmt - String formatting (only if compiler doesn't support <format>) | MIT