PhotoPrism® Computer Vision API

This repository provides a web service with advanced computer vision models for use with PhotoPrism®.
The currently integrated models, each with its own endpoint, are kosmos-2, vit-gpt2-image-captioning, and blip-image-captioning-large:
Kosmos-2 is the most accurate model of the three. It was developed by Microsoft, and this application uses the transformers implementation of the original model, as described on its Hugging Face model card. The model was released in June 2023 and offers object detection and spatial reasoning. Kosmos-2 produces very accurate image captions (a 0.04-0.1 higher CLIP score than the other two models offered) and is the default model.
vit-gpt2-image-captioning was released by nlpconnect. It combines ViT and GPT-2 into a multi-modal image captioning model. I have found it to be the weakest performer of the three, but your mileage may vary.
blip-image-captioning-large was released by Salesforce in 2022. Its primary purpose is to improve both image understanding and text generation using novel techniques. It achieved a +2.8% CIDEr improvement, and I have found it to perform better than vit-gpt2-image-captioning, while Kosmos-2 remains slightly better (a 0.4 higher CLIP score).
The NSFW model was released by Freepik. It can only calculate NSFW weights across four categories: neutral, low, medium, and high. Its output is mapped to the current API structure on a best-effort basis.
An Ollama integration is currently implemented.
Ollama usage can be configured through environment variables.
ENV | Default value | Meaning |
---|---|---|
OLLAMA_ENABLED | false | true enables loading of the integration |
OLLAMA_HOST | http://localhost:11434 | URL of the Ollama instance |
OLLAMA_NSFW_PROMPT | see code | Prompt used for NSFW detection |
OLLAMA_LABELS_PROMPT | see code | Prompt used for label extraction |
OLLAMA_CAPTION_PROMPT | see code | Prompt used for caption generation |
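As a rough illustration of how these variables can be consumed, the following sketch reads them with the defaults from the table above and sends a captioning request through the ollama client library. The helper name, the fallback prompt, and the example model are assumptions made for this sketch, not the actual implementation.

```python
import os

from ollama import Client

# Read the configuration described above, falling back to the documented defaults.
OLLAMA_ENABLED = os.environ.get("OLLAMA_ENABLED", "false").lower() == "true"
OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
OLLAMA_CAPTION_PROMPT = os.environ.get(
    "OLLAMA_CAPTION_PROMPT",
    # Placeholder default; the real default prompt lives in the code.
    "In up to 3 sentences, describe what you see in this image.",
)


def ollama_caption(image_base64: str, model: str = "gemma3:4b-it-qat") -> str:
    """Ask the configured Ollama instance to caption a base64-encoded image."""
    if not OLLAMA_ENABLED:
        raise RuntimeError("Ollama integration is disabled (OLLAMA_ENABLED=false)")
    client = Client(host=OLLAMA_HOST)
    result = client.generate(model=model, prompt=OLLAMA_CAPTION_PROMPT, images=[image_base64])
    return result["response"]
```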
For the models that can be used with Ollama, see the model library and the official documentation. You usually pull a model in advance so that it is available for inference. You can list the locally available models with the ollama list command; the model name, including its tag, is in the first column:
llava-phi3:latest c7edd7b87593 2.9 GB 45 hours ago
gemma3:4b-it-qat d01ad0579247 4.0 GB 45 hours ago
gemma3:12b-it-qat 5d4fa005e7bb 8.9 GB 2 days ago
gemma3:27b-it-qat 29eb0b9aeda3 18 GB 2 days ago
gemma3:latest c0494fe00251 3.3 GB 6 weeks ago
phi4:latest ac896e5b8b34 9.1 GB 2 months ago
qwen2.5:latest 845dbda0ea48 4.7 GB 2 months ago
The requirements for running an LLM can be roughly estimated from its size: a 4 GiB model will probably fit into any GPU with 8 GiB of VRAM. If the model does not fit into VRAM, it will run on the CPU and be much slower (though it may still be usable). The real requirements also depend on the context length and many other parameters, so you should test which model meets your needs based on inference quality and speed on your hardware.
Flask is the framework used for the API. It allows the API to be built with Python, which is key for this application as it utilizes ML.
PyTorch is key for working with the ML models to generate the outputs. It also enables GPU processing, which speeds up image processing with the models. PyTorch primarily creates and handles tensors, which are crucial for the operation of the models.
Transformers is used for downloading and loading the models. It is also used when processing images with the models.
Pillow is used to fetch the supplied URL and convert the image into the format the models need as input.
pydantic is used for JSON schemas as well as serialization and deserialization of requests and responses (a minimal sketch follows this list of dependencies).
ollama is the integration library used to connect to any given Ollama instance.
timm (PyTorch Image Models) is a PyTorch library of image models; it is currently used for NSFW detection.
xet is an extension used for faster downloading of Hugging Face models.
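To illustrate how pydantic and Pillow typically work together in such a service, here is a minimal sketch that validates an incoming request body and loads the referenced image. The class and field names mirror the request examples shown later in this README; they are not the actual definitions from api.py.

```python
from io import BytesIO
from typing import List, Optional
from urllib.request import urlopen

from PIL import Image
from pydantic import BaseModel


# Illustrative request schema; the field names follow the JSON examples in this README.
class VisionRequest(BaseModel):
    id: Optional[str] = None
    model: Optional[str] = None
    version: Optional[str] = None
    prompt: Optional[str] = None
    url: Optional[str] = None
    images: Optional[List[str]] = None


def load_image(url: str) -> Image.Image:
    """Fetch the image behind the supplied URL and convert it with Pillow."""
    with urlopen(url, timeout=30) as response:
        data = response.read()
    return Image.open(BytesIO(data)).convert("RGB")


request = VisionRequest(url="https://dl.photoprism.app/img/team/avatar.jpg")
if request.url:
    image = load_image(request.url)
```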
Before installing the Python dependencies, please make sure that you have Git and Python 3.12+ (incl. pip) installed on your system, e.g. by running the following command on Ubuntu/Debian Linux:
sudo apt-get install -y git python3 python3-pip python3-venv python3-wheel
You can then install the required libraries in a virtual environment by either using the Makefiles we provide (i.e. run make in the main project directory or a subdirectory) or by manually running the following commands in a service directory, for example:
git clone [email protected]:photoprism/photoprism-vision.git
cd photoprism-vision/describe
python3 -m venv ./venv
. ./venv/bin/activate
./venv/bin/pip install --disable-pip-version-check --upgrade pip
./venv/bin/pip install --disable-pip-version-check -r requirements.txt
Run the Python file app.py in the describe subdirectory to start the describe service after you have installed the dependencies (more services, e.g. for OCR and tag generation, may follow):
./venv/bin/python app.py
The service then listens on port 5000 by default and its API endpoints for generating captions support both GET and POST requests. It can be tested with the curl command (curl.exe on Windows) as shown in the example below:
curl -v -H "Content-Type: application/json" \
--data '{"url":"https://dl.photoprism.app/img/team/avatar.jpg"}' \
-X POST http://localhost:5000/api/v1/vision/caption
At a minimum, a valid image url must be specified for this. In addition, a model name and an arbitrary id can be passed. The API will return the same id in the response. If no id is passed, a randomly generated UUID will be returned instead.
If your client submits POST requests, the request body must be JSON-encoded, e.g.:
{
"id": "3487da77-246e-4b4c-9437-67507177bcd7",
"model": "llama3.2-vision",
"version": "latest",
"prompt": "In up to 3 sentences, describe what you see in this image.",
"images": [
"data:image/png;base64,iVBORw0KGgo..."
]
}
Alternatively, you can perform GET requests with URL-encoded query parameters, which is easier to test without an HTTP client:
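For example, a GET request to the caption endpoint might look like http://localhost:5000/api/v1/vision/caption?url=https://dl.photoprism.app/img/team/avatar.jpg in the browser. The sketch below shows both variants from Python using the requests library; the port, path, and field names are taken from this document, while requests itself and the local file path are just convenient assumptions for the example.

```python
import base64

import requests

CAPTION_URL = "http://localhost:5000/api/v1/vision/caption"

# GET with URL-encoded query parameters.
response = requests.get(
    CAPTION_URL,
    params={
        "url": "https://dl.photoprism.app/img/team/avatar.jpg",
        "model": "kosmos-2",
        "id": "3487da77-246e-4b4c-9437-67507177bcd7",
    },
)
print(response.json())

# POST with a JSON-encoded body containing a base64 data URI.
with open("avatar.jpg", "rb") as f:  # hypothetical local image file
    encoded = base64.b64encode(f.read()).decode("ascii")

response = requests.post(
    CAPTION_URL,
    json={
        "model": "kosmos-2",
        "prompt": "In up to 3 sentences, describe what you see in this image.",
        "images": [f"data:image/jpeg;base64,{encoded}"],
    },
)
print(response.json())
```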
This is the default endpoint of the captioning API. An image can be passed in either with the "url" key or with an "images" key containing an array of base64-encoded images; a "model" used for inference and/or an optional "id" value can also be passed in. The "model" key allows the user to specify which of the available models to use. If no model is given, the application defaults to the configured model.
This is the default endpoint of the labels API. It accepts the same keys as the captioning endpoint: "url" or "images", an optional "model" used for inference, and an optional "id" value. If no model is given, the application defaults to the configured model.
This is the default endpoint of the nsfw API. It also accepts "url" or "images", an optional "model" used for inference, and an optional "id" value. If no model is given, the application defaults to the configured model.
This is the endpoint for generating captions. For the detailed output format, see the ApiResponse and Caption classes in api.py.
This is the endpoint for generating labels. For the detailed output format, see the ApiResponse and Labels classes in api.py.
This is the endpoint for NSFW detection. For the detailed output format, see the ApiResponse and NSFW classes in api.py.
POST /api/v1/vision/caption
{
"id": "b0db2187-7a09-438c-8649-a9c6c0f7b8a1",
"model": "kosmos-2",
"version": "latest",
"prompt": "In up to 3 sentences, describe what you see in this image.",
"images": [
"data:image/png;base64,iVBORw0KGgo..."
]
}
{
"id": "b0db2187-7a09-438c-8649-a9c6c0f7b8a1",
"model": {
"name": "kosmos-2",
"version": "patch14-224"
},
"result": {
"caption": "An image of a man in a suit smiling."
}
}
There is a predefined internal API in the file api.py. The class ImageProcessor defines the methods that any model must provide. Local models should extend the TorchImageProcessor class, which defines the following essential abstract methods that must be implemented (a minimal sketch follows the list):
- _get_model_config returns a dictionary with the configuration keys path (path to the saved model), source (Hugging Face model name), and version (model tag). Usually the latest model is downloaded.
- _download_model downloads the specific model and persists it into the models directory.
- _get_model_name returns the name of the model that will be used for selection based on the request data.
- _load_model loads the chosen model into memory, where it stays until restart.
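The following sketch shows what a local model built on this internal API might look like. The class name, the chosen Hugging Face model, and the method bodies are illustrative assumptions; only the four overridden methods correspond to the abstract methods listed above.

```python
import os

from transformers import AutoModelForVision2Seq, AutoProcessor

from api import TorchImageProcessor  # assuming api.py is importable from the service directory


class ExampleCaptionProcessor(TorchImageProcessor):
    def _get_model_config(self) -> dict:
        # path = where the model is persisted, source = Hugging Face model name,
        # version = model tag (usually the latest model is downloaded).
        return {
            "path": os.path.join("models", "example-caption"),
            "source": "microsoft/kosmos-2-patch14-224",
            "version": "latest",
        }

    def _get_model_name(self) -> str:
        # Name used to select this processor based on the request data.
        return "example-caption"

    def _download_model(self) -> None:
        # Download the model and persist it into the models directory.
        config = self._get_model_config()
        AutoProcessor.from_pretrained(config["source"]).save_pretrained(config["path"])
        AutoModelForVision2Seq.from_pretrained(config["source"]).save_pretrained(config["path"])

    def _load_model(self) -> None:
        # Load the chosen model into memory, where it stays until restart.
        config = self._get_model_config()
        self.processor = AutoProcessor.from_pretrained(config["path"])
        self.model = AutoModelForVision2Seq.from_pretrained(config["path"])
```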
The following endpoints are defined in app.py:
@app.route('/api/v1/vision/caption', methods=['POST', 'GET'])
This is the default endpoint. It checks whether a model is specified; if it is, it calls the service associated with that model and returns the response with the data. If no model is specified, it uses kosmos-2.
@app.route('/api/v1/vision/labels/<model_name>', methods=['POST', 'GET'])
This endpoint dynamically routes the request to the model given by the model_name URL path variable, as sketched below.
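As a rough sketch of how this dynamic routing can work (the handler body, the dispatch dictionary, and the labels() call are illustrative, not the actual app.py code):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical registry mapping model names to loaded processors,
# e.g. {"example-caption": ExampleCaptionProcessor(), ...}.
PROCESSORS = {}


@app.route('/api/v1/vision/labels/<model_name>', methods=['POST', 'GET'])
def labels(model_name: str):
    processor = PROCESSORS.get(model_name)
    if processor is None:
        return jsonify({"error": f"unknown model: {model_name}"}), 404
    payload = request.get_json(silent=True) if request.method == 'POST' else request.args
    # labels() is a hypothetical method that runs inference and returns a serializable result.
    return jsonify(processor.labels(payload))
```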
We would like to thank everyone involved, especially Aatif Dawawala for getting things started and contributing the initial code:
Follow our step-by-step guide to learn how to submit new features, bug fixes, and documentation enhancements.
The files in this repository are licensed under the Apache License, Version 2.0 (the “License”).
When adding dependencies, for example in requirements.txt, please make sure their licenses are compatible. While code in this repository may be used in projects distributed under the terms of the GNU General Public License (GPL) or AGPL v3, the reverse is not possible.
Except as required by applicable law or agreed to in writing, this software is distributed "AS IS" WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY IMPLIED WARRANTIES OR CONDITIONS OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
Copyright © 2024 PhotoPrism UG. By using the software and services we provide, you agree to our Terms of Service, Privacy Policy, and Code of Conduct. PhotoPrism® is a registered trademark.