ONiT Image Similarity

Utilities for searching images by visual similarity, using off-the-shelf open source technology.

Overview

In this repository, you will find:

Python scripts for computing embedding vectors from folders of image files using img2vec.
A Docker configuration for running the Qdrant vector search engine locally.
Python scripts for ingesting the embedding vectors into Qdrant and running nearest-neighbour searches on lists of images.

The scripts were developed specifically for the ONiT research project, so some project-specific conventions and schemas are baked into them. Use at your own risk!

Generating Image Embedding Vectors

The script utils/compute_image_vectors.py performs the following steps:

Starting from a configured folder path, it loads all *.jpg images from the folder and its subfolders. (The images themselves are not included in this repository.)
For each image, it generates an embedding vector (using efficientnet_b3).
It reduces the vectors to 256 dimensions using PCA.
It loads a CSV file with publicly accessible IIIF links for each image. (An example file is included in this repository.)
It generates a result file with the following data for each image. (The result file is in JSONL format and is written to the data/vectors folder.)
- image identifier (= filename without .jpg extension)
- IIIF image URL
- embedding vector
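
The last two steps (joining the IIIF links with the vectors and writing JSONL) can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual script: the CSV column names (`id`, `iiif_url`), the JSONL field names, and the helpers `load_iiif_links` and `write_vector_records` are all assumptions made for this example. Skipping images that have no link is likewise a choice made for the sketch.

```python
import csv
import io
import json


def load_iiif_links(csv_file):
    """Map image ID -> IIIF URL. Column names 'id' and 'iiif_url' are assumed."""
    return {row["id"]: row["iiif_url"] for row in csv.DictReader(csv_file)}


def write_vector_records(vectors, iiif_links, out_file):
    """Write one JSON object per line (JSONL) for every image that has a link.

    `vectors` maps image ID -> list of floats (assumed to be PCA-reduced already).
    Returns the number of records written.
    """
    written = 0
    for image_id, vector in vectors.items():
        url = iiif_links.get(image_id)
        if url is None:
            continue  # no public IIIF link known for this image
        record = {"id": image_id, "iiif_url": url, "vector": list(vector)}
        out_file.write(json.dumps(record) + "\n")
        written += 1
    return written


# Tiny in-memory demo; real input and output would be files on disk.
links_csv = io.StringIO("id,iiif_url\nimg_001,https://example.org/iiif/img_001\n")
demo_vectors = {"img_001": [0.1, 0.2, 0.3], "img_002": [0.4, 0.5, 0.6]}
demo_out = io.StringIO()
demo_count = write_vector_records(demo_vectors, load_iiif_links(links_csv), demo_out)
```

Writing one record per line means a later ingest step can stream the file line by line instead of loading every vector into memory at once.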

Bootstrapping Qdrant

To enable fast similarity search, image embedding vectors are loaded into a Qdrant database. The database folder contains a docker-compose.yml file which starts an empty Qdrant instance on the default port (6333).

Run docker compose up to start the database server.
Run python init.py to initialize an empty database collection (named onit), with a schema matching our image embedding data.
Run python ingest.py to import the JSONL data file (generated in the previous step) into Qdrant.
The search_example.py script shows how you can run a nearest neighbour search for a specific image, using its ID as a query parameter. Due to the way Qdrant works, this is a two-step query:
- A first query is needed to retrieve the vector for an image, given its ID
- Using the vector as an input, a second query retrieves a list of N nearby vectors (and image records)
- Note that the results of the second query will also include the original query image.
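
To make the two-step logic concrete, here is a small, Qdrant-free illustration. It is a brute-force stand-in rather than the code in search_example.py: the in-memory `toy_index`, the function `nearest_neighbours`, and the use of cosine similarity as the distance measure are assumptions made for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(index, query_id, n, include_query=True):
    """Two-step search over `index` (dict: image ID -> vector).

    Step 1: look up the stored vector for `query_id`.
    Step 2: rank all stored vectors by similarity to it and keep the top `n`.
    """
    query_vector = index[query_id]  # step 1
    scored = [
        (image_id, cosine_similarity(query_vector, vector))
        for image_id, vector in index.items()
        if include_query or image_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # step 2
    return scored[:n]


toy_index = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
}
hits = nearest_neighbours(toy_index, "a", n=2)
```

Because the query image's own vector is part of the index, it comes back as the top hit. Passing `include_query=False` in this sketch filters it out, which is one way to handle the behaviour described in the note above.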

Bulk Similarity Utility

The script utils/query_similarities.py takes a list of image IDs as input and runs a bulk search in Qdrant for the N nearest neighbours of each image (N is currently configured to 50). The output is written to a JSON file.

The JSON file contains an array of reference images and their neighbours. Each neighbour is listed with its score, a number returned by Qdrant as a measure of relative similarity.
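
The shape of such a result can be sketched as below. The field names (`reference`, `neighbours`, `id`, `score`) and the functions `bulk_similarity` and `fake_search` are hypothetical and chosen for illustration; the actual output schema of query_similarities.py may differ.

```python
import json


def bulk_similarity(image_ids, search, n=50):
    """Build one result entry per reference image.

    `search(image_id, limit)` is a caller-supplied function returning
    (neighbour_id, score) pairs; in practice it would query Qdrant.
    """
    results = []
    for image_id in image_ids:
        neighbours = [
            {"id": neighbour_id, "score": score}
            for neighbour_id, score in search(image_id, n)
        ]
        results.append({"reference": image_id, "neighbours": neighbours})
    return results


def fake_search(image_id, limit):
    """Stand-in for a real Qdrant query, returning fixed neighbours."""
    return [(image_id, 1.0), ("other_image", 0.87)][:limit]


bulk_output = json.dumps(
    bulk_similarity(["img_001", "img_002"], fake_search, n=2), indent=2
)
```

Passing the search function in as a parameter keeps the result-building logic independent of the database, so it can be tried out without a running Qdrant instance.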

HTML Preview

The script utils/generate_html_preview.py takes the JSON similarity result as input, and generates an HTML preview file.
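
A preview of this kind could be generated roughly as follows. This is a sketch under assumptions, not the project's script: it presumes each neighbour record carries an `iiif_url` and a `score` field, and the function name `render_preview` and the page layout are invented for the example.

```python
from html import escape


def render_preview(results):
    """Return an HTML page with one heading and image row per reference image."""
    rows = []
    for entry in results:
        thumbs = "".join(
            '<img src="{}" title="{:.3f}" height="120">'.format(
                escape(n["iiif_url"], quote=True), n["score"]
            )
            for n in entry["neighbours"]
        )
        rows.append(
            "<h2>{}</h2>\n<div>{}</div>".format(escape(entry["reference"]), thumbs)
        )
    body = "\n".join(rows)
    return "<!DOCTYPE html>\n<html><body>\n{}\n</body></html>\n".format(body)


sample = [{
    "reference": "img_001",
    "neighbours": [
        {"iiif_url": "https://example.org/iiif/a.jpg?x=1&y=2", "score": 0.9731},
    ],
}]
preview_html = render_preview(sample)
```

Escaping the identifiers and URLs before inserting them into the markup keeps characters such as `&` in IIIF query strings from producing invalid HTML.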
