Utilities for searching images by visual similarity, using off-the-shelf open source technology.
In this repository, you will find:
- Python scripts for computing embedding vectors from folders of image files using img2vec.
- A Docker configuration for running the Qdrant vector search engine locally.
- Python scripts for ingesting the embedding vectors into Qdrant and running nearest-neighbour searches on lists of images.
The scripts were developed specifically for the ONiT research project, which means there are some project-specific conventions and schemas baked into them. Use at your own risk!
The script utils/compute_image_vectors.py performs the following steps:
- Starting from a configured folder path, it loads all `*.jpg` images from the folder and its subfolders. Images are not included in this repository!
- For each image, it generates an embedding vector (using `efficientnet_b3`).
- Converts vectors to 256 dimensions using PCA.
- Loads a CSV file with publicly accessible IIIF links for each image. (An example file is included in this repository.)
- Generates a result file with the following data for each image. (The result file is in JSONL format, and written to the `data/vectors` folder.)
  - image identifier (= filename without `.jpg` extension)
  - IIIF image URL
  - embedding vector
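The file handling around the steps above can be sketched as follows. This is a minimal illustration, not the actual script: the CSV column names (`id`, `iiif_url`), the JSONL field names (`id`, `iiif_url`, `vector`) and the `embed` callable are all assumptions made for the example. In the real script, the embedding would come from img2vec followed by PCA; here it is passed in as a placeholder function so the sketch stays self-contained.

```python
import csv
import json
from pathlib import Path
from typing import Callable, Dict, List, Sequence


def load_iiif_links(csv_path: Path) -> Dict[str, str]:
    """Map image identifier -> IIIF URL (column names are hypothetical)."""
    with csv_path.open(newline="", encoding="utf-8") as f:
        return {row["id"]: row["iiif_url"] for row in csv.DictReader(f)}


def find_jpgs(root: Path) -> List[Path]:
    """Return all *.jpg files under root, including subfolders, in stable order."""
    return sorted(root.rglob("*.jpg"))


def write_jsonl(
    images: Sequence[Path],
    links: Dict[str, str],
    embed: Callable[[Path], Sequence[float]],  # placeholder for img2vec + PCA
    out_path: Path,
) -> int:
    """Write one JSON record per image and return the number of records."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for image in images:
            image_id = image.stem  # filename without the .jpg extension
            record = {
                "id": image_id,
                "iiif_url": links.get(image_id),  # None if the CSV has no entry
                "vector": [float(x) for x in embed(image)],
            }
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

JSONL keeps one record per line, so a large result file can be streamed into the database without loading it into memory all at once.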
To enable fast similarity search, image embedding vectors are loaded into a Qdrant database. The database
folder contains a docker-compose.yml file which starts an empty Qdrant instance on the default port (6333).
- Run `docker compose up` to start the database server.
- Run `python init.py` to initialize an empty database collection (named `onit`), with a schema matching our image embedding data.
- Run `python ingest.py` to import the JSONL data file (generated in the previous step) into Qdrant.
- The `search_example.py` script shows how you can run a nearest neighbour search for a specific image, using its ID as a query parameter. Due to the way Qdrant works, this is a two-step query:
  - A first query is needed to retrieve the vector for an image, given its ID.
  - Using the vector as an input, a second query retrieves a list of N nearby vectors (and image records).
  - Note that the second query will include the original query image as well.
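To make the two-step logic concrete, here is a plain-Python sketch that uses an in-memory dictionary as a stand-in for the Qdrant collection. It does not call Qdrant at all and is not taken from `search_example.py`. The function names are hypothetical, and cosine similarity is an assumption, since the distance metric configured for the collection is not stated here. The `skip_self` flag shows one way to handle the query image appearing in its own results.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbours(
    collection: Dict[str, Sequence[float]],  # stand-in for the vector database
    image_id: str,
    n: int,
    skip_self: bool = True,
) -> List[Tuple[str, float]]:
    # Step 1: look up the stored vector for the query image by its ID.
    query_vector = collection[image_id]

    # Step 2: score every stored vector against the query vector,
    # most similar first. The query image itself is part of this ranking.
    scored = sorted(
        ((other_id, cosine(query_vector, vec)) for other_id, vec in collection.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # Optionally drop the query image from its own result list.
    if skip_self:
        scored = [pair for pair in scored if pair[0] != image_id]
    return scored[:n]
```

A real vector database avoids the exhaustive scan in step 2 by using an approximate nearest-neighbour index, but the shape of the query (ID to vector, then vector to neighbours) is the same.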
The script utils/query_similarities.py takes a list of image IDs as input, and runs a bulk search in Qdrant for the N nearest neighbours of each image (N is currently configured to 50). The output is written to a JSON file.
The JSON file contains an array of reference images, and their neighbours. The script includes the score for each neighbour, a number delivered by Qdrant as a measure of relative similarity.
The script utils/generate_html_preview.py takes the JSON similarity result as input, and generates an HTML preview file.
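As a rough illustration of this last step, the sketch below renders a similarity result as a simple HTML page. It is not the actual script. The JSON shape it expects (a list of entries with `id`, `iiif_url` and a `neighbours` list whose items carry `id`, `iiif_url` and `score`) is an assumption made for the example, as is the function name `render_preview`.

```python
import html
from typing import Any, Dict, List


def render_preview(results: List[Dict[str, Any]]) -> str:
    """Build an HTML page showing each reference image with its neighbours."""
    parts = ["<!DOCTYPE html>", "<html>", "<body>"]
    for entry in results:
        # Escape all values, since IDs and URLs come from external data.
        ref_id = html.escape(str(entry["id"]))
        ref_url = html.escape(str(entry["iiif_url"]))
        parts.append(f"<h2>{ref_id}</h2>")
        parts.append(f'<img src="{ref_url}" alt="{ref_id}" height="200">')
        parts.append("<ul>")
        for neighbour in entry["neighbours"]:
            n_id = html.escape(str(neighbour["id"]))
            n_url = html.escape(str(neighbour["iiif_url"]))
            score = float(neighbour["score"])
            parts.append(
                f'<li><img src="{n_url}" alt="{n_id}" height="100"> '
                f"{n_id} (score: {score:.3f})</li>"
            )
        parts.append("</ul>")
    parts.extend(["</body>", "</html>"])
    return "\n".join(parts)
```

The returned string can be written to a file and opened in a browser; because the images are referenced through their public IIIF URLs, the preview works without local copies of the image files.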