LamAPI provides semantic access to Wikidata: you can index dumps into MongoDB, compute rich type hierarchies, and expose everything through a REST API or directly from Python. This document covers how to run the service, use the CLI tooling, and build the search index.
- Python library (`lamapi/`) – core retrieval logic (types, literals, objects, summaries, etc.).
- REST API (`lamapi/server.py`) – Flask application wrapping the library and serving JSON responses.
- Data pipeline scripts (`scripts/`) – tooling for extracting edges, building SQLite closures, parsing Wikidata dumps, and materialising Elasticsearch/MongoDB indices.
- Docker assets – `docker-compose*.yml` files wire MongoDB, Elasticsearch, and the LamAPI service together for local development or production.
- Docker + Docker Compose (recommended for API usage).
- Python 3.9+ with `pip` if you plan to run scripts locally.
- Sufficient disk space (> 300 GB recommended) for dumps, the SQLite DB, and MongoDB.
- API base URL: `http://localhost:5000` (Swagger UI available at the root).
- MongoDB exposed on `localhost:27017`, Elasticsearch on `localhost:9200`.
- Environment variables can be customised through `.env` or compose overrides.
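If something misbehaves, a quick way to confirm the stack is listening on these default ports is a plain socket probe. This is a minimal sketch, not part of LamAPI; adjust hosts and ports if you customised them:

```python
import socket

# Probe the default local endpoints; a failure usually means the
# corresponding container is not up yet or a port mapping was changed.
SERVICES = [("LamAPI", "localhost", 5000),
            ("MongoDB", "localhost", 27017),
            ("Elasticsearch", "localhost", 9200)]

for name, host, port in SERVICES:
    try:
        socket.create_connection((host, port), timeout=2).close()
        print(f"{name}: reachable on {host}:{port}")
    except OSError as exc:
        print(f"{name}: NOT reachable ({exc})")
```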
To run the API locally instead:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e .[dev]
python lamapi/server.py
```

Set the required environment variables first:
```bash
# Cluster Configuration
CLUSTER_NAME=lamapi
LICENSE=basic
STACK_VERSION=8.8.1

# Elasticsearch Configuration
ELASTICSEARCH_USERNAME=lamapi
ELASTIC_PASSWORD=<elastic_password>
ELASTIC_ENDPOINT=es01:9200
ELASTIC_FINGERPRINT=
ELASTIC_PORT=9200

# Kibana Configuration
KIBANA_PASSWORD=<kibana_password>
KIBANA_PORT=5601

# MongoDB Configuration
MONGO_ENDPOINT=mongo:27017
MONGO_INITDB_ROOT_USERNAME=<mongo_username>
MONGO_INITDB_ROOT_PASSWORD=<mongo_password>
MONGO_PORT=27017
MONGO_VERSION=6.0

# Other Configuration
THREADS=8
PYTHON_VERSION=3.11
LAMAPI_TOKEN=<your_token>
LAMAPI_PORT=5000
SUPPORTED_KGS=WIKIDATA
MEM_LIMIT=2G
```

Connection strategy:

- Set `LAMAPI_RUNTIME=docker` inside containers (the Dockerfile already does this).
- Leave it unset or `auto` when running locally: LamAPI will rewrite known service hosts such as `mongo` or `es01` to `localhost` so CLI commands work out of the box.
- Override `LAMAPI_LOCAL_MONGO_HOST` / `LAMAPI_LOCAL_ELASTIC_HOST` if your databases live on a different machine.
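To make the rewrite rule concrete, here is a minimal sketch of what such host resolution could look like. The helper `resolve_host` and its exact behaviour are illustrative assumptions, not LamAPI's actual implementation:

```python
import os

# Illustrative only: maps in-container service names to local overrides.
LOCAL_HOSTS = {
    "mongo": os.getenv("LAMAPI_LOCAL_MONGO_HOST", "localhost"),
    "es01": os.getenv("LAMAPI_LOCAL_ELASTIC_HOST", "localhost"),
}

def resolve_host(endpoint: str) -> str:
    """Rewrite known service hosts to localhost when not running in Docker."""
    if os.getenv("LAMAPI_RUNTIME", "auto") == "docker":
        return endpoint  # service names resolve via Docker's internal DNS
    host, _, port = endpoint.partition(":")
    host = LOCAL_HOSTS.get(host, host)
    return f"{host}:{port}" if port else host

print(resolve_host("mongo:27017"))  # -> localhost:27017 outside Docker
print(resolve_host("es01:9200"))    # -> localhost:9200 outside Docker
```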
Import the package when you need programmatic access to retrievers:

```python
from lamapi import Database, TypesRetriever

db = Database()
retriever = TypesRetriever(db)
types = retriever.get_types_output(["Q30"], kg="wikidata")
```

Scripts in `scripts/` provide CLI utilities for data preparation and indexing.
LamAPI ships with multiple namespaces (Swagger UI shows schemas and payloads):
| Namespace | Endpoint | Method | Description |
|---|---|---|---|
| `info` | `/info/kgs` | GET | List supported knowledge graphs. |
| `entity` | `/entity/types` | POST | Retrieve explicit + extended types for entities. |
| `entity` | `/entity/objects` | POST | Fetch object neighbours for entities. |
| `entity` | `/entity/literals` | POST | Fetch literal attributes. |
| `entity` | `/entity/predicates` | POST | Retrieve predicates and relations. |
| `lookup` | `/lookup/search` | GET | Free-text entity lookup. |
| `lookup` | `/lookup/sameas` | POST | Same-as entity discovery. |
| `sti` | `/sti/column-analysis` | POST | Semantic table interpretation helpers. |
| `sti` | `/sti/ner` | POST | Named-entity recognition utilities. |
| `classify` | `/classify/literals` | POST | Literal classifier outputs. |
| `summary` | `/summary/statistics` | GET | Dataset-level statistics. |
Confirm contracts via Swagger at `http://localhost:5000` once the API is running.
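As a rough illustration, a types request might look like the sketch below. The query parameters and the `{"json": [...]}` payload shape are assumptions; verify the exact contract in Swagger before relying on it:

```python
import requests

# Hypothetical request shape; confirm parameter names and payload in Swagger.
response = requests.post(
    "http://localhost:5000/entity/types",
    params={"token": "<your_token>", "kg": "wikidata"},
    json={"json": ["Q30", "Q60"]},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```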
Transform an official Wikidata dump into the artefacts LamAPI expects. Run the steps below from the repository root (`emd-lamapi/`).
1. Download the Wikidata JSON dump:

   ```bash
   mkdir -p data/wikidata
   curl -o data/wikidata/latest-all.json.bz2 \
     https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2
   ```
2. Extract type hierarchy edges using `scripts/extract_type_hierarchy.py` to produce `instance_of.tsv` (P31) and `subclass_of.tsv` (P279); a rough sketch of the per-entity extraction appears after these steps:

   ```bash
   python3 scripts/extract_type_hierarchy.py \
     --input data/wikidata/latest-all.json.bz2 \
     --output-instance data/wikidata/instance_of.tsv \
     --output-subclass data/wikidata/subclass_of.tsv

   # Stream with pbzip2 for better throughput
   pbzip2 -dc data/wikidata/latest-all.json.bz2 | \
     python3 scripts/extract_type_hierarchy.py --stdin-json \
       --output-instance data/wikidata/instance_of.tsv \
       --output-subclass data/wikidata/subclass_of.tsv
   ```
3. Compute the type transitive closure with `scripts/infer_types.py`, creating `types.db` (see the closure sketch after these steps):

   ```bash
   python3 scripts/infer_types.py \
     --instance-of data/wikidata/instance_of.tsv \
     --subclass-of data/wikidata/subclass_of.tsv \
     --output-db data/wikidata/types.db
   ```

   Add `--no-closure` to skip materialising the closure if you only need raw edges.
4. Parse the Wikidata dump into MongoDB using the parallel ingestion script:

   ```bash
   # Stream with pbzip2 for better throughput
   pbzip2 -dc data/wikidata/latest-all.json.bz2 | \
     python3 scripts/parse_wikidata_dump_parallel.py \
       --input data/wikidata/latest-all.json.bz2 \
       --types-db-path types.db \
       --threads 16
   ```

   Inspect `python3 scripts/parse_wikidata_dump_parallel.py --help` for batching and worker options.
5. Create Elasticsearch/MongoDB indices using `scripts/indexing.py` with a configuration under `scripts/index_confs/`:

   ```bash
   python3 ./scripts/indexing.py index \
     --db_name mydb \
     --collection_name mycoll \
     --mapping_file ./scripts/mapping.json \
     --batch-size 8192 \
     --max-threads 4
   ```

   Tweak the JSON config to control retrievers, filters, and indexed fields.
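As a rough illustration of step 2: extracting P31/P279 edges boils down to reading each JSON entity from the dump and collecting the target QIDs of its `P31`/`P279` claims. The sketch below is a minimal version of that idea, not the repository's implementation; the function name and edge format are illustrative:

```python
import json

def extract_edges(line: str):
    """Yield (child_qid, relation, parent_qid) edges for P31/P279 claims."""
    line = line.strip().rstrip(",")   # dump lines are comma-separated JSON objects
    if line in ("[", "]", ""):        # skip the surrounding array brackets
        return
    entity = json.loads(line)
    qid = entity.get("id", "")
    for prop, relation in (("P31", "instance_of"), ("P279", "subclass_of")):
        for claim in entity.get("claims", {}).get(prop, []):
            value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
            if isinstance(value, dict) and "id" in value:
                yield qid, relation, value["id"]
```

Step 3's transitive closure can likewise be pictured with plain SQLite. The sketch below assumes `subclass_of` and `closure` tables and covers P279 ancestors only for brevity; the schema is an assumption, not the layout `infer_types.py` actually uses. Note how `INSERT OR IGNORE` keeps re-runs idempotent, matching the tip at the end of this document:

```python
import sqlite3

conn = sqlite3.connect("types.db")  # illustrative path and schema
conn.executescript("""
CREATE TABLE IF NOT EXISTS subclass_of (child TEXT, parent TEXT, PRIMARY KEY (child, parent));
CREATE TABLE IF NOT EXISTS closure (child TEXT, ancestor TEXT, PRIMARY KEY (child, ancestor));
""")
# Walk P279 edges upward with a recursive CTE and materialise every
# (class, ancestor) pair; UNION deduplicates and guarantees termination.
conn.execute("""
WITH RECURSIVE up(child, ancestor) AS (
    SELECT child, parent FROM subclass_of
    UNION
    SELECT up.child, s.parent FROM up JOIN subclass_of s ON s.child = up.ancestor
)
INSERT OR IGNORE INTO closure (child, ancestor)
SELECT child, ancestor FROM up
""")
conn.commit()
```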
After these steps LamAPI can answer type, lookup, literal, and object queries against the populated databases.
| Script | Purpose |
|---|---|
| `scripts/extract_type_hierarchy.py` | Multi-threaded extractor for P31/P279 edges; supports streaming input. |
| `scripts/infer_types.py` | Builds a SQLite DB with a transitive closure to accelerate type lookups. |
| `scripts/parse_wikidata_dump_parallel.py` | Ingests Wikidata entities into MongoDB using a threaded pipeline. |
| `scripts/indexing.py` | Materialises search indices based on JSON configs. |
| `scripts/experiments.py`, `scripts/summary.py`, etc. | Additional analytics helpers and evaluations. |
Run `python3 <script> --help` for the complete argument list.
- Prefer streaming the dump through `pbzip2` to keep CPUs saturated while avoiding disk bottlenecks.
- `extract_type_hierarchy.py` and `infer_types.py` deduplicate edges with `INSERT OR IGNORE`, so re-running them is safe.
- Store TSVs and SQLite DBs inside a dedicated `data/` directory; they can reach tens of GB.
- Never commit `.env` files or credentials; `.gitignore` already excludes them.
- Swagger UI at `http://localhost:5000/docs` is the quickest way to try endpoints once the stack is live.
Feel free to explore the code and examples in this repository for a deeper understanding of how the entity types are defined, extended, and mapped to NER categories.