LeMaterial-Fetcher is designed to fetch data from any external source, process it, and store it in a PostgreSQL database in a pre-defined format, with structure-level validation and database-level validators. It is highly concurrent, so that data fetching and processing run efficiently.
The objective is to retrieve information from various sources and establish a local database that can be unified. This database will enable us to process and utilize the data according to our specific requirements, which can then be uploaded to an online and easily accessible place like Hugging Face.
Explore the datasets built with this tool on Hugging Face 🤗:
👉 LeMat-Bulk
👉 LeMat-Traj
This project relies entirely on the valuable contributions of several materials science database projects:
- Materials Project - A comprehensive database of computed materials properties funded by the U.S. Department of Energy and developed by the Lawrence Berkeley National Laboratory in collaboration with several other laboratories and universities
- Alexandria Library - A quantum-accurate materials library developed by ICAMS at Ruhr University Bochum
- Open Quantum Materials Database (OQMD) - An extensive collection of DFT calculated properties maintained by researchers at Northwestern University
We gratefully acknowledge these projects and their dedication to open materials science data. Our tool is built entirely on the foundation of their well-maintained databases and research efforts.
- Python 3.11 or later
- PostgreSQL database
- Environment variables set for configuration
- Clone the repository:
    git clone git@github.com:LeMaterial/lematerial-fetcher.git
    cd lematerial-fetcher
- Set up your environment variables. Copy the provided template and customize it:
    cp .env.example .env
    vim .env
- Install the package:
    # Using uv (recommended)
    uv add git+https://github.com/LeMaterial/lematerial-fetcher.git
    # or
    uv pip install git+https://github.com/LeMaterial/lematerial-fetcher.git
    # Or using pip
    pip install git+https://github.com/LeMaterial/lematerial-fetcher.git
The tool can be configured in two ways:
- Command-line arguments (recommended for most options)
- Environment variables (preferred for sensitive information)
For sensitive information like passwords, use environment variables with the LEMATERIALFETCHER_ prefix:
# Database credentials
export LEMATERIALFETCHER_DB_PASSWORD=your_password
export LEMATERIALFETCHER_MYSQL_PASSWORD=your_mysql_password
# Hugging Face credentials
export LEMATERIALFETCHER_HF_TOKEN=your_huggingface_token
A template .env.example file is provided that you can copy to .env and customize.
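To illustrate the prefix convention described above, here is a minimal Python sketch that gathers every variable starting with LEMATERIALFETCHER_ into a dictionary. The function name `collect_prefixed_settings` and the lowercased key format are assumptions made for this example; this is not the tool's actual configuration loader.

```python
import os

# Prefix taken from the documentation above.
PREFIX = "LEMATERIALFETCHER_"


def collect_prefixed_settings(environ=None, prefix=PREFIX):
    """Return {lowercased_name: value} for every variable starting with prefix.

    Hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


if __name__ == "__main__":
    settings = collect_prefixed_settings()
    # Print only the names so that secrets never reach the terminal or logs.
    print(sorted(settings))
```

Running the script after the exports above would list names such as db_password and hf_token, which is a quick way to confirm that the variables are visible to child processes.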
The CLI provides several commands for different data sources and operations. Here's a comprehensive guide:
lematerial-fetcher [GLOBAL_OPTIONS] COMMAND [COMMAND_OPTIONS]
- --debug: Run operations in the main process for debugging
- --cache-dir DIR: Directory for temporary data (default: ~/.cache/lematerial_fetcher)
- Materials Project (MP)
    # Fetch structures
    lematerial-fetcher mp fetch --table-name mp_structures --num-workers 4
    # Fetch tasks
    lematerial-fetcher mp fetch --tasks --table-name mp_tasks
    # Transform data
    lematerial-fetcher mp transform --table-name source_table --dest-table-name dest_table
- Alexandria
    # Fetch structures
    lematerial-fetcher alexandria fetch --table-name alex_structures --functional pbe
    # Fetch trajectories
    lematerial-fetcher alexandria fetch --traj --table-name alex_trajectories
    # Transform data
    lematerial-fetcher alexandria transform --table-name source_table --dest-table-name dest_table
- OQMD
    # Fetch data
    lematerial-fetcher oqmd fetch --table-name oqmd_structures
    # Transform data
    lematerial-fetcher oqmd transform --table-name source_table --dest-table-name dest_table
- Push to Hugging Face
    lematerial-fetcher push --table-name my_table --hf-repo-id my-repo
These options are available across most commands:
- --db-conn-str STR: Complete database connection string
- --db-user USER: Database username
- --db-host HOST: Database host (default: localhost)
- --db-name NAME: Database name (default: lematerial)

- --num-workers N: Number of parallel workers
- --log-dir DIR: Directory for logs (default: ./logs)
- --max-retries N: Maximum retry attempts (default: 3)
- --retry-delay N: Delay between retries in seconds (default: 2)
- --log-every N: Log frequency (default: 1000)

- --offset N: Starting offset (default: 0)
- --table-name NAME: Target table name
- --limit N: Items per API request (default: 500)

- --batch-size N: Batch processing size (default: 500)
- --dest-table-name NAME: Destination table name
- --traj: Transform trajectory data
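The way --offset, --limit, --max-retries, and --retry-delay could interact is sketched below in Python. This is an illustration under stated assumptions, not the tool's implementation: `fetch_page` is a hypothetical callable standing in for one API request, and "maximum retry attempts" is interpreted here as retries allowed after the first failed attempt.

```python
import time


def fetch_with_retry(fetch_page, offset, limit,
                     max_retries=3, retry_delay=2.0, sleep=time.sleep):
    """Call fetch_page(offset, limit), retrying up to max_retries times.

    Assumption: one initial attempt plus at most max_retries retries.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch_page(offset, limit)
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(retry_delay)  # wait before the next attempt


def iter_items(fetch_page, offset=0, limit=500, **retry_kwargs):
    """Yield items page by page, starting at offset, limit items per request."""
    while True:
        page = fetch_with_retry(fetch_page, offset, limit, **retry_kwargs)
        yield from page
        if len(page) < limit:
            return  # a short or empty page marks the end of the data
        offset += limit


if __name__ == "__main__":
    data = list(range(1200))  # stand-in for a remote source

    def fake_api(offset, limit):
        return data[offset:offset + limit]

    print(sum(1 for _ in iter_items(fake_api, offset=0, limit=500)))
```

Passing `sleep` as a parameter keeps the delay injectable, so the retry logic can be exercised without real waiting.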
- Fetch from Materials Project with custom configuration:
    lematerial-fetcher mp fetch \
      --table-name mp_structures \
      --num-workers 4 \
      --db-host localhost \
      --db-name materials \
      --log-dir ./mp_logs
- Transform Alexandria data with source and destination databases:
    lematerial-fetcher alexandria transform \
      --table-name source_table \
      --dest-table-name dest_table \
      --batch-size 1000 \
      --db-host source_host \
      --dest-db-host dest_host
- Push to Hugging Face with custom chunk size:
    lematerial-fetcher push \
      --table-name my_table \
      --hf-repo-id my-repo \
      --chunk-size 2000 \
      --max-rows 10000
You can configure the database connection in two ways:
- Using individual parameters:
    # Set password in environment
    export LEMATERIALFETCHER_DB_PASSWORD=your_password
    # Run command
    lematerial-fetcher mp fetch --db-user username --db-name database_name
- Using a connection string:
    lematerial-fetcher mp fetch --db-conn-str="host=localhost user=username password=password dbname=database_name sslmode=disable"
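The two approaches above describe the same connection. As a hedged sketch, the following Python helper assembles a keyword/value connection string of the form shown above from individual parameters, reading the password from LEMATERIALFETCHER_DB_PASSWORD. The helper names `build_conn_str` and `_quote` are hypothetical and not part of lematerial-fetcher; the quoting follows the PostgreSQL keyword/value convention (single quotes around values with spaces, backslash escapes for quotes and backslashes).

```python
import os


def _quote(value):
    """Quote a value if it is empty or contains spaces, quotes, or backslashes."""
    text = str(value)
    if text and not any(ch.isspace() or ch in "'\\" for ch in text):
        return text
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def build_conn_str(user, dbname, host="localhost", password=None,
                   sslmode="disable"):
    """Assemble 'key=value' pairs in the order used in the example above."""
    parts = {
        "host": host,
        "user": user,
        "password": password,
        "dbname": dbname,
        "sslmode": sslmode,
    }
    # Skip keys whose value is None (for example, no password set).
    return " ".join(f"{k}={_quote(v)}" for k, v in parts.items() if v is not None)


if __name__ == "__main__":
    conn_str = build_conn_str(
        user="username",
        dbname="database_name",
        password=os.environ.get("LEMATERIALFETCHER_DB_PASSWORD"),
    )
    # Avoid printing conn_str itself, since it may contain the password.
    print("password" in conn_str)
```

Keeping the password in the environment and building the string at run time avoids leaving credentials in shell history.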
MySQL-specific options:
- --mysql-host HOST: MySQL host (default: localhost)
- --mysql-user USER: MySQL username
- --mysql-database NAME: MySQL database name (default: lematerial)
- --mysql-cert-path PATH: Path to MySQL SSL certificate
This project leverages data from several established materials databases. Please see ACKNOWLEDGEMENTS.md for complete information about the data sources used and proper citations for academic use.
This code base is the property of Entalpic and is licensed under the Apache License, Version 2.0 (the "License").
Copyright 2025 Entalpic
For easy deployment and execution, we provide a Docker setup that includes both PostgreSQL and MySQL databases. This setup allows you to run the entire pipeline with a single command.
docker build -t lematerial-fetcher .
- Basic Run
    docker run -it lematerial-fetcher run-pipeline
- With Hugging Face Integration
    docker run -it \
      -e LEMATERIALFETCHER_HF_TOKEN=your_huggingface_token \
      lematerial-fetcher run-pipeline
- Interactive Shell
    docker run -it lematerial-fetcher bash
- Persistent Data Storage
    docker run -it \
      -v $(pwd)/data:/app/data \
      -v $(pwd)/logs:/app/logs \
      lematerial-fetcher run-pipeline
The Docker container exposes the following ports:
- PostgreSQL: 5432
- MySQL: 3306
You can connect to the databases using these credentials:
- PostgreSQL:
  - Host: localhost
  - Port: 5432
  - User: root
  - Password: root
  - Database: lematerial
- MySQL:
  - Host: localhost
  - Port: 3306
  - User: root
  - Password: root
  - Database: lematerial
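Before pointing a database client at these settings, it can help to check that the ports are reachable at all. The Python sketch below does a plain TCP check with the standard library. It assumes the container's ports are published to the host (for example with Docker's -p flag, which the run commands above do not include); `port_is_open` is a hypothetical helper written for this example.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports taken from the list above.
    for name, port in {"PostgreSQL": 5432, "MySQL": 3306}.items():
        status = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on localhost:{port}: {status}")
```

A successful check only shows that something is listening on the port; authentication with the credentials above still has to be verified with an actual client.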
You can customize the setup using environment variables:
docker run -it \
-e LEMATERIALFETCHER_DB_PASSWORD=your_password \
-e LEMATERIALFETCHER_MYSQL_PASSWORD=your_mysql_password \
-e LEMATERIALFETCHER_HF_TOKEN=your_huggingface_token \
lematerial-fetcher run-pipeline