LeMaterial-Fetcher

LeMaterial-Fetcher is designed to fetch data from external sources, process it, and store it in a PostgreSQL database in a predefined format, with structure-level validation and database-level validators. Fetching and processing run concurrently so that large sources are handled efficiently.

The objective is to retrieve data from various sources and consolidate it in a local database with a unified format. The data can then be processed as needed and uploaded to an easily accessible online platform such as Hugging Face.

Explore the datasets built with this tool on Hugging Face 🤗:

👉 LeMat-Bulk

👉 LeMat-Traj

Data Credits

This project relies entirely on the valuable contributions of several materials science database projects:

  • Materials Project - A comprehensive database of computed materials properties funded by the U.S. Department of Energy and developed by the Lawrence Berkeley National Laboratory in collaboration with several other laboratories and universities
  • Alexandria Library - A quantum-accurate materials library developed by ICAMS at Ruhr University Bochum
  • Open Quantum Materials Database (OQMD) - An extensive collection of DFT-calculated properties maintained by researchers at Northwestern University

We gratefully acknowledge these projects and their dedication to open materials science data. Our tool is built entirely on the foundation of their well-maintained databases and research efforts.

Prerequisites

  • Python 3.11 or later
  • PostgreSQL database
  • Environment variables set for configuration

Installation

  1. Clone the repository:

    git clone git@github.com:LeMaterial/lematerial-fetcher.git
    cd lematerial-fetcher
  2. Set up your environment variables. Copy the provided template and customize it:

    cp .env.example .env
    vim .env
  3. Install the package:

    # Using uv (recommended)
    uv add git+https://github.com/LeMaterial/lematerial-fetcher.git
    # or
    uv pip install git+https://github.com/LeMaterial/lematerial-fetcher.git
    
    # Or using pip
    pip install git+https://github.com/LeMaterial/lematerial-fetcher.git

Configuration

The tool can be configured in two ways:

  1. Command-line arguments (recommended for most options)
  2. Environment variables (preferred for sensitive information)

Environment Variables

For sensitive information like passwords, use environment variables with the LEMATERIALFETCHER_ prefix:

# Database credentials
export LEMATERIALFETCHER_DB_PASSWORD=your_password
export LEMATERIALFETCHER_MYSQL_PASSWORD=your_mysql_password

# Hugging Face credentials
export LEMATERIALFETCHER_HF_TOKEN=your_huggingface_token

A template .env.example file is provided that you can copy to .env and customize.
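
Because credentials are read from the environment, a quick pre-flight check can catch a missing variable before a long run starts. The sketch below is only an illustration and is not part of lematerial-fetcher: the helper `missing_env_vars` is hypothetical, and which variables a given command actually needs is an assumption. Only the three variable names come from this README.

```python
import os

# Variable names taken from this README. Which of them a given command
# really requires depends on the data source (assumption).
REQUIRED_VARS = (
    "LEMATERIALFETCHER_DB_PASSWORD",
    "LEMATERIALFETCHER_MYSQL_PASSWORD",
    "LEMATERIALFETCHER_HF_TOKEN",
)


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All expected environment variables are set.")
```

Passing a plain dictionary as `environ` makes the check easy to exercise without touching the real environment.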

Usage

The CLI provides several commands for different data sources and operations. Here's a comprehensive guide:

Basic Structure

lematerial-fetcher [GLOBAL_OPTIONS] COMMAND [COMMAND_OPTIONS]

Global Options

  • --debug: Run operations in the main process for debugging
  • --cache-dir DIR: Directory for temporary data (default: ~/.cache/lematerial_fetcher)

Available Commands

  1. Materials Project (MP)

    # Fetch structures
    lematerial-fetcher mp fetch --table-name mp_structures --num-workers 4
    
    # Fetch tasks
    lematerial-fetcher mp fetch --tasks --table-name mp_tasks
    
    # Transform data
    lematerial-fetcher mp transform --table-name source_table --dest-table-name dest_table
  2. Alexandria

    # Fetch structures
    lematerial-fetcher alexandria fetch --table-name alex_structures --functional pbe
    
    # Fetch trajectories
    lematerial-fetcher alexandria fetch --traj --table-name alex_trajectories
    
    # Transform data
    lematerial-fetcher alexandria transform --table-name source_table --dest-table-name dest_table
  3. OQMD

    # Fetch data
    lematerial-fetcher oqmd fetch --table-name oqmd_structures
    
    # Transform data
    lematerial-fetcher oqmd transform --table-name source_table --dest-table-name dest_table
  4. Push to Hugging Face

    lematerial-fetcher push --table-name my_table --hf-repo-id my-repo

Common Options

These options are available across most commands:

Database Options

  • --db-conn-str STR: Complete database connection string
  • --db-user USER: Database username
  • --db-host HOST: Database host (default: localhost)
  • --db-name NAME: Database name (default: lematerial)

Processing Options

  • --num-workers N: Number of parallel workers
  • --log-dir DIR: Directory for logs (default: ./logs)
  • --max-retries N: Maximum retry attempts (default: 3)
  • --retry-delay N: Delay between retries in seconds (default: 2)
  • --log-every N: Log frequency (default: 1000)
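
To make the retry options concrete, here is a minimal sketch of how a retry count and a fixed delay typically interact. It is not the tool's implementation: the function `call_with_retries` is hypothetical, and treating `--max-retries` as the number of additional attempts after the first failure is an assumption.

```python
import time


def call_with_retries(func, max_retries=3, retry_delay=2.0, sleep=time.sleep):
    """Call func(); on failure, retry up to max_retries more times.

    Waits retry_delay seconds between attempts and re-raises the last
    exception once the retries are exhausted.
    """
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= max_retries:
                raise
            attempt += 1
            sleep(retry_delay)
```

With the defaults listed above (3 retries, 2 seconds), a request that keeps failing would be attempted four times in total, with about six seconds of waiting under this interpretation.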

Fetch Options

  • --offset N: Starting offset (default: 0)
  • --table-name NAME: Target table name
  • --limit N: Items per API request (default: 500)
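
The relationship between `--offset` and `--limit` can be sketched as a simple pagination loop. This is an illustration under stated assumptions rather than the fetcher's own code: `fetch_page` is a hypothetical callable standing in for one API request, and stopping on a short page is an assumed convention.

```python
def paginate(fetch_page, offset=0, limit=500):
    """Yield items page by page, starting at offset.

    fetch_page(offset, limit) is a hypothetical callable that returns a
    list of at most `limit` items. Iteration stops at the first page
    that is shorter than `limit` (including an empty page).
    """
    while True:
        page = fetch_page(offset, limit)
        yield from page
        if len(page) < limit:
            break
        offset += len(page)
```

Resuming an interrupted fetch then amounts to starting again with the offset of the last item that was stored.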

Transformer Options

  • --batch-size N: Batch processing size (default: 500)
  • --dest-table-name NAME: Destination table name
  • --traj: Transform trajectory data
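
As an illustration of what a batch size means in practice, the sketch below groups rows into fixed-size chunks. The helper `in_batches` is hypothetical and is not taken from lematerial-fetcher. It is written with `itertools.islice` so that it runs on Python 3.11, the minimum version listed in the prerequisites.

```python
from itertools import islice


def in_batches(rows, batch_size=500):
    """Yield successive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    # islice consumes the shared iterator, so each pass picks up
    # where the previous batch ended; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

A larger batch generally means fewer round trips to the database at the cost of more memory per worker.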

Examples

  1. Fetch from Materials Project with custom configuration:

    lematerial-fetcher mp fetch \
      --table-name mp_structures \
      --num-workers 4 \
      --db-host localhost \
      --db-name materials \
      --log-dir ./mp_logs
  2. Transform Alexandria data with separate source and destination database hosts:

    lematerial-fetcher alexandria transform \
      --table-name source_table \
      --dest-table-name dest_table \
      --batch-size 1000 \
      --db-host source_host \
      --dest-db-host dest_host
  3. Push to Hugging Face with custom chunk size:

    lematerial-fetcher push \
      --table-name my_table \
      --hf-repo-id my-repo \
      --chunk-size 2000 \
      --max-rows 10000

Database Configuration

PostgreSQL Configuration

You can configure the database connection in two ways:

  1. Using individual parameters:

    # Set password in environment
    export LEMATERIALFETCHER_DB_PASSWORD=your_password
    
    # Run command
    lematerial-fetcher mp fetch --db-user username --db-name database_name
  2. Using a connection string:

    lematerial-fetcher mp fetch --db-conn-str="host=localhost user=username password=password dbname=database_name sslmode=disable"

MySQL Configuration (for OQMD)

MySQL-specific options:

  • --mysql-host HOST: MySQL host (default: localhost)
  • --mysql-user USER: MySQL username
  • --mysql-database NAME: MySQL database name (default: lematerial)
  • --mysql-cert-path PATH: Path to MySQL SSL certificate

Acknowledgements

This project leverages data from several established materials databases. Please see ACKNOWLEDGEMENTS.md for complete information about the data sources used and proper citations for academic use.

License and copyright

This code base is the property of Entalpic and is licensed under the Apache License, Version 2.0 (the "License").

Copyright 2025 Entalpic

Docker Setup

For easy deployment and execution, we provide a Docker setup that includes both PostgreSQL and MySQL databases. This setup allows you to run the entire pipeline with a single command.

Building the Docker Image

docker build -t lematerial-fetcher .

Running the Pipeline

  1. Basic Run

    docker run -it lematerial-fetcher run-pipeline
  2. With Hugging Face Integration

    docker run -it \
      -e LEMATERIALFETCHER_HF_TOKEN=your_huggingface_token \
      lematerial-fetcher run-pipeline
  3. Interactive Shell

    docker run -it lematerial-fetcher bash
  4. Persistent Data Storage

    docker run -it \
      -v $(pwd)/data:/app/data \
      -v $(pwd)/logs:/app/logs \
      lematerial-fetcher run-pipeline

Database Access

The Docker container exposes the following ports:

  • PostgreSQL: 5432
  • MySQL: 3306

You can connect to the databases using these credentials:

  • PostgreSQL:

    • Host: localhost
    • Port: 5432
    • User: root
    • Password: root
    • Database: lematerial
  • MySQL:

    • Host: localhost
    • Port: 3306
    • User: root
    • Password: root
    • Database: lematerial

Environment Variables

You can customize the setup using environment variables:

docker run -it \
  -e LEMATERIALFETCHER_DB_PASSWORD=your_password \
  -e LEMATERIALFETCHER_MYSQL_PASSWORD=your_mysql_password \
  -e LEMATERIALFETCHER_HF_TOKEN=your_huggingface_token \
  lematerial-fetcher run-pipeline
