Dataset Generator

Open-source, generic synthetic dataset generator. Build the Docker image once and reuse it inside any external project's docker-compose.yml by mounting your configurations.

What is this?

This is a containerized synthetic dataset generation service designed for embedding into external projects. It provides:

  • Generic architecture: Works with any LLM provider (Ollama, OpenAI, etc.)
  • Template-driven generation: Customizable prompts and output structures
  • Multi-tenant support: Different configurations per agency/project
  • RESTful API: Generate datasets programmatically with background processing
  • Docker-first design: Deploy anywhere with consistent behavior

Quick Start

  1. Clone and start services:
git clone https://github.com/rootcodelabs/Dataset-Generator.git
cd Dataset-Generator
docker compose up -d
  2. Verify services are running:
curl http://localhost:8000/health
  3. Generate your first dataset:
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_structure": "single_question",
    "prompt_template": "institute_topic_question",
    "num_samples": 5,
    "language": "et"
  }'

Check generated datasets in ./output_datasets/
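If you prefer Python over curl, the same request can be issued with the standard library. This is a minimal sketch: the `/generate` endpoint and request fields come from the example above, but the shape of the JSON response is not documented here, so it is returned as-is.

```python
import json
import urllib.request

# Same request body as the curl example above.
payload = {
    "dataset_structure": "single_question",
    "prompt_template": "institute_topic_question",
    "num_samples": 5,
    "language": "et",
}

def generate(base_url="http://localhost:8000"):
    """POST the payload to /generate and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```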

Use in Another Project

Option A: Use Published Image

Add this service to your existing docker-compose.yml:

version: '3'

services:
  # Your existing services...
  
  dataset-generator:
    image: synthesisai/dataset-generator:latest
    ports:
      - "8000:8000"
    environment:
      - PROVIDER_API_URL=http://your-llm-provider:11434
      - MLFLOW_TRACKING_URI=http://your-mlflow:5000
    volumes:
      - ./your-templates:/app/templates
      - ./your-configs:/app/user_configs
      - ./your-data:/app/data
      - ./generated-datasets:/app/output_datasets
      - ./logs:/app/logs
    networks:
      - your-network

  # Optional: Include Ollama if you don't have an LLM provider
  ollama:
    image: synthesisai/dataset-generator-ollama:latest
    ports:
      - "11434:11434"
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ollama_models:/root/.ollama
    networks:
      - your-network

# Named volumes used above must be declared at the top level;
# `your-network` is assumed to already exist in your compose file.
volumes:
  ollama_models:
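From another container on the same compose network, the generator is reachable by its service name via Docker's internal DNS. A minimal health-check sketch, assuming the service name `dataset-generator` from the fragment above:

```python
import urllib.request

# Inside the compose network, the service name resolves via Docker DNS.
# "dataset-generator" matches the service name in the fragment above.
BASE_URL = "http://dataset-generator:8000"

def health_ok(base_url=BASE_URL):
    """Return True if the generator's /health endpoint responds with 2xx."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False
```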

Option B: Build & Push Custom Tag

  1. Customize the service:
# Modify config/config.yaml, templates/, etc.
# Add your custom templates and configurations
  2. Build with your tag:
docker build -f Dockerfile.service -t your-org/dataset-generator:v1.0 .
docker build -f Dockerfile.ollama-gpu -t your-org/dataset-generator-ollama:v1.0 .
  3. Push to your registry:
docker push your-org/dataset-generator:v1.0
docker push your-org/dataset-generator-ollama:v1.0
  4. Use in your project:
services:
  dataset-generator:
    image: your-org/dataset-generator:v1.0
    # ... rest of configuration

Directory Structure

| Directory | Purpose | Mount Point | Example Content |
| --- | --- | --- | --- |
| src/ | Application source code | Not mounted | API routes, core logic, providers |
| templates/ | Prompt templates | /app/templates | prompts/default/base_prompt.txt |
| user_configs/ | User configurations | /app/user_configs | dataset_structures/single_question.yaml |
| config/ | Base configuration | /app/config | config.yaml, model_config.yaml |
| data/ | Input data sources | /app/data | Your source documents/texts |
| output_datasets/ | Generated datasets | /app/output_datasets | JSON/CSV output files |
| logs/ | Application logs | /app/logs | synthetic_data_service.log |

Configuration

Key environment variables for integration:

# LLM Provider
PROVIDER_API_URL=http://ollama:11434        # Your LLM service endpoint
MODEL_NAME=gemma3:1b-it-qat                 # Model to use
PROVIDER_NAME=ollama                        # Provider type

# MLflow (optional)
MLFLOW_TRACKING_URI=http://mlflow:5000      # Experiment tracking

# Service
SERVICE_DEBUG=false                         # Debug logging
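An embedding project typically reads these variables from the environment with documented defaults. A sketch of that pattern, using only the names and values listed above (how the service consumes them internally is an implementation detail):

```python
import os

# Defaults mirror the documented values from the table above.
PROVIDER_API_URL = os.getenv("PROVIDER_API_URL", "http://ollama:11434")
MODEL_NAME = os.getenv("MODEL_NAME", "gemma3:1b-it-qat")
PROVIDER_NAME = os.getenv("PROVIDER_NAME", "ollama")
MLFLOW_TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI")  # optional, may be None
SERVICE_DEBUG = os.getenv("SERVICE_DEBUG", "false").lower() == "true"
```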

API Usage

Generate datasets programmatically:

# Single generation
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_structure": "single_question",
    "prompt_template": "institute_topic_question", 
    "num_samples": 10,
    "language": "et",
    "parameters": {
      "temperature": 0.7,
      "difficulty": "medium"
    }
  }'

# Bulk generation with callback
curl -X POST "http://localhost:8000/generate/bulk" \
  -H "Content-Type: application/json" \
  -d '{
    "requests": [
      {
        "dataset_structure": "single_question",
        "prompt_template": "institute_topic_question",
        "num_samples": 5,
        "language": "et"
      }
    ],
    "callback_url": "http://your-service/callback"
  }'
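The bulk endpoint accepts a `callback_url`, which suggests the service notifies your application when a background job finishes. A minimal callback receiver sketch using the standard library; the exact payload the service POSTs back is not documented here, so this handler simply stores whatever JSON arrives:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Collected callback payloads. The shape of what the service POSTs back
# is an assumption; inspect a real callback before relying on its fields.
received = []

class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        received.append(json.loads(self.rfile.read(length) or b"{}"))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep request logging quiet

def serve(port=9000):
    """Block and listen for callbacks; point callback_url at this host/port."""
    HTTPServer(("0.0.0.0", port), CallbackHandler).serve_forever()
```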

Next Steps

  • See Architecture Documentation for system design
  • Check Configuration Guide for advanced setup
  • Review API Documentation for full endpoint reference
  • Explore Examples for common use cases
