Dataset Generator

Open-source, generic synthetic dataset generator. Build the Docker image once and reuse it inside any external project's docker-compose.yml by mounting your configurations.

What is this?

This is a containerized synthetic dataset generation service designed for embedding into external projects. It provides:

  • Generic architecture: Works with any LLM provider (Ollama, OpenAI, etc.)
  • Template-driven generation: Customizable prompts and output structures
  • Multi-tenant support: Different configurations per agency/project
  • RESTful API: Generate datasets programmatically with background processing
  • Docker-first design: Deploy anywhere with consistent behavior

Quick Start

  1. Clone and start services:
git clone https://github.com/rootcodelabs/Dataset-Generator.git
cd Dataset-Generator
docker compose up -d
  2. Verify services are running:
curl http://localhost:8000/health
  3. Generate your first dataset:
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_structure": "single_question",
    "prompt_template": "institute_topic_question",
    "num_samples": 5,
    "language": "et"
  }'

Check generated datasets in ./output_datasets/
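If you prefer Python over curl, the same request can be issued with the standard library. This is a minimal sketch: the `/generate` endpoint and request fields come from the example above, but the shape of the JSON response is not documented here, so it is returned as-is.

```python
import json
import urllib.request

# Same request body as the curl example above.
payload = {
    "dataset_structure": "single_question",
    "prompt_template": "institute_topic_question",
    "num_samples": 5,
    "language": "et",
}

def generate(base_url="http://localhost:8000"):
    """POST the payload to /generate and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```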

Use in Another Project

Option A: Use Published Image

Add this service to your existing docker-compose.yml:

version: '3'

services:
  # Your existing services...
  
  dataset-generator:
    image: synthesisai/dataset-generator:latest
    ports:
      - "8000:8000"
    environment:
      - PROVIDER_API_URL=http://your-llm-provider:11434
      - MLFLOW_TRACKING_URI=http://your-mlflow:5000
    volumes:
      - ./your-templates:/app/templates
      - ./your-configs:/app/user_configs
      - ./your-data:/app/data
      - ./generated-datasets:/app/output_datasets
      - ./logs:/app/logs
    networks:
      - your-network

  # Optional: Include Ollama if you don't have an LLM provider
  ollama:
    image: synthesisai/dataset-generator-ollama:latest
    ports:
      - "11434:11434"
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ollama_models:/root/.ollama
    networks:
      - your-network

# Named volumes used above must be declared at the top level;
# `your-network` is assumed to already exist in your compose file.
volumes:
  ollama_models:
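From another container on the same compose network, the generator is reachable by its service name via Docker's internal DNS. A minimal health-check sketch, assuming the service name `dataset-generator` from the fragment above:

```python
import urllib.request

# Inside the compose network, the service name resolves via Docker DNS.
# "dataset-generator" matches the service name in the fragment above.
BASE_URL = "http://dataset-generator:8000"

def health_ok(base_url=BASE_URL):
    """Return True if the generator's /health endpoint responds with 2xx."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False
```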

Option B: Build & Push Custom Tag

  1. Customize the service:
# Modify config/config.yaml, templates/, etc.
# Add your custom templates and configurations
  2. Build with your tag:
docker build -f Dockerfile.service -t your-org/dataset-generator:v1.0 .
docker build -f Dockerfile.ollama-gpu -t your-org/dataset-generator-ollama:v1.0 .
  3. Push to your registry:
docker push your-org/dataset-generator:v1.0
docker push your-org/dataset-generator-ollama:v1.0
  4. Use in your project:
services:
  dataset-generator:
    image: your-org/dataset-generator:v1.0
    # ... rest of configuration

Directory Structure

| Directory | Purpose | Mount Point | Example Content |
| --- | --- | --- | --- |
| src/ | Application source code | Not mounted | API routes, core logic, providers |
| templates/ | Prompt templates | /app/templates | prompts/default/base_prompt.txt |
| user_configs/ | User configurations | /app/user_configs | dataset_structures/single_question.yaml |
| config/ | Base configuration | /app/config | config.yaml, model_config.yaml |
| data/ | Input data sources | /app/data | Your source documents/texts |
| output_datasets/ | Generated datasets | /app/output_datasets | JSON/CSV output files |
| logs/ | Application logs | /app/logs | synthetic_data_service.log |

Configuration

Key environment variables for integration:

# LLM Provider
PROVIDER_API_URL=http://ollama:11434        # Your LLM service endpoint
MODEL_NAME=gemma3:1b-it-qat                 # Model to use
PROVIDER_NAME=ollama                        # Provider type

# MLflow (optional)
MLFLOW_TRACKING_URI=http://mlflow:5000      # Experiment tracking

# Service
SERVICE_DEBUG=false                         # Debug logging
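An embedding project typically reads these variables from the environment with documented defaults. A sketch of that pattern, using only the names and values listed above (how the service consumes them internally is an implementation detail):

```python
import os

# Defaults mirror the documented values from the table above.
PROVIDER_API_URL = os.getenv("PROVIDER_API_URL", "http://ollama:11434")
MODEL_NAME = os.getenv("MODEL_NAME", "gemma3:1b-it-qat")
PROVIDER_NAME = os.getenv("PROVIDER_NAME", "ollama")
MLFLOW_TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI")  # optional, may be None
SERVICE_DEBUG = os.getenv("SERVICE_DEBUG", "false").lower() == "true"
```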

API Usage

Generate datasets programmatically:

# Single generation
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_structure": "single_question",
    "prompt_template": "institute_topic_question", 
    "num_samples": 10,
    "language": "et",
    "parameters": {
      "temperature": 0.7,
      "difficulty": "medium"
    }
  }'

# Bulk generation with callback
curl -X POST "http://localhost:8000/generate/bulk" \
  -H "Content-Type: application/json" \
  -d '{
    "requests": [
      {
        "dataset_structure": "single_question",
        "prompt_template": "institute_topic_question",
        "num_samples": 5,
        "language": "et"
      }
    ],
    "callback_url": "http://your-service/callback"
  }'
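The bulk endpoint accepts a `callback_url`, which suggests the service notifies your application when a background job finishes. A minimal callback receiver sketch using the standard library; the exact payload the service POSTs back is not documented here, so this handler simply stores whatever JSON arrives:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Collected callback payloads. The shape of what the service POSTs back
# is an assumption; inspect a real callback before relying on its fields.
received = []

class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        received.append(json.loads(self.rfile.read(length) or b"{}"))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep request logging quiet

def serve(port=9000):
    """Block and listen for callbacks; point callback_url at this host/port."""
    HTTPServer(("0.0.0.0", port), CallbackHandler).serve_forever()
```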

Next Steps

  • See Architecture Documentation for system design
  • Check Configuration Guide for advanced setup
  • Review API Documentation for full endpoint reference
  • Explore Examples for common use cases
