This repository holds the business logic for building and managing the data pipelines that power various data services at MIT Open Learning. The core framework is Dagster, which provides a flexible and well-structured approach to building data applications.
- Ensure that you have the latest version of Docker installed: https://www.docker.com/products/docker-desktop/
- Install Docker Compose, following the documentation and requirements for your specific machine: https://docs.docker.com/compose/install/
- Ensure you are able to authenticate to GitHub and Vault:
  - GitHub: https://github.com/mitodl/ol-data-platform/tree/main
  - Vault QA (https://vault-qa.odl.mit.edu/v1/auth/github/login):

    ```
    vault login -address=https://vault-qa.odl.mit.edu -method=github
    ```

  - Vault production (https://vault-production.odl.mit.edu/v1/auth/github/login):

    ```
    vault login -address=https://vault-production.odl.mit.edu -method=github
    ```

- Create your .env file and populate it with the environment variables:

  ```
  cp .env.example .env
  ```

- Start the services with Docker Compose:

  ```
  docker compose up --build
  ```

- Navigate to localhost:3000 to access the Dagster UI.
This repository includes a script for automatically generating dbt source definitions and staging models from database tables. The script is located at `bin/dbt-create-staging-models.py`.
- Python environment with required dependencies (see `pyproject.toml`)
- dbt environment configured with appropriate credentials
- Access to the target database/warehouse
The script provides three main commands:
```
uv run python bin/dbt-create-staging-models.py generate-sources \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__user \
  --target production
```

```
uv run python bin/dbt-create-staging-models.py generate-staging-models \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__user \
  --target production
```

```
uv run python bin/dbt-create-staging-models.py generate-all \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__user \
  --target production
```

- `--schema`: The database schema to scan for tables (e.g., `ol_warehouse_production_raw`)
- `--prefix`: The table prefix to filter by (e.g., `raw__mitlearn__app__postgres__user`)
- `--target`: The dbt target environment to use (`production`, `qa`, `dev`, etc.)
- `--database`: (Optional) Specify the database name if different from the target default
- `--directory`: (Optional) Override the subdirectory within `models/staging/`
- `--apply-transformations`: (Optional) Apply semantic transformations (default: True)
- `--entity-type`: (Optional) Override auto-detection of the entity type (`user`, `course`, `courserun`, etc.)
- Domain Detection: Extracts the domain from the prefix (e.g., `mitlearn` from `raw__mitlearn__app__postgres__`)
- Entity Detection: Automatically detects the entity type from the table name for semantic transformations
- File Organization: Creates files in `src/ol_dbt/models/staging/{domain}/`
- Source Generation: Uses dbt-codegen to discover matching tables and generate source definitions
- Enhanced Staging Models: Creates SQL and YAML files with automatic transformations applied
- Merging: Automatically merges new tables with existing source files
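As a rough illustration of the domain-detection and file-organization steps, the sketch below shows how a domain and target directory could be derived from a table prefix. It is not the script's actual implementation, and the helper names are hypothetical.

```python
# Hypothetical sketch of domain detection -- not the actual logic in
# bin/dbt-create-staging-models.py. Assumes prefixes follow the
# raw__<domain>__... convention shown in the examples above.
from pathlib import Path


def derive_domain(prefix: str) -> str:
    """Extract the domain segment, e.g. 'raw__mitlearn__app__postgres__user' -> 'mitlearn'."""
    parts = prefix.split("__")
    return parts[1] if len(parts) > 1 else prefix


def staging_dir(prefix: str) -> Path:
    """Directory where generated files are placed for the detected domain."""
    return Path("src/ol_dbt/models/staging") / derive_domain(prefix)


print(staging_dir("raw__mitlearn__app__postgres__user"))
# -> src/ol_dbt/models/staging/mitlearn
```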
The script includes an enhanced macro that automatically applies common transformation patterns:
- Semantic Column Renaming: `id` → `{entity}_id`, `title` → `{entity}_title`
- Timestamp Standardization: Converts all timestamps to ISO 8601 format
- Boolean Normalization: Ensures consistent boolean field naming
- Data Quality: Automatic deduplication for Airbyte sync issues
- String Cleaning: Handles multiple spaces in user names
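To make the renaming pattern concrete, here is a minimal sketch of how generic columns map to entity-scoped names. The real transformations live in the dbt macro; this helper is purely illustrative.

```python
# Illustrative only -- the actual renaming is applied by the dbt macro,
# and its full column mapping is not reproduced here.
GENERIC_COLUMNS = {"id", "title"}


def semantic_name(column: str, entity: str) -> str:
    """Rename generic columns to entity-scoped names, e.g. ('id', 'user') -> 'user_id'."""
    return f"{entity}_{column}" if column in GENERIC_COLUMNS else column


assert semantic_name("id", "user") == "user_id"
assert semantic_name("title", "course") == "course_title"
assert semantic_name("email", "user") == "email"  # non-generic columns pass through unchanged
```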
The system auto-detects entity types from table names:
- `user` tables → User-specific transformations
- `course` tables → Course-specific transformations
- `courserun` tables → Course run transformations
- `video`, `program`, `website` tables → Respective entity transformations
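The detection is keyword-based on the table name. A minimal sketch of the idea (hypothetical helper, not the script's code) could look like:

```python
# Hypothetical sketch of entity detection -- the real script may use a
# different matching strategy or entity list.
# "courserun" is checked before "course" so it is not shadowed by the shorter match.
KNOWN_ENTITIES = ("courserun", "course", "user", "video", "program", "website")


def detect_entity(table_name: str) -> str | None:
    """Guess the entity type from a raw table name."""
    lowered = table_name.lower()
    for entity in KNOWN_ENTITIES:
        if entity in lowered:
            return entity
    return None  # no entity-specific transformations


print(detect_entity("raw__mitlearn__app__postgres__users_user"))  # -> user
```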
The generated source file:

- Location: `src/ol_dbt/models/staging/{domain}/_{domain}__sources.yml`
- Format: Standard dbt sources configuration with dynamic schema references
- Merging: Automatically merges with existing source definitions

The generated staging models:

- SQL Files: `stg_{domain}__{table_name}.sql` - Generated base models with enhanced transformations and explicit column selections
- YAML File: `_stg_{domain}__models.yml` - Consolidated model schema definitions for all staging models in the domain
Generate sources and staging models for all matching tables:

```
python bin/dbt-create-staging-models.py generate-all \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__user \
  --target production
```

Generate without automatic transformations:

```
python bin/dbt-create-staging-models.py generate-all \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__user \
  --target production \
  --no-apply-transformations
```

Override the detected entity type:

```
python bin/dbt-create-staging-models.py generate-all \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__user \
  --target production \
  --entity-type user
```

For example:

```
python bin/dbt-create-staging-models.py generate-all \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__user \
  --target production
```

This creates:

- `src/ol_dbt/models/staging/mitlearn/_mitlearn__sources.yml` - Source definitions
- `src/ol_dbt/models/staging/mitlearn/_stg_mitlearn__models.yml` - Consolidated model definitions
- `src/ol_dbt/models/staging/mitlearn/stg_mitlearn__raw__mitlearn__app__postgres__users_user.sql` - Individual SQL files
- Additional SQL files for other discovered user-related tables
```
python bin/dbt-create-staging-models.py generate-sources \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__auth \
  --target production
```

This merges auth-related tables into the existing `_mitlearn__sources.yml` file.
- The script follows existing dbt project conventions and naming patterns
- Source files use the standard `ol_warehouse_raw_data` source with dynamic schema configuration
- Generated staging models reference the correct source and include all discovered columns
- The script handles YAML merging to avoid duplicating source definitions
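For context on the merging behavior, here is a sketch of how new tables could be folded into an existing sources file without creating duplicates, using PyYAML. It assumes a standard dbt `sources:` layout and is not the script's actual merge code.

```python
# Hypothetical merge sketch -- assumes a standard dbt sources.yml structure;
# the real script's merge logic may differ.
import yaml


def merge_tables(sources_path: str, source_name: str, new_tables: list[str]) -> None:
    """Append table entries to an existing dbt source, skipping any already present."""
    with open(sources_path) as f:
        doc = yaml.safe_load(f)

    for source in doc.get("sources", []):
        if source.get("name") != source_name:
            continue
        existing = {t["name"] for t in source.get("tables", [])}
        source.setdefault("tables", []).extend(
            {"name": t} for t in new_tables if t not in existing
        )

    with open(sources_path, "w") as f:
        yaml.safe_dump(doc, f, sort_keys=False)
```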
This repository includes a utility script for running `uv` commands across all code locations in the `dg_deployment/code_locations` directory. The script is located at `bin/uv-operations.py`.
The `uv-operations.py` script automatically discovers all directories containing a `pyproject.toml` file in the code locations directory and executes the specified `uv` command on each one. This is useful for operations like:
- Synchronizing dependencies across all code locations (`uv sync`)
- Upgrading lock files (`uv lock --upgrade`)
- Building packages (`uv build`)
- Listing installed packages (`uv pip list`)
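Conceptually, the script is a discover-and-run loop over the code locations. A simplified sketch (not the actual implementation in `bin/uv-operations.py`) looks like:

```python
# Simplified sketch of the discover-and-run loop; the real script adds
# option parsing, --continue-on-error handling, and a summary report.
import subprocess
from pathlib import Path

CODE_LOCATIONS = Path("dg_deployment/code_locations")

# Any directory containing a pyproject.toml is treated as a code location.
locations = sorted(path.parent for path in CODE_LOCATIONS.glob("*/pyproject.toml"))

for location in locations:
    print(f"Running 'uv sync' in {location}")
    result = subprocess.run(["uv", "sync"], cwd=location)
    if result.returncode != 0:
        # By default the real script stops at the first failure.
        raise SystemExit(f"uv sync failed in {location}")
```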
Run the script with any `uv` command and its arguments:

```
python bin/uv-operations.py <uv-command> [args...]
```

Or run it directly as an executable:

```
./bin/uv-operations.py <uv-command> [args...]
```

For example:

```
python bin/uv-operations.py sync
python bin/uv-operations.py lock --upgrade
python bin/uv-operations.py pip list
```

By default, the script stops at the first failure. To continue processing all locations even if some fail:

```
python bin/uv-operations.py sync --continue-on-error
```

For detailed output showing the exact commands being run:

```
python bin/uv-operations.py sync --verbose
```

Options:

- `--code-locations-dir`: Base directory containing code locations (default: `dg_deployment/code_locations`)
- `--continue-on-error`: Continue running even if some locations fail
- `--verbose`: Print verbose output including the full command being executed
The script provides:
- A list of discovered code locations
- Progress indicators for each location being processed
- Success (✓) or failure (✗) markers for each location
- A summary at the end showing successful and failed operations