This repository holds the business logic for building and managing the data pipelines that power various data services at MIT Open Learning. The core framework is Dagster, which provides a flexible and well-structured approach to building data applications.
- Ensure that you have the latest version of Docker installed: https://www.docker.com/products/docker-desktop/
- Install Docker Compose, following the documentation and requirements for your specific machine: https://docs.docker.com/compose/install/
- Ensure you are able to authenticate to GitHub and Vault:
  - GitHub: https://github.com/mitodl/ol-data-platform/tree/main
  - Vault QA (https://vault-qa.odl.mit.edu/v1/auth/github/login):

    ```
    vault login -address=https://vault-qa.odl.mit.edu -method=github
    ```

  - Vault production (https://vault-production.odl.mit.edu/v1/auth/github/login):

    ```
    vault login -address=https://vault-production.odl.mit.edu -method=github
    ```

- Create your .env file and populate it with the environment variables:

  ```
  cp .env.example .env
  ```

- Start the services with Docker Compose:

  ```
  docker compose up --build
  ```

- Navigate to localhost:3000 to access the Dagster UI.
This repository includes a script for automatically generating dbt source definitions and staging models from database tables. The script is located at `bin/dbt-create-staging-models.py`.
- Python environment with required dependencies (see `pyproject.toml`)
- dbt environment configured with appropriate credentials
- Access to the target database/warehouse
The script provides three main commands:
```
uv run python bin/dbt-create-staging-models.py generate-sources \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__user \
  --target production
```

```
uv run python bin/dbt-create-staging-models.py generate-staging-models \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__user \
  --target production
```

```
uv run python bin/dbt-create-staging-models.py generate-all \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__user \
  --target production
```

- `--schema`: The database schema to scan for tables (e.g., `ol_warehouse_production_raw`)
- `--prefix`: The table prefix to filter by (e.g., `raw__mitlearn__app__postgres__user`)
- `--target`: The dbt target environment to use (`production`, `qa`, `dev`, etc.)
- `--database`: (Optional) Specify the database name if different from the target default
- `--directory`: (Optional) Override the subdirectory within `models/staging/`
- `--apply-transformations`: (Optional) Apply semantic transformations (default: True)
- `--entity-type`: (Optional) Override auto-detection of the entity type (`user`, `course`, `courserun`, etc.)
- Domain Detection: Extracts the domain from the prefix (e.g., `mitlearn` from `raw__mitlearn__app__postgres__`)
- Entity Detection: Automatically detects the entity type from the table name for semantic transformations
- File Organization: Creates files in `src/ol_dbt/models/staging/{domain}/`
- Source Generation: Uses dbt-codegen to discover matching tables and generate source definitions
- Enhanced Staging Models: Creates SQL and YAML files with automatic transformations applied
- Merging: Automatically merges new tables with existing source files
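As a rough illustration of the domain-detection and file-organization steps, the sketch below shows how a domain and target directory could be derived from a table prefix. It is not the script's actual implementation, and the helper names are hypothetical.

```python
# Hypothetical sketch of domain detection -- not the actual logic in
# bin/dbt-create-staging-models.py. Assumes prefixes follow the
# raw__<domain>__... convention shown in the examples above.
from pathlib import Path


def derive_domain(prefix: str) -> str:
    """Extract the domain segment, e.g. 'raw__mitlearn__app__postgres__user' -> 'mitlearn'."""
    parts = prefix.split("__")
    return parts[1] if len(parts) > 1 else prefix


def staging_dir(prefix: str) -> Path:
    """Directory where generated files are placed for the detected domain."""
    return Path("src/ol_dbt/models/staging") / derive_domain(prefix)


print(staging_dir("raw__mitlearn__app__postgres__user"))
# -> src/ol_dbt/models/staging/mitlearn
```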
The script includes an enhanced macro that automatically applies common transformation patterns:
- Semantic Column Renaming: `id` → `{entity}_id`, `title` → `{entity}_title`
- Timestamp Standardization: Converts all timestamps to ISO 8601 format
- Boolean Normalization: Ensures consistent boolean field naming
- Data Quality: Automatic deduplication for Airbyte sync issues
- String Cleaning: Handles multiple spaces in user names
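To make the renaming pattern concrete, here is a minimal sketch of how generic columns map to entity-scoped names. The real transformations live in the dbt macro; this helper is purely illustrative.

```python
# Illustrative only -- the actual renaming is applied by the dbt macro,
# and its full column mapping is not reproduced here.
GENERIC_COLUMNS = {"id", "title"}


def semantic_name(column: str, entity: str) -> str:
    """Rename generic columns to entity-scoped names, e.g. ('id', 'user') -> 'user_id'."""
    return f"{entity}_{column}" if column in GENERIC_COLUMNS else column


assert semantic_name("id", "user") == "user_id"
assert semantic_name("title", "course") == "course_title"
assert semantic_name("email", "user") == "email"  # non-generic columns pass through unchanged
```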
The system auto-detects entity types from table names:
- `user` tables → User-specific transformations
- `course` tables → Course-specific transformations
- `courserun` tables → Course run transformations
- `video`, `program`, `website` tables → Respective entity transformations
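The detection is keyword-based on the table name. A minimal sketch of the idea (hypothetical helper, not the script's code) could look like:

```python
# Hypothetical sketch of entity detection -- the real script may use a
# different matching strategy or entity list.
# "courserun" is checked before "course" so it is not shadowed by the shorter match.
KNOWN_ENTITIES = ("courserun", "course", "user", "video", "program", "website")


def detect_entity(table_name: str) -> str | None:
    """Guess the entity type from a raw table name."""
    lowered = table_name.lower()
    for entity in KNOWN_ENTITIES:
        if entity in lowered:
            return entity
    return None  # no entity-specific transformations


print(detect_entity("raw__mitlearn__app__postgres__users_user"))  # -> user
```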
The generated source file:

- Location: `src/ol_dbt/models/staging/{domain}/_{domain}__sources.yml`
- Format: Standard dbt sources configuration with dynamic schema references
- Merging: Automatically merges with existing source definitions

The generated staging models:

- SQL Files: `stg_{domain}__{table_name}.sql` - Generated base models with enhanced transformations and explicit column selections
- YAML File: `_stg_{domain}__models.yml` - Consolidated model schema definitions for all staging models in the domain
Generate sources and staging models for all matching tables:

```
python bin/dbt-create-staging-models.py generate-all \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__user \
  --target production
```

Generate without automatic transformations:

```
python bin/dbt-create-staging-models.py generate-all \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__user \
  --target production \
  --no-apply-transformations
```

Override the detected entity type:

```
python bin/dbt-create-staging-models.py generate-all \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__user \
  --target production \
  --entity-type user
```

For example:

```
python bin/dbt-create-staging-models.py generate-all \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__user \
  --target production
```

This creates:

- `src/ol_dbt/models/staging/mitlearn/_mitlearn__sources.yml` - Source definitions
- `src/ol_dbt/models/staging/mitlearn/_stg_mitlearn__models.yml` - Consolidated model definitions
- `src/ol_dbt/models/staging/mitlearn/stg_mitlearn__raw__mitlearn__app__postgres__users_user.sql` - Individual SQL files
- Additional SQL files for other discovered user-related tables
```
python bin/dbt-create-staging-models.py generate-sources \
  --schema ol_warehouse_production_raw \
  --prefix raw__mitlearn__app__postgres__auth \
  --target production
```

This merges auth-related tables into the existing `_mitlearn__sources.yml` file.
- The script follows existing dbt project conventions and naming patterns
- Source files use the standard `ol_warehouse_raw_data` source with dynamic schema configuration
- Generated staging models reference the correct source and include all discovered columns
- The script handles YAML merging to avoid duplicating source definitions
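For context on the merging behavior, here is a sketch of how new tables could be folded into an existing sources file without creating duplicates, using PyYAML. It assumes a standard dbt `sources:` layout and is not the script's actual merge code.

```python
# Hypothetical merge sketch -- assumes a standard dbt sources.yml structure;
# the real script's merge logic may differ.
import yaml


def merge_tables(sources_path: str, source_name: str, new_tables: list[str]) -> None:
    """Append table entries to an existing dbt source, skipping any already present."""
    with open(sources_path) as f:
        doc = yaml.safe_load(f)

    for source in doc.get("sources", []):
        if source.get("name") != source_name:
            continue
        existing = {t["name"] for t in source.get("tables", [])}
        source.setdefault("tables", []).extend(
            {"name": t} for t in new_tables if t not in existing
        )

    with open(sources_path, "w") as f:
        yaml.safe_dump(doc, f, sort_keys=False)
```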
This repository includes a utility script for running `uv` commands across all code locations in the `dg_deployment/code_locations` directory. The script is located at `bin/uv-operations.py`.
The `uv-operations.py` script automatically discovers all directories containing a `pyproject.toml` file in the code locations directory and executes the specified `uv` command on each one. This is useful for operations like:
- Synchronizing dependencies across all code locations (`uv sync`)
- Upgrading lock files (`uv lock --upgrade`)
- Building packages (`uv build`)
- Listing installed packages (`uv pip list`)
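Conceptually, the script is a discover-and-run loop over the code locations. A simplified sketch (not the actual implementation in `bin/uv-operations.py`) looks like:

```python
# Simplified sketch of the discover-and-run loop; the real script adds
# option parsing, --continue-on-error handling, and a summary report.
import subprocess
from pathlib import Path

CODE_LOCATIONS = Path("dg_deployment/code_locations")

# Any directory containing a pyproject.toml is treated as a code location.
locations = sorted(path.parent for path in CODE_LOCATIONS.glob("*/pyproject.toml"))

for location in locations:
    print(f"Running 'uv sync' in {location}")
    result = subprocess.run(["uv", "sync"], cwd=location)
    if result.returncode != 0:
        # By default the real script stops at the first failure.
        raise SystemExit(f"uv sync failed in {location}")
```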
Run the script with any `uv` command and its arguments:

```
python bin/uv-operations.py <uv-command> [args...]
```

Or run it directly as an executable:

```
./bin/uv-operations.py <uv-command> [args...]
```

For example:

```
python bin/uv-operations.py sync
python bin/uv-operations.py lock --upgrade
python bin/uv-operations.py pip list
```

By default, the script stops at the first failure. To continue processing all locations even if some fail:

```
python bin/uv-operations.py sync --continue-on-error
```

For detailed output showing the exact commands being run:

```
python bin/uv-operations.py sync --verbose
```

Options:

- `--code-locations-dir`: Base directory containing code locations (default: `dg_deployment/code_locations`)
- `--continue-on-error`: Continue running even if some locations fail
- `--verbose`: Print verbose output including the full command being executed
The script provides:
- A list of discovered code locations
- Progress indicators for each location being processed
- Success (✓) or failure (✗) markers for each location
- A summary at the end showing successful and failed operations