@Copilot Copilot AI commented Jul 2, 2025

This PR adds a GitHub Actions workflow that runs every data cleaning script in the repository on each pull request targeting the main branch.

What this adds

  • CI Workflow: .github/workflows/data-cleaning.yml that triggers on pull requests to main
  • Conda Environment Setup: Automatically installs Miniconda and creates required environments (nf-core and env)
  • Automated Script Discovery: Dynamically finds and executes all 00_run_clean_raw_data.sh scripts across the repository

How it works

The workflow discovers and runs 6 data cleaning scripts with their corresponding Nextflow workflows:

scripts/bp3c50id/00_run_clean_raw_data.sh → workflows/00_clean_raw_data.bp3c50id.nf
scripts/hv_class/00_run_clean_raw_data.sh → workflows/00_clean_raw_data.hv_class.nf
scripts/hv_seg/00_run_clean_raw_data.sh → workflows/00_clean_raw_data.hv_seg.nf
scripts/iedb_bp3/00_run_clean_raw_data.sh → workflows/00_clean_raw_data.iedb_bp3.nf
scripts/in_class/00_run_clean_raw_data.sh → workflows/00_clean_raw_data.in_class.nf
scripts/in_seg/00_run_clean_raw_data.sh → workflows/00_clean_raw_data.in_seg.nf
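The mapping above follows one naming pattern, so the workflow path can be derived from the script path. The sketch below illustrates that pattern only; `workflow_for_script` is a hypothetical helper name and is not taken from the PR's workflow file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the Nextflow workflow path from a cleaning
# script path, assuming the scripts/<dataset>/00_run_clean_raw_data.sh
# layout listed above.
workflow_for_script() {
    local script_path="$1"
    local dataset
    # The dataset name is the directory that directly contains the script.
    dataset="$(basename "$(dirname "$script_path")")"
    printf 'workflows/00_clean_raw_data.%s.nf\n' "$dataset"
}

workflow_for_script scripts/hv_seg/00_run_clean_raw_data.sh
# prints: workflows/00_clean_raw_data.hv_seg.nf
```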

For each script, the workflow:

  1. Creates necessary temporary directories (tmp/nextflow/{dataset}/clean_raw_data/)
  2. Sets up environment variables (NXF_LOG_FILE, NXF_CACHE_DIR)
  3. Executes the corresponding Nextflow workflow using conda run -n nf-core
  4. Handles expected failures gracefully (e.g., missing data in CI environment)
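Steps 1 to 3 could look roughly like the sketch below. This is an illustration under assumptions, not the code in the workflow file: the function name `run_clean_raw_data`, the `DRY_RUN` switch, and the log and cache file names inside the temporary directory are all hypothetical. Only the directory pattern, the two variable names, and the `conda run -n nf-core` invocation come from the description above. Step 4 (deciding what to do with a non-zero exit status) is left to the caller.

```shell
#!/usr/bin/env bash
# Sketch of the per-dataset steps. Function name, DRY_RUN switch, and the
# file names under the tmp directory are assumptions for illustration.
run_clean_raw_data() {
    local dataset="$1"
    local tmp_dir="tmp/nextflow/${dataset}/clean_raw_data"
    local workflow="workflows/00_clean_raw_data.${dataset}.nf"

    # Step 1: create the temporary directory for this dataset.
    mkdir -p "$tmp_dir"

    # Step 2: point the Nextflow log and cache at it (exact paths assumed).
    export NXF_LOG_FILE="${tmp_dir}/nextflow.log"
    export NXF_CACHE_DIR="${tmp_dir}/cache"

    # Step 3: run the workflow inside the nf-core conda environment.
    # With DRY_RUN=1 the command is only printed, which is handy on
    # machines that have neither conda nor Nextflow installed.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "conda run -n nf-core nextflow run ${workflow}"
        return 0
    fi
    conda run -n nf-core nextflow run "$workflow"
}
```

Calling `DRY_RUN=1 run_clean_raw_data hv_class` creates `tmp/nextflow/hv_class/clean_raw_data/` and prints the command that would be run.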

Key features

  • Environment Isolation: Uses conda environments exactly as defined in the original scripts
  • Robust Error Handling: Continues processing other scripts even if one fails
  • Scalable: Automatically discovers new data cleaning scripts without workflow updates
  • SLURM-Compatible: Adapts SLURM-based scripts for GitHub Actions while preserving execution logic
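The discovery and continue-on-failure behaviour listed above can be sketched as a small loop. The function name `run_all_cleaning_scripts` and its calling convention (a root directory followed by the command used to run each script) are hypothetical, and returning a non-zero status when any script failed is a design assumption; the PR description does not say whether failures should fail the job.

```shell
#!/usr/bin/env bash
# Sketch: find every 00_run_clean_raw_data.sh under a root directory and
# run each one, continuing after failures. Names here are hypothetical.
run_all_cleaning_scripts() {
    local root="$1"
    shift
    local failures=0
    local script
    local scripts
    # Sort so the run order is stable between CI runs.
    scripts="$(find "$root" -type f -name '00_run_clean_raw_data.sh' | sort)"

    while IFS= read -r script; do
        # Skip the empty line produced when nothing was found.
        [ -n "$script" ] || continue
        echo "Running ${script}"
        # "$@" is the runner command (for example: bash). Its stdin is
        # detached so it cannot consume the list of remaining scripts.
        if ! "$@" "$script" </dev/null; then
            echo "FAILED: ${script}" >&2
            failures=$((failures + 1))
        fi
    done <<EOF
$scripts
EOF

    echo "${failures} script(s) failed"
    [ "$failures" -eq 0 ]
}
```

For example, `run_all_cleaning_scripts scripts bash` runs every discovered script with `bash` and reports how many failed. Because discovery uses `find`, a new `scripts/<dataset>/00_run_clean_raw_data.sh` is picked up without editing the loop.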

The workflow helps keep the data cleaning scripts functional and catches breaking changes early in the development process.

Fixes #23.



@Copilot Copilot AI changed the title from "[WIP] Write a github action that runs all data cleaning scripts" to "Add GitHub Actions workflow for automated data cleaning CI" Jul 2, 2025
@Copilot Copilot AI requested a review from ljwoods2 July 2, 2025 03:22
Copilot finished work on behalf of ljwoods2 July 2, 2025 03:22
Development

Successfully merging this pull request may close these issues.

Write a github action that runs all data cleaning scripts
2 participants