1 change: 1 addition & 0 deletions .gitignore
@@ -141,3 +141,4 @@ runpod/_version.py
.runpod_jobs.pkl

*.lock
benchmark_results/
285 changes: 285 additions & 0 deletions scripts/README.md
@@ -0,0 +1,285 @@
# Cold Start Benchmarking

Performance benchmarking tools for measuring and comparing cold start times across different code changes.

## Quick Start

```bash
# Run benchmark on current branch
uv run pytest tests/test_performance/test_cold_start.py

# Compare two branches
./scripts/benchmark_cold_start.sh main my-feature-branch

# Compare two existing result files
uv run python scripts/compare_benchmarks.py benchmark_results/cold_start_baseline.json benchmark_results/cold_start_latest.json
```

## What Gets Measured

- **Import times**: `import runpod`, `import runpod.serverless`, `import runpod.endpoint`
- **Module counts**: Total modules loaded and runpod-specific modules
- **Lazy loading status**: Whether `paramiko` and the SSH CLI modules are eagerly loaded or left lazy
- **Statistics**: Min, max, mean, median across 10 iterations per measurement (see the sketch below)
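
Conceptually, each number comes from timing an import inside a fresh Python interpreter (so nothing is pre-cached in `sys.modules`), repeating the run, and summarizing. The sketch below illustrates that pattern; the `measure_import` helper and its defaults are illustrative, not the exact code in `test_cold_start.py`.

```python
import statistics
import subprocess
import sys

def measure_import(module: str, iterations: int = 10) -> dict:
    """Time `import <module>` in a fresh interpreter, once per iteration."""
    snippet = (
        "import time; t0 = time.perf_counter(); "
        f"import {module}; "
        "print((time.perf_counter() - t0) * 1000)"
    )
    samples = []
    for _ in range(iterations):
        out = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True, text=True, check=True,
        )
        samples.append(float(out.stdout.strip()))
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "iterations": iterations,
    }

print(measure_import("runpod"))
```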

## Tools

### 1. test_cold_start.py

Core benchmark test that measures import performance in fresh Python subprocesses.

```bash
# Run as pytest test
uv run pytest tests/test_performance/test_cold_start.py -v

# Run as standalone script
uv run python tests/test_performance/test_cold_start.py

# Results saved to:
# - benchmark_results/cold_start_<timestamp>.json
# - benchmark_results/cold_start_latest.json (always latest)
```

**Output Example:**
```
Running cold start benchmarks...
------------------------------------------------------------
Measuring 'import runpod'...
  Mean: 273.29ms
Measuring 'import runpod.serverless'...
  Mean: 332.18ms
Counting loaded modules...
  Total modules: 582
  Runpod modules: 46
Checking if paramiko is eagerly loaded...
  Paramiko loaded: False
```
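
The module-count and lazy-loading checks follow the same subprocess pattern: import the package in a clean interpreter, then inspect `sys.modules`. A minimal sketch of that idea (the exact field names and filtering in the real test may differ):

```python
import json
import subprocess
import sys

SNIPPET = """
import json, sys
import runpod
print(json.dumps({
    "total": len(sys.modules),
    "runpod_modules": len([m for m in sys.modules if m.startswith("runpod")]),
    "paramiko_eagerly_loaded": "paramiko" in sys.modules,
}))
"""

out = subprocess.run([sys.executable, "-c", SNIPPET],
                     capture_output=True, text=True, check=True)
print(json.loads(out.stdout))
```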

### 2. benchmark_cold_start.sh

Automated benchmark runner that handles git branch switching, dependency installation, and result collection.

```bash
# Run on current branch (no git operations)
./scripts/benchmark_cold_start.sh

# Run on specific branch
./scripts/benchmark_cold_start.sh main

# Compare two branches (runs both, then compares)
./scripts/benchmark_cold_start.sh main feature/lazy-loading
```

**Features:**
- Automatic stash/unstash of uncommitted changes
- Dependency installation per branch
- Safe branch switching with restoration
- Timestamped result files
- Automatic comparison when two branches are given

**Safety:**
- Stashes uncommitted changes before switching branches
- Restores original branch after completion
- Handles errors gracefully
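
The stash/switch/restore flow amounts to a try/finally around the checkout, which is essentially what the script automates. A Python sketch of the same pattern, with illustrative names, assuming `git` and `uv` are available on the PATH:

```python
import subprocess

def git(*args: str) -> str:
    """Run a git command and return its trimmed stdout."""
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, check=True).stdout.strip()

def benchmark_branch(branch: str) -> None:
    original = git("rev-parse", "--abbrev-ref", "HEAD")
    dirty = bool(git("status", "--porcelain"))
    if dirty:
        git("stash", "push", "-m", "benchmark-autostash")
    try:
        git("checkout", branch)
        subprocess.run(["uv", "run", "pytest",
                        "tests/test_performance/test_cold_start.py"], check=True)
    finally:
        # Always restore the original branch and any stashed changes.
        git("checkout", original)
        if dirty:
            git("stash", "pop")
```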

### 3. compare_benchmarks.py

Analyzes and visualizes differences between two benchmark runs with colored terminal output.

```bash
uv run python scripts/compare_benchmarks.py <baseline.json> <optimized.json>
```

**Output Example:**
```
======================================================================
COLD START BENCHMARK COMPARISON
======================================================================

IMPORT TIME COMPARISON
----------------------------------------------------------------------
Metric                     Baseline       Optimized         Δ ms        Δ %
----------------------------------------------------------------------
runpod_total               285.64ms       273.29ms     ↓ 12.35ms      4.32%
runpod_serverless          376.33ms       395.14ms    ↑ -18.81ms     -5.00%
runpod_endpoint            378.61ms       399.36ms    ↑ -20.75ms     -5.48%

MODULE LOAD COMPARISON
----------------------------------------------------------------------
Total modules loaded:
  Baseline: 698      Optimized: 582      Δ: 116
Runpod modules loaded:
  Baseline: 48       Optimized: 46       Δ: 2

LAZY LOADING STATUS
----------------------------------------------------------------------
Paramiko    Baseline: LOADED    Optimized: NOT LOADED    ✓ NOW LAZY
SSH CLI     Baseline: LOADED    Optimized: NOT LOADED    ✓ NOW LAZY

======================================================================
SUMMARY
======================================================================
✓ Cold start improved by 12.35ms
✓ That's a 4.3% improvement over baseline
✓ Baseline: 285.64ms → Optimized: 273.29ms
======================================================================
```

**Color coding:**
- Green: Improvements (faster times, lazy loading achieved)
- Red: Regressions (slower times, eager loading introduced)
- Yellow: No change
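
Under the hood, the comparison is just the difference of the two means, with the sign deciding the color. A minimal sketch of that logic, assuming the JSON structure shown under "Result Files" below (the ANSI codes and column widths are illustrative, not the script's exact output format):

```python
import json
import sys

GREEN, RED, YELLOW, RESET = "\033[92m", "\033[91m", "\033[93m", "\033[0m"

def compare(baseline_path: str, optimized_path: str) -> None:
    baseline = json.load(open(baseline_path))
    optimized = json.load(open(optimized_path))
    for metric, base in baseline["measurements"].items():
        opt = optimized["measurements"].get(metric)
        if opt is None:
            continue
        delta = base["mean"] - opt["mean"]  # positive = optimized is faster
        pct = delta / base["mean"] * 100
        color = GREEN if delta > 0 else RED if delta < 0 else YELLOW
        print(f"{metric:<22} {base['mean']:>9.2f}ms {opt['mean']:>9.2f}ms "
              f"{color}{delta:>+8.2f}ms ({pct:+.2f}%){RESET}")

if __name__ == "__main__":
    compare(sys.argv[1], sys.argv[2])
```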

## Result Files

All benchmark results are saved to `benchmark_results/` (gitignored).

**File naming:**
- `cold_start_<timestamp>.json` - Timestamped result
- `cold_start_latest.json` - Always contains most recent result
- `cold_start_baseline.json` - Manually saved baseline for comparison

**JSON structure:**
```json
{
  "timestamp": 1763179522.0437188,
  "python_version": "3.8.20 (default, Oct 2 2024, 16:12:59) [Clang 18.1.8 ]",
  "measurements": {
    "runpod_total": {
      "min": 375.97,
      "max": 527.9,
      "mean": 393.91,
      "median": 380.4,
      "iterations": 10
    }
  },
  "module_counts": {
    "total": 698,
    "filtered": 48
  },
  "paramiko_eagerly_loaded": true,
  "ssh_cli_loaded": true
}
```

## Common Workflows

### Testing a Performance Optimization

```bash
# 1. Save baseline on main branch
git checkout main
./scripts/benchmark_cold_start.sh
cp benchmark_results/cold_start_latest.json benchmark_results/cold_start_baseline.json

# 2. Switch to feature branch
git checkout feature/my-optimization

# 3. Run benchmark and compare
./scripts/benchmark_cold_start.sh
uv run python scripts/compare_benchmarks.py \
    benchmark_results/cold_start_baseline.json \
    benchmark_results/cold_start_latest.json
```

### Comparing Multiple Approaches

```bash
# Benchmark main plus two optimization approaches
./scripts/benchmark_cold_start.sh main > results_main.txt
./scripts/benchmark_cold_start.sh feature/approach-1 > results_1.txt
./scripts/benchmark_cold_start.sh feature/approach-2 > results_2.txt

# Then compare each against baseline
uv run python scripts/compare_benchmarks.py \
    benchmark_results/cold_start_main_*.json \
    benchmark_results/cold_start_approach-1_*.json
```

### CI/CD Integration

Add to your GitHub Actions workflow:

```yaml
- name: Run cold start benchmark
  run: |
    uv run pytest tests/test_performance/test_cold_start.py --timeout=120

- name: Upload benchmark results
  uses: actions/upload-artifact@v3
  with:
    name: benchmark-results
    path: benchmark_results/cold_start_latest.json
```

## Performance Targets

Based on testing with Python 3.8:

- **Cold start (import runpod)**: < 300ms (mean)
- **Serverless import**: < 400ms (mean)
- **Module count**: < 600 total modules
- **Test assertion**: Fails if import > 1000ms
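
The hard 1000ms limit is the kind of check that fits naturally in a pytest assertion. A sketch of what such a guard might look like, reusing the subprocess timing pattern from the first sketch (the real test's structure, and any thresholds beyond the documented 1000ms, may differ):

```python
import statistics
import subprocess
import sys

HARD_LIMIT_MS = 1000  # documented failure threshold for `import runpod`

def test_import_runpod_under_hard_limit():
    snippet = ("import time; t0 = time.perf_counter(); import runpod; "
               "print((time.perf_counter() - t0) * 1000)")
    samples = [
        float(subprocess.run([sys.executable, "-c", snippet],
                             capture_output=True, text=True, check=True).stdout)
        for _ in range(10)
    ]
    assert statistics.mean(samples) < HARD_LIMIT_MS
```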

## Interpreting Results

### Import Time Variance

Subprocess-based measurements have inherent variance:
- First run in sequence: Often 20-50ms slower (Python startup overhead)
- Subsequent runs: More stable
- **Use median or mean** for comparison, not single runs
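
If you want a summary that is robust to that first-run penalty, one option is to treat the first sample as a warm-up and summarize the rest; this is an illustrative convention, not necessarily what the shipped test does:

```python
import statistics

def robust_summary(samples_ms):
    """Drop the first (typically slowest) sample, then summarize the rest."""
    stable = samples_ms[1:] if len(samples_ms) > 2 else samples_ms
    return {"mean": statistics.mean(stable), "median": statistics.median(stable)}

# Example: the first sample carries interpreter/startup overhead.
print(robust_summary([325.0, 276.1, 274.9, 279.3, 275.5]))
```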

### Module Count

- **Fewer modules = faster cold start**: Each module has import overhead
- **Runpod-specific modules**: Should be minimal (40-50)
- **Total modules**: Includes stdlib and dependencies
- **Target reduction**: Removing 100+ modules typically saves 10-30ms

### Lazy Loading Validation

- `paramiko_eagerly_loaded: false` - Good for serverless workers
- `ssh_cli_loaded: false` - Good for SDK users
- These should only be `true` when CLI commands are invoked

## Troubleshooting

### High Variance in Results

If you see >100ms variance between runs:
- System is under load
- Disk I/O contention
- Python bytecode cache issues

**Solution:** Run multiple times and use median values.

### benchmark_cold_start.sh Fails

```bash
# Check git status
git status

# Manually restore if script failed mid-execution
git checkout <original-branch>
git stash pop
```

### Import Errors During Benchmark

Ensure dependencies are installed:
```bash
uv sync --group test
```

## Benchmark Accuracy

- **Iterations**: 10 per measurement (configurable in test)
- **Process isolation**: Each measurement runs in a fresh subprocess
- **Module cache**: Every subprocess starts with an empty `sys.modules` (on-disk bytecode caches are not cleared)
- **System state**: Cannot control OS-level caching

For production performance testing, consider:
- Running on CI with consistent environment
- Multiple runs at different times
- Comparing trends over multiple commits