33 commits
0d87b8f
chore: release version v0.9.2 (#228)
kohlisid May 24, 2025
3ed02e6
feat: add async source transformer (#230)
kohlisid Jun 24, 2025
42f9fbd
chore: prerelease for 0.10.0a0 (#231)
kohlisid Jun 25, 2025
9a390a5
chore: optimize example docker files using multi-stage builds (#232)
sapkota-aayush Jul 8, 2025
c96b488
3/5 accum test files pass
Jul 16, 2025
0647b3e
Fix accumulator async runtime issues
Jul 16, 2025
464411c
Fix accumulator tests and improve error handling
Jul 16, 2025
b49d53c
merge
Jul 16, 2025
19d61a3
Fix merge conflicts
Jul 16, 2025
10ebf46
fix: example
Jul 17, 2025
32fda32
fix: tests and add e2e test
Jul 20, 2025
b4d7ccc
fix: cleanup logs
Jul 20, 2025
6466f31
fix: resolve conflicts
Jul 20, 2025
35156ba
fix: use optimized Dockerfile
Jul 21, 2025
ea5576c
fix: lint
Jul 21, 2025
54c0224
fix: tests and lint
Jul 21, 2025
1d2863c
fix: update example
Jul 21, 2025
82f83d5
fix: tests
Jul 21, 2025
ee3acf9
fix: lint
Jul 21, 2025
6b76b9d
fix: update proto
Jul 21, 2025
5774e80
fix: update docstring
Jul 21, 2025
7cb73d7
fix: update docker image name
Jul 21, 2025
d6968b6
fix: update close task
Jul 22, 2025
7faf39b
chore: fix broken make proto (#235)
tmenjo Jul 22, 2025
8aebf04
fix: tests
Jul 25, 2025
2070401
fix: add comprehensive accumulator window operation tests
Jul 25, 2025
0156e24
fix: conflicts
Jul 25, 2025
6d8399a
fix: comments and tests
Jul 27, 2025
53ec8da
fix: remove extra STREAM_EOF
Jul 27, 2025
e1a7c9a
fix: lint
Jul 27, 2025
42c0264
fix: remove tests
Jul 30, 2025
be9c47f
fix: lint
Jul 30, 2025
597638f
fix: add accumulator for e2e tests
Jul 30, 2025
2 changes: 1 addition & 1 deletion .github/workflows/build-push.yaml
@@ -23,7 +23,7 @@ jobs:
"examples/reducestream/counter", "examples/reducestream/sum", "examples/sideinput/simple_sideinput",
"examples/sideinput/simple_sideinput/udf", "examples/sink/async_log", "examples/sink/log",
"examples/source/simple_source", "examples/sourcetransform/event_time_filter",
- "examples/batchmap/flatmap"
+ "examples/batchmap/flatmap", "examples/accumulator/streamsorter"
]

steps:
17 changes: 9 additions & 8 deletions Makefile
@@ -26,13 +26,14 @@ setup:
poetry install --with dev --no-root

proto:
- python3 -m grpc_tools.protoc --pyi_out=pynumaflow/proto/sinker -I=pynumaflow/proto/sinker --python_out=pynumaflow/proto/sinker --grpc_python_out=pynumaflow/proto/sinker pynumaflow/proto/sinker/*.proto
- python3 -m grpc_tools.protoc --pyi_out=pynumaflow/proto/mapper -I=pynumaflow/proto/mapper --python_out=pynumaflow/proto/mapper --grpc_python_out=pynumaflow/proto/mapper pynumaflow/proto/mapper/*.proto
- python3 -m grpc_tools.protoc --pyi_out=pynumaflow/proto/reducer -I=pynumaflow/proto/reducer --python_out=pynumaflow/proto/reducer --grpc_python_out=pynumaflow/proto/reducer pynumaflow/proto/reducer/*.proto
- python3 -m grpc_tools.protoc --pyi_out=pynumaflow/proto/sourcetransformer -I=pynumaflow/proto/sourcetransformer --python_out=pynumaflow/proto/sourcetransformer --grpc_python_out=pynumaflow/proto/sourcetransformer pynumaflow/proto/sourcetransformer/*.proto
- python3 -m grpc_tools.protoc --pyi_out=pynumaflow/proto/sideinput -I=pynumaflow/proto/sideinput --python_out=pynumaflow/proto/sideinput --grpc_python_out=pynumaflow/proto/sideinput pynumaflow/proto/sideinput/*.proto
- python3 -m grpc_tools.protoc --pyi_out=pynumaflow/proto/sourcer -I=pynumaflow/proto/sourcer --python_out=pynumaflow/proto/sourcer --grpc_python_out=pynumaflow/proto/sourcer pynumaflow/proto/sourcer/*.proto
- python3 -m grpc_tools.protoc --pyi_out=pynumaflow/proto/accumulator -I=pynumaflow/proto/accumulator --python_out=pynumaflow/proto/accumulator --grpc_python_out=pynumaflow/proto/accumulator pynumaflow/proto/accumulator/*.proto
+ poetry run python3 -m grpc_tools.protoc --pyi_out=pynumaflow/proto/sinker -I=pynumaflow/proto/sinker --python_out=pynumaflow/proto/sinker --grpc_python_out=pynumaflow/proto/sinker pynumaflow/proto/sinker/*.proto
+ poetry run python3 -m grpc_tools.protoc --pyi_out=pynumaflow/proto/mapper -I=pynumaflow/proto/mapper --python_out=pynumaflow/proto/mapper --grpc_python_out=pynumaflow/proto/mapper pynumaflow/proto/mapper/*.proto
+ poetry run python3 -m grpc_tools.protoc --pyi_out=pynumaflow/proto/reducer -I=pynumaflow/proto/reducer --python_out=pynumaflow/proto/reducer --grpc_python_out=pynumaflow/proto/reducer pynumaflow/proto/reducer/*.proto
+ poetry run python3 -m grpc_tools.protoc --pyi_out=pynumaflow/proto/sourcetransformer -I=pynumaflow/proto/sourcetransformer --python_out=pynumaflow/proto/sourcetransformer --grpc_python_out=pynumaflow/proto/sourcetransformer pynumaflow/proto/sourcetransformer/*.proto
+ poetry run python3 -m grpc_tools.protoc --pyi_out=pynumaflow/proto/sideinput -I=pynumaflow/proto/sideinput --python_out=pynumaflow/proto/sideinput --grpc_python_out=pynumaflow/proto/sideinput pynumaflow/proto/sideinput/*.proto
+ poetry run python3 -m grpc_tools.protoc --pyi_out=pynumaflow/proto/sourcer -I=pynumaflow/proto/sourcer --python_out=pynumaflow/proto/sourcer --grpc_python_out=pynumaflow/proto/sourcer pynumaflow/proto/sourcer/*.proto
+ poetry run python3 -m grpc_tools.protoc --pyi_out=pynumaflow/proto/accumulator -I=pynumaflow/proto/accumulator --python_out=pynumaflow/proto/accumulator --grpc_python_out=pynumaflow/proto/accumulator pynumaflow/proto/accumulator/*.proto


- sed -i '' 's/^\(import.*_pb2\)/from . \1/' pynumaflow/proto/*/*.py
+ sed -i.bak -e 's/^\(import.*_pb2\)/from . \1/' pynumaflow/proto/*/*.py
+ rm pynumaflow/proto/*/*.py.bak
229 changes: 229 additions & 0 deletions docs/DOCKER_OPTIMIZATION.md
@@ -0,0 +1,229 @@
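# Docker Build Optimization for Numaflow Python UDFs

## Overview

This document describes strategies for reducing Docker build times for Numaflow Python UDFs. Builds currently take 2-3 minutes every time; with the optimized multi-stage Dockerfile, subsequent builds take roughly 45-60 seconds, and with a shared base image roughly 15-30 seconds.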

## Current Issues

1. **Redundant dependency installation**: Each UDF image rebuilds the entire pynumaflow package
2. **No layer caching**: Dependencies are reinstalled on every build, even when they have not changed
3. **Copying the entire project**: `COPY ./ ./` copies everything into the image, including files the UDF does not need, so a change to any file invalidates that layer
4. **No shared base layers**: Each UDF builds its own base environment

## Optimization Strategy: Three-Stage Approach

As suggested by @kohlisid, we implement a three-stage build approach:

### Stage 1: Base Layer
- Common Python environment and tools
- System dependencies (curl, wget, build-essential, git)
- Poetry installation
- dumb-init binary

### Stage 2: Environment Setup
- pynumaflow package installation
- Shared virtual environment creation
- This layer is cached unless `pyproject.toml` or `poetry.lock` changes

### Stage 3: Builder
- UDF-specific code and dependencies
- Reuses the pynumaflow installation from Stage 2
- Minimal additional dependencies
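
The three stages listed above might be laid out roughly as follows. This is a sketch under stated assumptions, not the contents of `Dockerfile.optimized`: the `python:3.11-slim` tag, the stage names, the `/app` paths, and the `example.py` entry point are all placeholders, and Stage 2 installs only the locked dependencies to keep the example short. The snippet writes the sketch to a scratch directory so it can be inspected without touching the repository.

```bash
# Write a hypothetical three-stage Dockerfile sketch to a scratch directory.
sketch_dir="$(mktemp -d)"
cat > "${sketch_dir}/Dockerfile.sketch" <<'EOF'
# Stage 1: base layer with common tooling (image tag is an assumption)
FROM python:3.11-slim AS base
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl build-essential \
    && rm -rf /var/lib/apt/lists/* \
    && pip install --no-cache-dir poetry

# Stage 2: install dependencies; this layer is reused until the two
# copied files change
FROM base AS environment
WORKDIR /app
COPY pyproject.toml poetry.lock ./
RUN poetry config virtualenvs.in-project true \
    && poetry install --no-root --no-interaction

# Stage 3: add the UDF code on top of the cached environment
# (example path and entry point are placeholders)
FROM environment AS builder
COPY examples/map/even_odd/ ./udf/
CMD ["/app/.venv/bin/python", "udf/example.py"]
EOF
echo "wrote ${sketch_dir}/Dockerfile.sketch"
```

Because the UDF code is copied only in the last stage, editing it leaves the first two stages cached.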

## Implementation Options

### Option 1: Optimized Multi-Stage Build (Recommended)

**File**: `examples/map/even_odd/Dockerfile.optimized`

**Benefits**:
- Better layer caching
- Reduced build time by ~60-70%
- No external dependencies (no separately built base image is required)

**Usage**:
```bash
cd examples/map/even_odd
make -f Makefile.optimized image
```

### Option 2: Shared Base Image (Fastest)

**Files**:
- `Dockerfile.base` (shared base image)
- `examples/map/even_odd/Dockerfile.shared-base` (UDF-specific)

**Benefits**:
- Maximum caching efficiency
- Build time reduced by ~80-90% for subsequent builds
- Perfect for CI/CD pipelines

**Usage**:
```bash
# Build base image once
docker build -f Dockerfile.base -t numaflow-python-base .

# Build UDF images (very fast)
cd examples/map/even_odd
make -f Makefile.optimized image-fast
```

## Performance Comparison

| Approach | First Build | Subsequent Builds | Cache Efficiency |
|----------|-------------|-------------------|------------------|
| Current | ~2-3 minutes | ~2-3 minutes | Poor |
| Optimized Multi-Stage | ~2-3 minutes | ~45-60 seconds | Good |
| Shared Base Image | ~2-3 minutes | ~15-30 seconds | Excellent |
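
The percentage reductions quoted for each option can be checked against this table with a little arithmetic. The helper below is a hypothetical convenience, not part of the repository; it uses integer math, so results round down to a whole percent. The inputs are approximate midpoints of the ranges in the table (150 s baseline, 52 s and 22 s rebuilds).

```bash
# Percentage reduction from a baseline time to a new time, both in seconds.
pct_reduction() {
  local before="$1" after="$2"
  # Integer arithmetic: the result is truncated to a whole percent.
  echo $(( (before - after) * 100 / before ))
}

pct_reduction 150 52   # optimized multi-stage rebuild vs. current
pct_reduction 150 22   # shared base image rebuild vs. current
```

The two results fall inside the ~60-70% and ~80-90% ranges stated for Option 1 and Option 2.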

## Implementation Steps

### 1. Build Shared Base Image (One-time setup)

```bash
# From project root
docker build -f Dockerfile.base -t numaflow-python-base .
```

### 2. Update UDF Dockerfiles

Replace the current Dockerfile with the optimized version:

```bash
# For each UDF directory
cp Dockerfile.optimized Dockerfile
# or
cp Dockerfile.shared-base Dockerfile
```

### 3. Update Makefiles

Use the optimized Makefile:

```bash
# For each UDF directory
cp Makefile.optimized Makefile
```

### 4. CI/CD Integration

For CI/CD pipelines, add the base image build step:

```yaml
# Example GitHub Actions step
- name: Build base image
  run: docker build -f Dockerfile.base -t numaflow-python-base .

- name: Build UDF images
  run: |
    cd examples/map/even_odd
    make image-fast
```

## Advanced Optimizations

### 1. Dependency Caching

The optimized Dockerfiles implement smart dependency caching:
- `pyproject.toml` and `poetry.lock` are copied first
- pynumaflow installation is cached separately
- UDF-specific dependencies are installed last
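
One way to picture the caching rule above: the dependency layer depends only on `pyproject.toml` and `poetry.lock`, so a fingerprint of those two files stays the same when only UDF code changes. The sketch below illustrates that idea in a scratch directory. It is an analogy for how Docker decides to reuse a `COPY` layer, not Docker's actual cache-key computation, and `deps_fingerprint` is a hypothetical helper. It assumes GNU `sha256sum` is available (on macOS, `shasum -a 256` is the usual substitute).

```bash
# Fingerprint the two files that decide whether the dependency layer is reused.
deps_fingerprint() {
  local dir="$1"
  # Hash both files together; an edit to either one changes the result.
  cat "${dir}/pyproject.toml" "${dir}/poetry.lock" | sha256sum | cut -c1-12
}

demo_dir="$(mktemp -d)"
printf '[tool.poetry]\nname = "demo"\n' > "${demo_dir}/pyproject.toml"
printf '# lock v1\n' > "${demo_dir}/poetry.lock"
fp_before="$(deps_fingerprint "${demo_dir}")"

# Adding UDF code leaves the fingerprint unchanged.
printf 'print("hello")\n' > "${demo_dir}/example.py"
fp_same="$(deps_fingerprint "${demo_dir}")"

# Changing the lock file produces a new fingerprint.
printf '# lock v2\n' > "${demo_dir}/poetry.lock"
fp_after="$(deps_fingerprint "${demo_dir}")"

echo "before=${fp_before} same=${fp_same} after=${fp_after}"
```

A fingerprint like this could also serve as a tag for the shared base image, so that it is rebuilt only when the dependency files change.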

### 2. Layer Optimization

- Minimal system dependencies in runtime image
- Separate build and runtime stages
- Efficient file copying with specific paths

### 3. Build Context Optimization

- Copy only necessary files
- Use `.dockerignore` to exclude unnecessary files
- Minimize build context size
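
A starting point for the `.dockerignore` mentioned above might look like the following. The entries are assumptions about what a typical Python project can leave out of the build context; in particular, excluding `docs` and `tests` is only safe if no Dockerfile copies from them. The snippet writes the file to a scratch directory rather than the repository root.

```bash
# Write a hypothetical .dockerignore to a scratch directory for review.
ctx_dir="$(mktemp -d)"
cat > "${ctx_dir}/.dockerignore" <<'EOF'
# Version control and editor state
.git
.idea
.vscode
# Python build and cache artefacts
**/__pycache__
**/*.pyc
.venv
dist
# Not needed inside a UDF image (assumption; adjust per project)
docs
tests
EOF
echo "wrote ${ctx_dir}/.dockerignore"
```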

## Migration Guide

### For Existing UDFs

1. **Back up the current Dockerfile**:
   ```bash
   cp Dockerfile Dockerfile.backup
   ```

2. **Choose an optimization approach**:
   - For a single UDF: use `Dockerfile.optimized`
   - For multiple UDFs: use `Dockerfile.shared-base`

3. **Update the Makefile**:
   ```bash
   cp Makefile.optimized Makefile
   ```

4. **Test the build**:
   ```bash
   make image
   # or
   make image-fast
   ```

### For New UDFs

1. **Use the optimized template**:
   ```bash
   cp examples/map/even_odd/Dockerfile.optimized your-udf/Dockerfile
   cp examples/map/even_odd/Makefile.optimized your-udf/Makefile
   ```

2. **Update paths in Dockerfile**:
   - Change `EXAMPLE_PATH` to your UDF path
   - Update `COPY` commands accordingly

## Troubleshooting

### Common Issues

1. **Base image not found**: build it from the project root:
   ```bash
   docker build -f Dockerfile.base -t numaflow-python-base .
   ```

2. **Permission issues**: make the entry script executable:
   ```bash
   chmod +x entry.sh
   ```

3. **Poetry cache issues**: clear the Poetry cache:
   ```bash
   poetry cache clear --all pypi
   ```
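
The "base image not found" case can be avoided by checking for the image before building a UDF. The helper below is a hypothetical sketch, not something shipped in the repository; it assumes it is run from the project root, where `Dockerfile.base` lives, and uses the `numaflow-python-base` tag from this document as its default.

```bash
# Build the shared base image only when it is not already present locally.
ensure_base_image() {
  local tag="${1:-numaflow-python-base}"
  if docker image inspect "${tag}" > /dev/null 2>&1; then
    echo "base image ${tag} already present"
  else
    docker build -f Dockerfile.base -t "${tag}" .
  fi
}
```

Calling `ensure_base_image` before `make image-fast` keeps the fast path from failing on a machine that has never built the base image.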

### Performance Monitoring

Monitor build times:
```bash
time make image
time make image-fast
```
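
To record build times in a script rather than read them off the terminal, a small wrapper can capture the elapsed seconds. `time_cmd` below is a hypothetical helper with whole-second resolution; it discards the command's output and passes its exit status through. `sleep 1` stands in for the real build so the sketch runs without Docker.

```bash
# Run a command, print how many whole seconds it took, and keep its exit status.
time_cmd() {
  local start end status
  start="$(date +%s)"
  "$@" > /dev/null 2>&1
  status=$?
  end="$(date +%s)"
  echo $(( end - start ))
  return "${status}"
}

# Stand-in command; in practice this would be: time_cmd make image
elapsed="$(time_cmd sleep 1)"
echo "build took ${elapsed}s"
```

Running it once before and once after a change gives the numbers needed for a before/after comparison.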

## Future Enhancements

1. **Registry-based base images**: Push base image to registry for team sharing
2. **BuildKit optimizations**: Enable BuildKit for parallel layer building
3. **Multi-platform builds**: Optimize for ARM64 and AMD64
4. **Dependency analysis**: Automate dependency optimization
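
As an illustration of the first and third items, a registry-hosted, multi-platform base image could be built with `docker buildx`. The sketch below only prints the command so it can be reviewed before use. The registry path is a placeholder, the `latest` tag is an assumption, and `buildx_cmd` is a hypothetical helper.

```bash
# Print (not run) a buildx command for a multi-platform base image.
# REGISTRY is a placeholder; replace it with a registry you control.
REGISTRY="${REGISTRY:-registry.example.com/team}"

buildx_cmd() {
  echo "docker buildx build" \
    "--platform linux/amd64,linux/arm64" \
    "-f Dockerfile.base" \
    "-t ${REGISTRY}/numaflow-python-base:latest" \
    "--push ."
}

buildx_cmd
```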

## Contributing

When adding new UDFs or modifying existing ones:

1. Use the optimized Dockerfile templates
2. Follow the three-stage approach
3. Test build times before and after changes
4. Update this documentation if needed

## References

- [Docker Multi-Stage Builds](https://docs.docker.com/develop/dev-best-practices/multistage-build/)
- [Docker Layer Caching](https://docs.docker.com/develop/dev-best-practices/dockerfile_best-practices/#leverage-build-cache)
- [Poetry Docker Best Practices](https://python-poetry.org/docs/configuration/#virtualenvsin-project)
55 changes: 0 additions & 55 deletions examples/accumulator/counter/Dockerfile

This file was deleted.

46 changes: 0 additions & 46 deletions examples/accumulator/counter/example.py

This file was deleted.
