Merged
34 commits
64ed543
docs: Fusion Snapshots incremental dumps and GCP support
fntlnz Nov 21, 2025
631232c
Apply changes from review
christopher-hakkaart Nov 27, 2025
5f922f5
Merge branch 'master' into lf/fusion-snapshots
christopher-hakkaart Nov 27, 2025
4c04118
Make changes to troubleshooting, and reorder sections
christopher-hakkaart Nov 28, 2025
23155a6
Move last sections over
christopher-hakkaart Nov 28, 2025
93a224d
Merge branch 'lf/fusion-snapshots' of github.com:seqeralabs/docs into…
christopher-hakkaart Nov 28, 2025
580567d
Revise headings
christopher-hakkaart Nov 28, 2025
9631d9a
Add connective links
christopher-hakkaart Nov 28, 2025
a55bcfb
Reframe contact section
christopher-hakkaart Nov 28, 2025
248faf0
Fix link
christopher-hakkaart Nov 28, 2025
bdb36df
Fix link
christopher-hakkaart Nov 28, 2025
dc89884
Add revisions
christopher-hakkaart Nov 30, 2025
552a6bb
Merge branch 'master' into lf/fusion-snapshots
christopher-hakkaart Nov 30, 2025
a3d6295
Code exit codes
christopher-hakkaart Nov 30, 2025
fecee51
Merge branch 'lf/fusion-snapshots' of github.com:seqeralabs/docs into…
christopher-hakkaart Dec 1, 2025
1233d2b
Add extra heading
christopher-hakkaart Dec 1, 2025
d807d97
Split admonition into two sections
christopher-hakkaart Dec 3, 2025
ff8bfd7
Add clean up section
christopher-hakkaart Dec 5, 2025
f5dbd0a
Update fusion_docs/guide/snapshots/gcp.md
MichaelTansiniSeqera Dec 8, 2025
05b214a
Update fusion_docs/guide/snapshots/index.md
MichaelTansiniSeqera Dec 8, 2025
b70f909
Update fusion_docs/guide/snapshots/gcp.md
MichaelTansiniSeqera Dec 8, 2025
7995499
Update fusion_docs/guide/snapshots/gcp.md
christopher-hakkaart Dec 8, 2025
2c47ce5
Merge branch 'master' into lf/fusion-snapshots
christopher-hakkaart Dec 11, 2025
b77bc0f
Update fusion_docs/troubleshooting.md
christopher-hakkaart Dec 17, 2025
11ac471
Update fusion_docs/troubleshooting.md
justinegeffen Dec 19, 2025
86d4d2f
Merge branch 'master' into lf/fusion-snapshots
justinegeffen Dec 19, 2025
2a767d3
Update aws.md
justinegeffen Dec 19, 2025
35c302a
Update configuration.md
justinegeffen Dec 19, 2025
1f0e367
Update gcp.md
justinegeffen Dec 19, 2025
3cfc2b0
Revise Fusion Snapshots documentation
justinegeffen Dec 19, 2025
0dc30d7
Update troubleshooting.md
justinegeffen Dec 19, 2025
2adde32
Update fusion_docs/guide/snapshots/gcp.md
justinegeffen Dec 19, 2025
ae59639
Update troubleshooting.md
justinegeffen Dec 19, 2025
912aad7
Apply suggestion from @justinegeffen
justinegeffen Dec 19, 2025
88 changes: 0 additions & 88 deletions fusion_docs/guide/snapshots.md

This file was deleted.

100 changes: 100 additions & 0 deletions fusion_docs/guide/snapshots/aws.md
@@ -0,0 +1,100 @@
---
title: AWS Batch
description: "Fusion Snapshots configuration and best practices for AWS Batch"
date created: "2024-11-21"
last updated: "2025-12-19"
tags: [fusion, fusion-snapshots, storage, compute, snapshot, aws, batch]
---

Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on AWS Batch Spot instances. When a Spot instance interruption occurs, AWS provides a guaranteed 120-second warning window to checkpoint and save the task state before the instance terminates.

## Seqera Platform compute environment requirements

Fusion Snapshots require the following Seqera Platform compute environment configuration:

- **Provider:** AWS Batch
- **Work directory:** S3 bucket in the same region as compute resources
- **Fusion Snapshots (beta):** Enabled
- **Config mode:** Batch Forge
- **Provisioning model:** Spot
- **AMI:** See [Selecting an AMI](#selecting-an-ami) for details
- **Instance type:** See [Selecting an EC2 instance](#selecting-an-ec2-instance) for details

:::tip
Fusion Snapshots work with sensible defaults (e.g., 5 automatic retry attempts). For configuration options, see [Advanced configuration](./configuration.md).
:::

### Selecting an AMI

Fusion Snapshots require instances running Amazon Linux 2023 (which ships with Linux Kernel 6.1) and an ECS container-optimized AMI for optimal performance.

#### Seqera Cloud

Seqera Cloud AWS Batch compute environments use an ECS container-optimized AMI by default. No additional AMI configuration is required.

#### Seqera Enterprise

Specify an Amazon Linux 2023 ECS-optimized AMI for your region when creating your compute environment.

To find the recommended AMI:

1. Retrieve the recommended AMI parameter from AWS Systems Manager:

```bash
export REGION=<AWS_REGION>
aws ssm get-parameter --name "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended" --region $REGION
```

Replace `<AWS_REGION>` with your AWS region (for example, `eu-central-1`).

The output for the `eu-central-1` region is similar to the following:

```json
{
    "Parameter": {
        "Name": "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
        "Type": "String",
        "Value": "{\"ecs_agent_version\":\"1.88.0\",\"ecs_runtime_version\":\"Docker version 25.0.6\",\"image_id\":\"ami-0281c9a5cd9de63bd\",\"image_name\":\"al2023-ami-ecs-hvm-2023.0.20241115-kernel-6.1-x86_64\",\"image_version\":\"2023.0.20241115\",\"os\":\"Amazon Linux 2023\",\"schema_version\":1,\"source_image_name\":\"al2023-ami-minimal-2023.6.20241111.0-kernel-6.1-x86_64\"}",
        "Version": 61,
        "LastModifiedDate": "2024-11-18T17:08:46.926000+01:00",
        "ARN": "arn:aws:ssm:eu-central-1::parameter/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
        "DataType": "text"
    }
}
```

1. Identify the `image_id` in your output (e.g., `ami-0281c9a5cd9de63bd` in the example above) and set it in the **Advanced options > AMI ID** field when you create your Seqera compute environment.
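
If you only need the AMI ID, the lookup can be narrowed to a single value. The following is a minimal sketch, assuming the `recommended/image_id` sub-parameter that AWS publishes alongside the `recommended` parameter. The function name `recommended_ami_id` is a hypothetical helper, not part of Fusion or Seqera Platform.

```bash
# Hypothetical helper: print the recommended Amazon Linux 2023
# ECS-optimized AMI ID for the region passed as the first argument.
recommended_ami_id() {
  local region="$1"
  aws ssm get-parameter \
    --name "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended/image_id" \
    --region "$region" \
    --query "Parameter.Value" \
    --output text
}

# Example: recommended_ami_id eu-central-1
```

The `--query` and `--output text` options make the AWS CLI print only the parameter value, so the result can be pasted directly into the **AMI ID** field.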

## Selecting an EC2 instance

AWS provides a guaranteed 120-second reclamation window. Select instance types that can transfer checkpoint data within this timeframe. Checkpoint time is primarily determined by memory usage. Other factors like the number of open file descriptors also affect performance.

When you select an EC2 instance:

- Select instances with guaranteed network bandwidth, not "up to" values.
- Keep the ratio of memory (GiB) to network bandwidth (Gbps) at roughly 5:1 or lower.
- Prefer NVMe storage instances (those with a `d` suffix: `c6id`, `r6id`, `m6id`).
- Use `x86_64` instances for [incremental snapshots](./index.md#incremental-snapshots).

For example, a `c6id.8xlarge` instance provides 64 GiB of memory and 12.5 Gbps of guaranteed network bandwidth, a ratio of about 5:1. This configuration can transfer the entire memory contents to S3 in approximately 70 seconds. Instances with memory:bandwidth ratios well above 5:1, such as `r6id.4xlarge` at 10.24:1, may not complete the transfer before termination, which risks task failure.
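
To check a candidate instance against this guidance, the ratio is simple arithmetic on the memory and bandwidth figures. The sketch below uses two hypothetical helper functions (`mem_bw_ratio` and `min_transfer_seconds` are illustrative names, not Fusion tooling). The second function gives only a theoretical lower bound: it assumes the link runs at full line rate with no protocol, storage, or checkpoint overhead, so real snapshot times will be longer.

```bash
# Memory-to-bandwidth ratio (GiB per Gbps), rounded to two decimals.
mem_bw_ratio() {
  awk -v mem="$1" -v bw="$2" 'BEGIN { printf "%.2f\n", mem / bw }'
}

# Lower bound, in whole seconds, for sending <mem> GiB over a <bw> Gbps link.
# 1 GiB = 8 * 1.073741824 gigabits. Assumes full line rate and no overhead.
min_transfer_seconds() {
  awk -v mem="$1" -v bw="$2" 'BEGIN { printf "%.0f\n", mem * 8 * 1.073741824 / bw }'
}

mem_bw_ratio 64 12.5          # c6id.8xlarge: prints 5.12
min_transfer_seconds 64 12.5  # prints 44
```

The snapshot time quoted above is higher than this bound. The gap presumably reflects overhead that the sketch ignores, so treat the bound as a sanity check rather than a prediction.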

| Instance type | Cores | Memory (GiB) | Network bandwidth (Gbps) | Memory:bandwidth ratio | Estimated snapshot time |
|----------------|-------|--------------|--------------------------|------------------------|-------------------------|
| `c6id.4xlarge` | 16 | 32 | 12.5 | 2.56:1 | ~45 seconds |
| `c6id.8xlarge` | 32 | 64 | 12.5 | 5.12:1 | ~70 seconds |
| `r6id.2xlarge` | 8 | 16 | 12.5 | 1.28:1 | ~20 seconds |
| `m6id.4xlarge` | 16 | 64 | 12.5 | 5.12:1 | ~70 seconds |
| `c6id.12xlarge` | 48 | 96 | 18.75 | 5.12:1 | ~70 seconds |
| `r6id.4xlarge` | 16 | 128 | 12.5 | 10.24:1 | ~105 seconds |
| `m6id.8xlarge` | 32 | 128 | 25 | 5.12:1 | ~70 seconds |

:::info
[Incremental snapshots](./index.md#incremental-snapshots) are enabled by default on `x86_64` instances.
:::

## Resource limits

A single job can request more resources than are available on a single instance. To prevent this, set resource limits using the `process.resourceLimits` directive in your Nextflow configuration. See [Resource limits](./configuration.md#resource-limits) for more information.

## Manual cleanup

The `/fusion` folder in object storage may need manual cleanup. Administrators should verify Fusion has properly cleaned up and remove the folder if necessary.
136 changes: 136 additions & 0 deletions fusion_docs/guide/snapshots/configuration.md
@@ -0,0 +1,136 @@
---
title: Advanced configuration
description: "Advanced configuration options for Fusion Snapshots"
date created: "2024-11-29"
last updated: "2025-12-19"
tags: [fusion, fusion-snapshots, snapshot, configuration, nextflow]
---

Fusion Snapshots work optimally with the default configuration for most workloads. You typically do not need to modify these settings unless you have specific organizational policies, experience issues with the default behavior, or have edge-case requirements.

:::tip
For troubleshooting, focus on task memory usage and instance selection before adjusting these advanced configuration options. See [Troubleshooting](../../troubleshooting.md) for more information.
:::

## Retry handling

When Spot instances are reclaimed, you can configure how Nextflow retries the tasks. There are two approaches:

- [Automatic retries with `maxSpotAttempts`](#automatic-retries-with-maxspotattempts)
- [Fine-grained retries with `errorStrategy`](#fine-grained-retries-with-errorstrategy)

### Automatic retries with `maxSpotAttempts`

The simplest approach uses `maxSpotAttempts` to automatically retry any task that fails due to Spot reclamation, regardless of the specific failure reason. When you enable Fusion Snapshots, Nextflow automatically sets `maxSpotAttempts = 5`. This allows the checkpoint to be restored on a new instance up to 5 times after reclamation.

**Increase retries**

If you experience frequent Spot reclamations, increase `maxSpotAttempts` above `5`:

- AWS Batch:

```groovy
aws.batch.maxSpotAttempts = 10
```

- Google Cloud Batch:

```groovy
google.batch.maxSpotAttempts = 10
```

**Disable retries**

To disable automatic retries, set `maxSpotAttempts = 0`:

- AWS Batch:

```groovy
aws.batch.maxSpotAttempts = 0
```

- Google Cloud Batch:

```groovy
google.batch.maxSpotAttempts = 0
```

### Fine-grained retries with `errorStrategy`

For fine-grained control of retries, configure your Nextflow [`errorStrategy`](https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy) to implement retry logic based on specific checkpoint failure types. This allows you to handle different failure scenarios separately (e.g., retrying checkpoint dump failures but not restore failures).

To configure this, set `maxSpotAttempts = 0` and add an [`errorStrategy`](https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy) to your process configuration. For example:

```groovy
process {
    maxRetries = 2
    errorStrategy = {
        if (task.exitStatus == 175) {
            return 'retry' // Retry checkpoint dump failures
        } else {
            return 'terminate' // Don't retry other failures
        }
    }
}
```

**Exit codes**:

- `175`: Checkpoint dump failed — The snapshot could not be saved (e.g., insufficient memory, I/O errors).
- `176`: Checkpoint restore failed — The snapshot could not be restored on the new instance.

**Configuration options**:

See [`errorStrategy`](https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy) for more configuration options.

## TCP connection handling

By default, Fusion Snapshots use `established` mode to preserve TCP connections during checkpoint operations. This works well for plain TCP connections. If your application uses SSL/TLS connections (HTTPS, SSH, etc.), you need to configure TCP close mode because CRIU cannot preserve encrypted connections.

To close all TCP connections during checkpoint operations, set:

```groovy
process.containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close'
```

**Options:**

- `established`: Preserve TCP connections (default).
- `close`: Close all TCP connections during checkpoint.

## Debug logging

By default, Fusion Snapshots use `WARN` level logging (warnings and errors only). If you are troubleshooting checkpoint issues, you can enable more detailed logging to help diagnose problems.

To enable debug logging, set:

```groovy
process.containerOptions = '-e FUSION_SNAPSHOT_LOG_LEVEL=debug'
```

**Log levels**:

- `ERROR`: Only critical errors
- `WARN`: Warnings and errors (default)
- `INFO`: General informational messages
- `DEBUG`: Detailed debug information

:::warning
Use `debug` logging only when troubleshooting. It is verbose and may impact performance.
:::
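
The TCP mode and log level settings on this page both assign `process.containerOptions`, and in a Nextflow configuration a later assignment to the same option replaces an earlier one. To apply both, pass the two variables in a single string. The following sketch writes them to a hypothetical extra config file (`snapshots-debug.config` is an illustrative name) that can be supplied to a run with the `-c` option.

```bash
# Write a hypothetical extra config file that sets both environment
# variables in one containerOptions string.
cat > snapshots-debug.config <<'EOF'
process.containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close -e FUSION_SNAPSHOT_LOG_LEVEL=debug'
EOF

# Then pass the file to a run (the pipeline name is a placeholder):
# nextflow run <PIPELINE> -c snapshots-debug.config
```

Keeping these settings in a separate file makes it easy to drop the debug options again once troubleshooting is complete.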

## Resource limits

By default, tasks can request any amount of resources. If a task requests more resources than are available on a single instance, the job waits indefinitely and never runs. Use the `process.resourceLimits` directive to set maximum requested resources below the capacity of a single instance.

Setting resource limits ensures tasks can checkpoint successfully and prevents jobs from becoming unschedulable. For example:

```groovy
// Use only one of the following; a later assignment overrides an earlier one.

// AWS Batch example (120-second reclamation window)
process.resourceLimits = [cpus: 32, memory: '60.GB']

// Google Cloud Batch example (up to 30-second reclamation window, so more conservative)
process.resourceLimits = [cpus: 16, memory: '20.GB']
```

See [AWS Batch](./aws.md) or [Google Cloud Batch](./gcp.md) for more information about reclamation windows. See [`resourceLimits`](https://www.nextflow.io/docs/latest/reference/process.html#resourcelimits) for more configuration options.
44 changes: 44 additions & 0 deletions fusion_docs/guide/snapshots/gcp.md
@@ -0,0 +1,44 @@
---
title: Google Cloud Batch
description: "Fusion Snapshots configuration and best practices for Google Cloud Batch"
date created: "2024-11-29"
last updated: "2025-12-19"
tags: [fusion, fusion-snapshots, storage, compute, snapshot, gcp, google, batch]
---

Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on Google Cloud Batch preemptible instances. When a preemption occurs, Google Batch provides up to 30 seconds before instance termination.

:::note
When using Google Cloud Batch, Fusion Snapshots is currently only available for Seqera Cloud.
:::

:::warning
Google Cloud [guarantees only up to 30 seconds](https://cloud.google.com/compute/docs/instances/spot) before instance termination. Careful instance selection and conservative memory planning are critical for successful checkpoints.
:::

## Seqera Platform compute environment requirements

Fusion Snapshots require the following Seqera Platform compute environment configuration:

- **Provider**: Google Batch
- **Work directory**: GCS bucket in the same region as compute resources
- **Fusion**: Enabled
- **Wave**: Enabled
- **Fusion Snapshots (beta)**: Enabled
- **Provisioning model**: Spot

:::tip Configuration
Fusion Snapshots work with sensible defaults (5 automatic retry attempts). For configuration options, see [Advanced configuration](./configuration.md).
:::

## Incremental snapshots

[Incremental snapshots](./index.md#incremental-snapshots) are enabled by default on `x86_64` instances and capture only the memory pages that changed between checkpoints. This is particularly beneficial given Google Batch's shorter reclamation window. Use `x86_64` instances to enable incremental snapshots.

## Resource limits

A single job can request more resources than are available on a single instance. To prevent this, set resource limits using the `process.resourceLimits` directive in your Nextflow configuration. See [Resource limits](./configuration.md#resource-limits) for more information.

## Manual cleanup

The `/fusion` folder in object storage may need manual cleanup. Administrators should verify Fusion has properly cleaned up and remove the folder if necessary.