seqeralabs · justinegeffen · Dec 19, 2025 · Nov 21, 2025 · Nov 27, 2025 · Nov 27, 2025
diff --git a/fusion_docs/guide/snapshots.md b/fusion_docs/guide/snapshots.md
diff --git a/fusion_docs/guide/snapshots/aws.md b/fusion_docs/guide/snapshots/aws.md
@@ -0,0 +1,100 @@
+---
+title: AWS Batch
+description: "Fusion Snapshots configuration and best practices for AWS Batch"
+date created: "2024-11-21"
+last updated: "2025-12-19"
+tags: [fusion, fusion-snapshots, storage, compute, snapshot, aws, batch]
+---
+
+Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on AWS Batch Spot instances. When a Spot instance interruption occurs, AWS provides a guaranteed 120-second warning window to checkpoint and save the task state before the instance terminates.
+
+## Seqera Platform compute environment requirements
+
+Fusion Snapshots require the following Seqera Platform compute environment configuration:
+
+- **Provider:** AWS Batch
+- **Work directory:** S3 bucket in the same region as compute resources
+- **Fusion Snapshots (beta):** Enabled
+- **Config mode:** Batch Forge
+- **Provisioning model:** Spot
+- **AMI:** See [Selecting an AMI](#selecting-an-ami) for details
+- **Instance type:** See [Selecting an EC2 instance](#selecting-an-ec2-instance) for details
+
+:::tip
+Fusion Snapshots work with sensible defaults (e.g., 5 automatic retry attempts). For configuration options, see [Advanced configuration](./configuration.md).
+:::
+
+### Selecting an AMI
+
+For optimal performance, Fusion Snapshots require instances that run an ECS-optimized Amazon Linux 2023 AMI (which ships with Linux kernel 6.1).
+
+#### Seqera Cloud
+
+Seqera Cloud AWS Batch compute environments use an ECS container-optimized AMI by default. No additional AMI configuration is required.
+
+#### Seqera Enterprise
+
+Specify an Amazon Linux 2023 ECS-optimized AMI for your region when creating your compute environment.
+
+To find the recommended AMI:
+
+1. Retrieve the recommended ECS-optimized AMI parameter from AWS Systems Manager Parameter Store:
+
+    ```bash
+    export REGION=<AWS_REGION>
+    aws ssm get-parameter --name "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended" --region "$REGION"
+    ```
+
+    Replace `<AWS_REGION>` with your AWS region (for example, `eu-central-1`).
+
+    The output for the `eu-central-1` region is similar to the following:
+
+    ```json
+    {
+        "Parameter": {
+            "Name": "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
+            "Type": "String",
+            "Value": "{\"ecs_agent_version\":\"1.88.0\",\"ecs_runtime_version\":\"Docker version 25.0.6\",\"image_id\":\"ami-0281c9a5cd9de63bd\",\"image_name\":\"al2023-ami-ecs-hvm-2023.0.20241115-kernel-6.1-x86_64\",\"image_version\":\"2023.0.20241115\",\"os\":\"Amazon Linux 2023\",\"schema_version\":1,\"source_image_name\":\"al2023-ami-minimal-2023.6.20241111.0-kernel-6.1-x86_64\"}",
+            "Version": 61,
+            "LastModifiedDate": "2024-11-18T17:08:46.926000+01:00",
+            "ARN": "arn:aws:ssm:eu-central-1::parameter/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
+            "DataType": "text"
+        }
+    }
+    ```
+
+1. Identify the `image_id` in your output (e.g., `ami-0281c9a5cd9de63bd` in the example above) and set it in the **Advanced options > AMI ID** field when you create your Seqera compute environment.
+
+## Selecting an EC2 instance
+
+AWS provides a guaranteed 120-second reclamation window. Select instance types that can transfer checkpoint data within this timeframe. Checkpoint time is primarily determined by memory usage. Other factors like the number of open file descriptors also affect performance.
+
+When you select an EC2 instance:
+
+- Select instances with guaranteed network bandwidth, not "up to" values.
+- Keep the ratio of memory (GiB) to network bandwidth (Gbps) at approximately 5:1 or lower.
+- Prefer NVMe storage instances (those with a `d` suffix: `c6id`, `r6id`, `m6id`).
+- Use `x86_64` instances for [incremental snapshots](./index.md#incremental-snapshots).
+
+For example, a `c6id.8xlarge` instance provides 64 GiB memory and 12.5 Gbps guaranteed network bandwidth (a ratio of 5.12:1). This configuration can transfer the entire memory contents to S3 in approximately 70 seconds. Instances with memory:bandwidth ratios well above 5:1, such as `r6id.4xlarge` at 10.24:1, may not complete transfers before termination and risk task failures.
+
+| Instance type   | Cores | Memory (GiB) | Network bandwidth (Gbps) | Memory:bandwidth ratio | Estimated snapshot time |
+|-----------------|-------|--------------|--------------------------|------------------------|-------------------------|
+| `c6id.4xlarge`  | 16    | 32           | 12.5                     | 2.56:1                 | ~45 seconds             |
+| `c6id.8xlarge`  | 32    | 64           | 12.5                     | 5.12:1                 | ~70 seconds             |
+| `r6id.2xlarge`  | 8     | 16           | 12.5                     | 1.28:1                 | ~20 seconds             |
+| `m6id.4xlarge`  | 16    | 64           | 12.5                     | 5.12:1                 | ~70 seconds             |
+| `c6id.12xlarge` | 48    | 96           | 18.75                    | 5.12:1                 | ~70 seconds             |
+| `r6id.4xlarge`  | 16    | 128          | 12.5                     | 10.24:1                | ~105 seconds            |
+| `m6id.8xlarge`  | 32    | 128          | 25                       | 5.12:1                 | ~70 seconds             |
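+
+The ratio column above is memory (GiB) divided by network bandwidth (Gbps). As a rough sketch, the following plain Groovy script reproduces that arithmetic for two rows of the table and also prints a theoretical lower bound on transfer time at full line rate. The script, its variable names, and the lower-bound calculation are illustrative assumptions and are not part of Fusion: the lower bound ignores checkpoint overhead, S3 throughput limits, and incremental snapshots, so real snapshot times (such as the estimates in the table) are expected to be longer.
+
+```groovy
+// Memory (GiB) and bandwidth (Gbps) values copied from the table above.
+def instances = [
+    'c6id.8xlarge': [memGiB: 64,  gbps: 12.5],
+    'r6id.4xlarge': [memGiB: 128, gbps: 12.5],
+]
+
+// Gigabits per GiB: 2^30 bytes * 8 bits / 10^9.
+def GBIT_PER_GIB = 8.589934592
+
+instances.each { name, spec ->
+    // Memory:bandwidth ratio, as used in the table.
+    def ratio = spec.memGiB / spec.gbps
+    // Best-case seconds to move all memory at line rate (no overhead).
+    def floorSeconds = spec.memGiB * GBIT_PER_GIB / spec.gbps
+    def rounded = floorSeconds.setScale(0, java.math.RoundingMode.HALF_UP)
+    println "${name}: ratio ${ratio}:1, line-rate lower bound ~${rounded} s"
+}
+```
+
+The ratios work out to 5.12:1 and 10.24:1, matching the table, and the line-rate lower bounds are roughly 44 and 88 seconds. The second figure leaves little headroom within the 120-second window, which is why high-ratio instances are risky.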
+
+:::info
+[Incremental snapshots](./index.md#incremental-snapshots) are enabled by default on `x86_64` instances.
+:::
+
+## Resource limits
+
+A single job can request more resources than are available on a single instance. To prevent this, set resource limits using the `process.resourceLimits` directive in your Nextflow configuration. See [Resource limits](./configuration.md#resource-limits) for more information.
+
+## Manual cleanup
+
+The `/fusion` folder in object storage may need manual cleanup. Administrators should verify Fusion has properly cleaned up and remove the folder if necessary.
diff --git a/fusion_docs/guide/snapshots/configuration.md b/fusion_docs/guide/snapshots/configuration.md
@@ -0,0 +1,136 @@
+---
+title: Advanced configuration
+description: "Advanced configuration options for Fusion Snapshots"
+date created: "2024-11-29"
+last updated: "2025-12-19"
+tags: [fusion, fusion-snapshots, snapshot, configuration, nextflow]
+---
+
+Fusion Snapshots work optimally with the default configuration for most workloads. You typically do not need to modify these settings unless you have specific organizational policies, experience issues with the default behavior, or have edge-case requirements.
+
+:::tip
+For troubleshooting, focus on task memory usage and instance selection before adjusting these advanced configuration options. See [Troubleshooting](../../troubleshooting.md) for more information.
+:::
+
+## Retry handling
+
+When Spot instances are reclaimed, you can configure how Nextflow retries the tasks. There are two approaches:
+
+- [Automatic retries with `maxSpotAttempts`](#automatic-retries-with-maxspotattempts)
+- [Fine-grained retries with `errorStrategy`](#fine-grained-retries-with-errorstrategy)
+
+### Automatic retries with `maxSpotAttempts`
+
+The simplest approach uses `maxSpotAttempts` to automatically retry any task that fails due to Spot reclamation, regardless of the specific failure reason. When you enable Fusion Snapshots, Nextflow automatically sets `maxSpotAttempts = 5`. This allows the checkpoint to be restored on a new instance after reclamation up to 5 times.
+
+**Increase retries**
+
+If you experience frequent Spot reclamations, increase `maxSpotAttempts` above `5`:
+
+- AWS Batch:
+
+    ```groovy
+    aws.batch.maxSpotAttempts = 10
+    ```
+
+- Google Cloud Batch:
+
+    ```groovy
+    google.batch.maxSpotAttempts = 10
+    ```
+
+**Disable retries**
+
+To disable automatic retries, set `maxSpotAttempts = 0`:
+
+- AWS Batch:
+
+    ```groovy
+    aws.batch.maxSpotAttempts = 0
+    ```
+
+- Google Cloud Batch:
+
+    ```groovy
+    google.batch.maxSpotAttempts = 0
+    ```
+
+### Fine-grained retries with `errorStrategy`
+
+For fine-grained control of retries, configure your Nextflow [`errorStrategy`](https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy) to implement retry logic based on specific checkpoint failure types. This allows you to handle different failure scenarios differently (for example, checkpoint dump failures versus restore failures).
+
+To configure this, set `maxSpotAttempts = 0` and add an [`errorStrategy`](https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy) to your process configuration. For example:
+
+```groovy
+process {
+    maxRetries = 2
+    errorStrategy = {
+        if (task.exitStatus == 175) {
+            return 'retry'  // Retry checkpoint dump failures
+        } else {
+            return 'terminate'  // Don't retry other failures
+        }
+    }
+}
+```
+
+**Exit codes**:
+
+- `175`: Checkpoint dump failed — The snapshot could not be saved (e.g., insufficient memory, I/O errors).
+- `176`: Checkpoint restore failed — The snapshot could not be restored on the new instance.
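+
+As a sketch of how the pieces above fit together, the following configuration disables automatic Spot retries on AWS Batch and retries on either snapshot exit code. Retrying restore failures (`176`) as well as dump failures is an illustrative policy choice, not a recommendation from this guide. Adjust the list of exit codes and `maxRetries` to suit your workloads.
+
+```groovy
+// Let errorStrategy decide: turn off automatic Spot retries (AWS Batch shown;
+// use google.batch.maxSpotAttempts on Google Cloud Batch).
+aws.batch.maxSpotAttempts = 0
+
+process {
+    maxRetries = 2
+    errorStrategy = {
+        // 175 = checkpoint dump failed, 176 = checkpoint restore failed
+        task.exitStatus in [175, 176] ? 'retry' : 'terminate'
+    }
+}
+```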
+
+**Configuration options**:
+
+See [`errorStrategy`](https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy) for more configuration options.
+
+## TCP connection handling
+
+By default, Fusion Snapshots use `established` mode to preserve TCP connections during checkpoint operations. This works well for plain TCP connections. If your application uses SSL/TLS connections (HTTPS, SSH, etc.), you need to configure TCP close mode because CRIU cannot preserve encrypted connections.
+
+To close all TCP connections during checkpoint operations, set:
+
+```groovy
+process.containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close'
+```
+
+**Options:**
+
+- `established`: Preserve TCP connections (default).
+- `close`: Close all TCP connections during checkpoint.
+
+## Debug logging
+
+By default, Fusion Snapshots use `WARN` level logging (warnings and errors only). If you are troubleshooting checkpoint issues, you can enable more detailed logging to help diagnose problems.
+
+To enable debug logging, set:
+
+```groovy
+process.containerOptions = '-e FUSION_SNAPSHOT_LOG_LEVEL=debug'
+```
+
+**Log levels**:
+
+- `ERROR`: Only critical errors
+- `WARN`: Warnings and errors (default)
+- `INFO`: General informational messages
+- `DEBUG`: Detailed debug information
+
+:::warning
+Use `debug` logging only when troubleshooting. It is verbose and may impact performance.
+:::
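+
+Both the log level and the TCP mode are passed through `process.containerOptions`. A later assignment to the same configuration option replaces an earlier one, so setting the two values on separate lines would keep only the last. A minimal sketch, assuming you need both settings at once, applies both environment variables in a single assignment:
+
+```groovy
+// Combine both Fusion Snapshots environment variables in one assignment.
+// Assigning process.containerOptions twice would keep only the last value.
+process.containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close -e FUSION_SNAPSHOT_LOG_LEVEL=debug'
+```
+
+Remove the log-level variable again once troubleshooting is complete.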
+
+## Resource limits
+
+By default, tasks can request any amount of resources. If a task requests more resources than are available on a single instance, the job waits indefinitely and never runs. Use the `process.resourceLimits` directive to set maximum requested resources below the capacity of a single instance.
+
+Setting resource limits ensures tasks can checkpoint successfully and prevents jobs from becoming unschedulable. For example:
+
+```groovy
+// Use only the setting that matches your provider. A later assignment replaces an earlier one.
+// AWS Batch example (120-second reclamation window)
+process.resourceLimits = [cpus: 32, memory: '60.GB']
+
+// Google Cloud Batch example (up to 30-second reclamation window, so limits are more conservative)
+process.resourceLimits = [cpus: 16, memory: '20.GB']
+```
+
+See [AWS Batch](./aws.md) or [Google Cloud Batch](./gcp.md) for more information about reclamation windows. See [`resourceLimits`](https://www.nextflow.io/docs/latest/reference/process.html#resourcelimits) for more configuration options.
diff --git a/fusion_docs/guide/snapshots/gcp.md b/fusion_docs/guide/snapshots/gcp.md
@@ -0,0 +1,44 @@
+---
+title: Google Cloud Batch
+description: "Fusion Snapshots configuration and best practices for Google Cloud Batch"
+date created: "2024-11-29"
+last updated: "2025-12-19"
+tags: [fusion, fusion-snapshots, storage, compute, snapshot, gcp, google, batch]
+---
+
+Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on Google Cloud Batch preemptible instances. When a preemption occurs, Google Batch provides up to 30 seconds before instance termination.
+
+:::note
+When using Google Cloud Batch, Fusion Snapshots is currently only available for Seqera Cloud.
+:::
+
+:::warning
+Google Cloud [guarantees only up to 30 seconds](https://cloud.google.com/compute/docs/instances/spot) before instance termination. Careful instance selection and conservative memory planning are critical for successful checkpoints.
+:::
+
+## Seqera Platform compute environment requirements
+
+Fusion Snapshots require the following Seqera Platform compute environment configuration:
+
+- **Provider**: Google Batch
+- **Work directory**: GCS bucket in the same region as compute resources
+- **Fusion**: Enabled
+- **Wave**: Enabled
+- **Fusion Snapshots (beta)**: Enabled
+- **Provisioning model**: Spot
+
+:::tip Configuration
+Fusion Snapshots work with sensible defaults (5 automatic retry attempts). For configuration options, see [Advanced configuration](./configuration.md).
+:::
+
+## Incremental snapshots
+
+[Incremental snapshots](./index.md#incremental-snapshots) capture only the memory pages that changed between checkpoints and are enabled by default on `x86_64` instances. This is particularly beneficial given Google Batch's shorter reclamation window, so use `x86_64` instances to take advantage of them.
+
+## Resource limits
+
+A single job can request more resources than are available on a single instance. To prevent this, set resource limits using the `process.resourceLimits` directive in your Nextflow configuration. See [Resource limits](./configuration.md#resource-limits) for more information.
+
+## Manual cleanup
+
+The `/fusion` folder in object storage may need manual cleanup. Administrators should verify Fusion has properly cleaned up and remove the folder if necessary.