Commit 07085dc

fntlnz, christopher-hakkaart, MichaelTansiniSeqera, and justinegeffen authored
docs: Fusion Snapshots incremental dumps and GCP support (#929)
* docs: Fusion Snapshots incremental dumps and GCP support

  Some restructuring on the Snapshots documentation as well to accommodate the fact that we went multi cloud now.

* Apply changes from review
* Make changes to troubleshooting, and reorder sections
* Move last sections over
* Revise headings
* Add connective links
* Reframe contact section
* Fix link
* Fix link
* Add revisions
* Code exit codes
* Add extra heading
* Split admonition into two sections
* Add clean up section
* Update fusion_docs/guide/snapshots/gcp.md
  Co-authored-by: Chris Hakkaart <[email protected]>
  Signed-off-by: MichaelTansiniSeqera <[email protected]>
* Update fusion_docs/guide/snapshots/index.md
  Co-authored-by: Chris Hakkaart <[email protected]>
  Signed-off-by: MichaelTansiniSeqera <[email protected]>
* Update fusion_docs/guide/snapshots/gcp.md
  Co-authored-by: Chris Hakkaart <[email protected]>
  Signed-off-by: MichaelTansiniSeqera <[email protected]>
* Update fusion_docs/guide/snapshots/gcp.md
  Signed-off-by: Chris Hakkaart <[email protected]>
* Update fusion_docs/troubleshooting.md
  Signed-off-by: Chris Hakkaart <[email protected]>
* Update fusion_docs/troubleshooting.md
  Co-authored-by: Chris Hakkaart <[email protected]>
  Signed-off-by: Justine Geffen <[email protected]>
* Update aws.md
  Signed-off-by: Justine Geffen <[email protected]>
* Update configuration.md
  Signed-off-by: Justine Geffen <[email protected]>
* Update gcp.md
  Signed-off-by: Justine Geffen <[email protected]>
* Revise Fusion Snapshots documentation

  Updated metadata and improved formatting for clarity and consistency.
  Signed-off-by: Justine Geffen <[email protected]>
* Update troubleshooting.md
  Signed-off-by: Justine Geffen <[email protected]>
* Update fusion_docs/guide/snapshots/gcp.md
  Co-authored-by: Chris Hakkaart <[email protected]>
  Signed-off-by: Justine Geffen <[email protected]>
* Update troubleshooting.md
  Signed-off-by: Justine Geffen <[email protected]>
* Apply suggestion from @justinegeffen
  Signed-off-by: Justine Geffen <[email protected]>

---------

Signed-off-by: MichaelTansiniSeqera <[email protected]>
Signed-off-by: Chris Hakkaart <[email protected]>
Signed-off-by: Justine Geffen <[email protected]>
Co-authored-by: Christopher Hakkaart <[email protected]>
Co-authored-by: MichaelTansiniSeqera <[email protected]>
Co-authored-by: Justine Geffen <[email protected]>
1 parent ee4f2d3 commit 07085dc

File tree

7 files changed

+662
-91
lines changed

fusion_docs/guide/snapshots.md

Lines changed: 0 additions & 88 deletions
This file was deleted.

fusion_docs/guide/snapshots/aws.md

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
1+
---
2+
title: AWS Batch
3+
description: "Fusion Snapshots configuration and best practices for AWS Batch"
4+
date created: "2024-11-21"
5+
last updated: "2025-12-19"
6+
tags: [fusion, fusion-snapshots, storage, compute, snapshot, aws, batch]
7+
---
8+
9+
Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on AWS Batch Spot instances. When a Spot instance interruption occurs, AWS provides a guaranteed 120-second warning window to checkpoint and save the task state before the instance terminates.
10+
11+
## Seqera Platform compute environment requirements
12+
13+
Fusion Snapshots require the following Seqera Platform compute environment configuration:
14+
15+
- **Provider:** AWS Batch
16+
- **Work directory:** S3 bucket in the same region as compute resources
17+
- **Fusion Snapshots (beta):** Enabled
18+
- **Config mode:** Batch Forge
19+
- **Provisioning model:** Spot
20+
- **AMI:** See [Selecting an AMI](#selecting-an-ami) for details
21+
- **Instance type:** See [Selecting an EC2 instance](#selecting-an-ec2-instance) for details
22+
23+
:::tip
24+
Fusion Snapshots work with sensible defaults (e.g., 5 automatic retry attempts). For configuration options, see [Advanced configuration](./configuration.md).
25+
:::
26+
27+
### Selecting an AMI
28+
29+
Fusion Snapshots require instances running Amazon Linux 2023 (which ships with Linux Kernel 6.1) and an ECS container-optimized AMI for optimal performance.
30+
31+
#### Seqera Cloud
32+
33+
Seqera Cloud AWS Batch compute environments use an ECS container-optimized AMI by default. No additional AMI configuration is required.
34+
35+
#### Seqera Enterprise
36+
37+
Specify an Amazon Linux 2023 ECS-optimized AMI for your region when creating your compute environment.
38+
39+
To find the recommended AMI:
40+
41+
1. Retrieve the recommended ECS-optimized AMI parameter from AWS Systems Manager Parameter Store:
42+
43+
```bash
44+
export REGION=<AWS_REGION>
45+
aws ssm get-parameter --name "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended" --region "$REGION"
46+
```
47+
48+
Replace `<AWS_REGION>` with your AWS region (for example, `eu-central-1`).
49+
50+
The output for the `eu-central-1` region is similar to the following:
51+
52+
```json
53+
{
54+
"Parameter": {
55+
"Name": "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
56+
"Type": "String",
57+
"Value": "{\"ecs_agent_version\":\"1.88.0\",\"ecs_runtime_version\":\"Docker version 25.0.6\",\"image_id\":\"ami-0281c9a5cd9de63bd\",\"image_name\":\"al2023-ami-ecs-hvm-2023.0.20241115-kernel-6.1-x86_64\",\"image_version\":\"2023.0.20241115\",\"os\":\"Amazon Linux 2023\",\"schema_version\":1,\"source_image_name\":\"al2023-ami-minimal-2023.6.20241111.0-kernel-6.1-x86_64\"}",
58+
"Version": 61,
59+
"LastModifiedDate": "2024-11-18T17:08:46.926000+01:00",
60+
"ARN": "arn:aws:ssm:eu-central-1::parameter/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
61+
"DataType": "text"
62+
}
}
63+
```
64+
65+
1. Identify the `image_id` in your output (e.g., `ami-0281c9a5cd9de63bd` in the above example) and set it in the **Advanced options > AMI ID** field when you create your Seqera compute environment.
66+
67+
## Selecting an EC2 instance
68+
69+
AWS provides a guaranteed 120-second reclamation window. Select instance types that can transfer checkpoint data within this timeframe. Checkpoint time is primarily determined by memory usage. Other factors like the number of open file descriptors also affect performance.
70+
71+
When you select an EC2 instance:
72+
73+
- Select instances with guaranteed network bandwidth, not "up to" values.
74+
- Keep the ratio of memory (GiB) to network bandwidth (Gbps) at roughly 5:1 or lower.
75+
- Prefer NVMe storage instances (those with a `d` suffix: `c6id`, `r6id`, `m6id`).
76+
- Use `x86_64` instances for [incremental snapshots](./index.md#incremental-snapshots).
77+
78+
For example, a `c6id.8xlarge` instance provides 64 GiB memory and 12.5 Gbps guaranteed network bandwidth. This configuration can transfer the entire memory contents to S3 in approximately 70 seconds. Instances with memory:bandwidth ratios over 5:1 may not complete transfers before termination and risk task failures.
79+
80+
| Instance type | Cores | Memory (GiB) | Network bandwidth (Gbps) | Memory:bandwidth ratio | Estimated snapshot time |
81+
|----------------|-------|--------------|--------------------------|------------------------|-------------------------|
82+
| `c6id.4xlarge` | 16 | 32 | 12.5 | 2.56:1 | ~45 seconds |
83+
| `c6id.8xlarge` | 32 | 64 | 12.5 | 5.12:1 | ~70 seconds |
84+
| `r6id.2xlarge` | 8 | 16 | 12.5 | 1.28:1 | ~20 seconds |
85+
| `m6id.4xlarge` | 16 | 64 | 12.5 | 5.12:1 | ~70 seconds |
86+
| `c6id.12xlarge`| 48 | 96 | 18.75 | 5.12:1 | ~70 seconds |
87+
| `r6id.4xlarge` | 16 | 128 | 12.5 | 10.24:1 | ~105 seconds |
88+
| `m6id.8xlarge` | 32 | 128 | 25 | 5.12:1 | ~70 seconds |
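
The memory:bandwidth ratios in the table can be reproduced with simple arithmetic. The following Groovy sketch uses two hypothetical helper functions (they are not part of Fusion or Nextflow) to compute the ratio and a theoretical lower bound on transfer time, assuming the full guaranteed bandwidth is sustained. For 64 GiB at 12.5 Gbps, this gives a ratio of 5.12:1 and a lower bound of about 44 seconds. The estimates in the table are higher than this bound, presumably because checkpointing adds overhead and real transfers do not sustain line rate.

```groovy
// Hypothetical helpers for illustration only; not part of Fusion or Nextflow.

// Ratio of memory (GiB) to guaranteed network bandwidth (Gbps).
double memoryToBandwidthRatio(double memoryGiB, double bandwidthGbps) {
    return memoryGiB / bandwidthGbps
}

// Lower bound on the time to move all memory at full line rate.
// 1 GiB = 2^30 bytes = 8.589934592 gigabits.
double lowerBoundTransferSeconds(double memoryGiB, double bandwidthGbps) {
    double gigabits = memoryGiB * 8 * 1.073741824d
    return gigabits / bandwidthGbps
}

// Memory and bandwidth values taken from the table above.
[
    'c6id.8xlarge': [memory: 64, bandwidth: 12.5],
    'r6id.4xlarge': [memory: 128, bandwidth: 12.5],
].each { name, spec ->
    double ratio = memoryToBandwidthRatio(spec.memory, spec.bandwidth)
    double seconds = lowerBoundTransferSeconds(spec.memory, spec.bandwidth)
    println String.format('%s: ratio %.2f:1, at least %.0f s at line rate', name, ratio, seconds)
}
```

A lower bound close to or above 120 seconds means the instance cannot complete a full-memory checkpoint within the reclamation window even under ideal conditions.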
89+
90+
:::info
91+
[Incremental snapshots](./index.md#incremental-snapshots) are enabled by default on `x86_64` instances.
92+
:::
93+
94+
## Resource limits
95+
96+
A single job can request more resources than are available on a single instance. To prevent this, set resource limits using the `process.resourceLimits` directive in your Nextflow configuration. See [Resource limits](./configuration.md#resource-limits) for more information.
97+
98+
## Manual cleanup
99+
100+
The `/fusion` folder in object storage may need manual cleanup. Administrators should verify Fusion has properly cleaned up and remove the folder if necessary.

fusion_docs/guide/snapshots/configuration.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
1+
---
2+
title: Advanced configuration
3+
description: "Advanced configuration options for Fusion Snapshots"
4+
date created: "2024-11-29"
5+
last updated: "2025-12-19"
6+
tags: [fusion, fusion-snapshots, snapshot, configuration, nextflow]
7+
---
8+
9+
Fusion Snapshots work optimally with default configuration for most workloads. You typically do not need to modify these settings unless you have specific organizational policies, experience issues with default behavior, or have edge case requirements.
10+
11+
:::tip
12+
For troubleshooting, focus on task memory usage and instance selection before adjusting these advanced configuration options. See [Troubleshooting](../../troubleshooting.md) for more information.
13+
:::
14+
15+
## Retry handling
16+
17+
When Spot instances are reclaimed, you can configure how Nextflow retries the tasks. There are two approaches:
18+
19+
- [Automatic retries with `maxSpotAttempts`](#automatic-retries-with-maxspotattempts)
20+
- [Fine-grained retries with `errorStrategy`](#fine-grained-retries-with-errorstrategy)
21+
22+
### Automatic retries with `maxSpotAttempts`
23+
24+
The simplest approach uses `maxSpotAttempts` to automatically retry any task that fails due to Spot reclamation, regardless of the specific failure reason. When you enable Fusion Snapshots, Nextflow automatically sets `maxSpotAttempts = 5`. This allows the checkpoint to be restored on a new instance after reclamation up to 5 times.
25+
26+
**Increase retries**
27+
28+
If you experience frequent Spot reclamations, increase `maxSpotAttempts` above `5`:
29+
30+
- AWS Batch:
31+
32+
```groovy
33+
aws.batch.maxSpotAttempts = 10
34+
```
35+
36+
- Google Cloud Batch:
37+
38+
```groovy
39+
google.batch.maxSpotAttempts = 10
40+
```
41+
42+
**Disable retries**
43+
44+
To disable automatic retries, set `maxSpotAttempts = 0`:
45+
46+
- AWS Batch:
47+
48+
```groovy
49+
aws.batch.maxSpotAttempts = 0
50+
```
51+
52+
- Google Cloud Batch:
53+
54+
```groovy
55+
google.batch.maxSpotAttempts = 0
56+
```
57+
58+
### Fine-grained retries with `errorStrategy`
59+
60+
For fine-grained control of retries, configure your Nextflow [`errorStrategy`](https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy) to implement retry logic based on specific checkpoint failure types. This allows you to handle different failure scenarios differently (e.g., checkpoint dump failures versus restore failures).
61+
62+
To configure this, set `maxSpotAttempts = 0` and add an [`errorStrategy`](https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy) to your process configuration. For example:
63+
64+
```groovy
65+
process {
66+
maxRetries = 2
67+
errorStrategy = {
68+
if (task.exitStatus == 175) {
69+
return 'retry' // Retry checkpoint dump failures
70+
} else {
71+
return 'terminate' // Don't retry other failures
72+
}
73+
}
74+
}
75+
```
76+
77+
**Exit codes**:
78+
79+
- `175`: Checkpoint dump failed — The snapshot could not be saved (e.g., insufficient memory, I/O errors).
80+
- `176`: Checkpoint restore failed — The snapshot could not be restored on the new instance.
81+
82+
**Configuration options**:
83+
84+
See [`errorStrategy`](https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy) for more configuration options.
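
The example above retries only dump failures. A minimal sketch that treats the two exit codes differently might look like the following. The policy itself (always retry dump failures, retry a restore failure once, then let running tasks finish) is an illustrative assumption, not a recommendation from Seqera. It relies only on the standard `task.exitStatus` and `task.attempt` properties and the `retry`, `finish`, and `terminate` strategies.

```groovy
// Illustrative policy only; adjust the thresholds to your workload.
// Assumes automatic Spot retries are disabled (maxSpotAttempts = 0).
process {
    maxRetries = 3
    errorStrategy = {
        if (task.exitStatus == 175) {
            // Checkpoint dump failed: retry the task
            return 'retry'
        }
        if (task.exitStatus == 176) {
            // Checkpoint restore failed: retry once, then let
            // already-running tasks finish before stopping
            return task.attempt <= 1 ? 'retry' : 'finish'
        }
        // Any other failure stops the run immediately
        return 'terminate'
    }
}
```

Retries requested by the closure are still bounded by `maxRetries`.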
85+
86+
## TCP connection handling
87+
88+
By default, Fusion Snapshots use `established` mode to preserve TCP connections during checkpoint operations. This works well for plain TCP connections. If your application uses SSL/TLS connections (HTTPS, SSH, etc.), you need to configure TCP close mode because CRIU cannot preserve encrypted connections.
89+
90+
To close all TCP connections during checkpoint operations, set:
91+
92+
```groovy
93+
process.containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close'
94+
```
95+
96+
**Options:**
97+
98+
- `established`: Preserve TCP connections (default).
99+
- `close`: Close all TCP connections during checkpoint.
100+
101+
## Debug logging
102+
103+
By default, Fusion Snapshots use `WARN` level logging (warnings and errors only). If you are troubleshooting checkpoint issues, you can enable more detailed logging to help diagnose problems.
104+
105+
To enable debug logging, set:
106+
107+
```groovy
108+
process.containerOptions = '-e FUSION_SNAPSHOT_LOG_LEVEL=debug'
109+
```
110+
111+
**Log levels**:
112+
113+
- `ERROR`: Only critical errors
114+
- `WARN`: Warnings and errors (default)
115+
- `INFO`: General informational messages
116+
- `DEBUG`: Detailed debug information
117+
118+
:::warning
119+
Use `debug` logging only when troubleshooting. It is verbose and may impact performance.
120+
:::
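
Both the TCP mode and the log level are passed through `process.containerOptions`. In a Nextflow configuration file, a later assignment to the same option replaces the earlier one, so setting the two on separate lines leaves only the last one in effect. A sketch that sets both in a single assignment, assuming you want TCP close mode and debug logging together:

```groovy
// Pass both environment variables in one containerOptions string
// so that neither assignment overwrites the other.
process.containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close -e FUSION_SNAPSHOT_LOG_LEVEL=debug'
```

If your configuration already sets other container options, append these flags to the existing string.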
121+
122+
## Resource limits
123+
124+
By default, tasks can request any amount of resources. If a task requests more resources than are available on a single instance, the job waits indefinitely and never runs. Use the `process.resourceLimits` directive to set maximum requested resources below the capacity of a single instance.
125+
126+
Setting resource limits ensures tasks can checkpoint successfully and prevents jobs from becoming unschedulable. For example:
127+
128+
```groovy
129+
// AWS Batch example (120-second reclamation window)
130+
process.resourceLimits = [cpus: 32, memory: '60.GB']
131+
132+
// Google Cloud Batch example (reclamation window of up to 30 seconds, so limits are more conservative)
133+
process.resourceLimits = [cpus: 16, memory: '20.GB']
134+
```
135+
136+
See [AWS Batch](./aws.md) or [Google Cloud Batch](./gcp.md) for more information about reclamation windows. See [`resourceLimits`](https://www.nextflow.io/docs/latest/reference/process.html#resourcelimits) for more configuration options.
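
Resource limits are most useful when a process scales its requests on retry. The following sketch assumes a hypothetical process named `ALIGN_READS` whose memory request grows with each attempt. With `resourceLimits` set, Nextflow caps the request at the limit instead of submitting a job that no single instance can run.

```groovy
process {
    // Cap taken from the AWS Batch example above
    resourceLimits = [cpus: 32, memory: 60.GB]

    // 'ALIGN_READS' is a hypothetical process name used for illustration
    withName: 'ALIGN_READS' {
        errorStrategy = 'retry'
        maxRetries = 3
        // Requests 24 GB, then 48 GB; the third attempt would request
        // 72 GB but is reduced to the 60 GB limit
        memory = { 24.GB * task.attempt }
    }
}
```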

fusion_docs/guide/snapshots/gcp.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
1+
---
2+
title: Google Cloud Batch
3+
description: "Fusion Snapshots configuration and best practices for Google Cloud Batch"
4+
date created: "2024-11-29"
5+
last updated: "2025-12-19"
6+
tags: [fusion, fusion-snapshots, storage, compute, snapshot, gcp, google, batch]
7+
---
8+
9+
Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on Google Cloud Batch preemptible instances. When a preemption occurs, Google Batch provides up to 30 seconds before instance termination.
10+
11+
:::note
12+
When using Google Cloud Batch, Fusion Snapshots is currently only available for Seqera Cloud.
13+
:::
14+
15+
:::warning
16+
Google Cloud [provides at most 30 seconds](https://cloud.google.com/compute/docs/instances/spot) before instance termination; a full 30 seconds is not guaranteed. Careful instance selection and conservative memory planning are critical for successful checkpoints.
17+
:::
18+
19+
## Seqera Platform compute environment requirements
20+
21+
Fusion Snapshots require the following Seqera Platform compute environment configuration:
22+
23+
- **Provider**: Google Batch
24+
- **Work directory**: GCS bucket in the same region as compute resources
25+
- **Fusion**: Enabled
26+
- **Wave**: Enabled
27+
- **Fusion Snapshots (beta)**: Enabled
28+
- **Provisioning model**: Spot
29+
30+
:::tip Configuration
31+
Fusion Snapshots work with sensible defaults (5 automatic retry attempts). For configuration options, see [Advanced configuration](./configuration.md).
32+
:::
33+
34+
## Incremental snapshots
35+
36+
[Incremental snapshots](./index.md#incremental-snapshots) capture only the memory pages that changed between checkpoints and are enabled by default on `x86_64` instances. This is particularly beneficial given Google Batch's shorter reclamation window. Use `x86_64` instances to take advantage of incremental snapshots.
37+
38+
## Resource limits
39+
40+
A single job can request more resources than are available on a single instance. To prevent this, set resource limits using the `process.resourceLimits` directive in your Nextflow configuration. See [Resource limits](./configuration.md#resource-limits) for more information.
41+
42+
## Manual cleanup
43+
44+
The `/fusion` folder in object storage may need manual cleanup. Administrators should verify Fusion has properly cleaned up and remove the folder if necessary.
