Commit a9dc7ac

docs: Fusion Snapshots incremental dumps and GCP support
Also restructures the Snapshots documentation to accommodate multi-cloud support.
1 parent 1320507 commit a9dc7ac

6 files changed

+638
-47
lines changed

Lines changed: 77 additions & 46 deletions
Original file line number | Diff line number | Diff line change
@@ -1,43 +1,90 @@
11
---
22
title: Fusion Snapshots for AWS Batch
3-
description: "Overview of the Fusion Snapshots feature for AWS Batch"
4-
date: "23 Aug 2024"
5-
tags: [fusion, storage, compute, file system, snapshot]
3+
description: "Fusion Snapshots configuration and best practices for AWS Batch"
4+
date: "21 Nov 2024"
5+
tags: [fusion, storage, compute, snapshot, aws, batch]
66
---
77

8-
Fusion Snapshots enable you to create a checkpoint of a running Nextflow process and restore it on a different machine. Leveraging the Fusion file system, you can move the task between different instances connected to the same S3 bucket.
8+
Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on AWS Batch Spot instances. When a Spot instance interruption occurs, AWS provides a **guaranteed 120-second warning window** to checkpoint and save the task state before the instance terminates.
99

10-
More specifically, the first use case for this feature is for Seqera Platform users to leverage AWS Spot instances without restarting an entire task from scratch if an instance is terminated. When a Spot instance interruption occurs, the task is restored from the last checkpoint on a new AWS instance, saving time and computational resources.
10+
## Requirements
1111

12-
### Seqera Platform compute environment requirements
13-
14-
Fusion Snapshots v1 requires the following [Seqera compute environment](https://docs.seqera.io/platform-cloud/compute-envs/aws-batch) configuration:
12+
### Seqera Platform compute environment
1513

1614
- **Provider**: AWS Batch
17-
- **Work directory**: An S3 bucket located in the same region as your AWS Batch compute resources
15+
- **Work directory**: S3 bucket in the same region as compute resources
1816
- **Enable Fusion Snapshots (beta)**
1917
- **Config mode**: Batch Forge
20-
- **Provisioning model**: Spot
21-
- **Instance types**: See recommended instance sizes below
18+
- **Provisioning model**: Spot
19+
- **Instance types**: See recommendations below
20+
21+
:::tip Configuration
22+
Fusion Snapshots work with sensible defaults (5 automatic retry attempts). For advanced configuration options like changing retry behavior or TCP handling, see the [Configuration Guide](configuration.md).
23+
:::
24+
25+
### (Seqera Enterprise only) Select an Amazon Linux 2023 ECS-optimized AMI
26+
27+
To obtain sufficient performance, Fusion Snapshots require instances with Amazon Linux 2023 (which ships with Linux Kernel 6.1), with an ECS Container-optimized AMI.
28+
29+
:::note
30+
Selecting a custom Amazon Linux 2023 ECS-optimized AMI is only required for compute environments in Seqera Enterprise deployments. Seqera Cloud AWS Batch compute environments use Amazon Linux 2023 AMIs by default.
31+
:::
32+
33+
To find the recommended AL2023 ECS-optimized AMI for your region, run the following (replace `eu-central-1` with your AWS region):
34+
35+
36+
```bash
37+
export REGION=eu-central-1
38+
aws ssm get-parameter --name "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended" --region $REGION
39+
```
40+
41+
The result for the `eu-central-1` region is similar to the following:
42+
43+
```json
44+
{
45+
  "Parameter": {
46+
    "Name": "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
47+
    "Type": "String",
48+
    "Value": "{\"ecs_agent_version\":\"1.88.0\",\"ecs_runtime_version\":\"Docker version 25.0.6\",\"image_id\":\"ami-0281c9a5cd9de63bd\",\"image_name\":\"al2023-ami-ecs-hvm-2023.0.20241115-kernel-6.1-x86_64\",\"image_version\":\"2023.0.20241115\",\"os\":\"Amazon Linux 2023\",\"schema_version\":1,\"source_image_name\":\"al2023-ami-minimal-2023.6.20241111.0-kernel-6.1-x86_64\"}",
49+
    "Version": 61,
50+
    "LastModifiedDate": "2024-11-18T17:08:46.926000+01:00",
51+
    "ARN": "arn:aws:ssm:eu-central-1::parameter/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
52+
    "DataType": "text"
53+
  }
}
54+
```
55+
56+
Note the `image_id` in your result (in this example, `ami-0281c9a5cd9de63bd`). Specify this ID in the **AMI ID** field under **Advanced options** when you create your Seqera compute environment.
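
If you only need the AMI ID, the lookup above can be narrowed to that single field. The following is a minimal sketch, not part of the committed documentation: the helper names `extract_image_id` and `recommended_ami_id` are hypothetical, and the `sed` expression assumes the compact JSON layout shown in the example response.

```bash
# extract_image_id (hypothetical helper): reads the parameter's JSON "Value"
# on stdin and prints only the image_id field. Uses sed so no JSON tooling is needed.
extract_image_id() {
  sed -n 's/.*"image_id":"\([^"]*\)".*/\1/p'
}

# recommended_ami_id (hypothetical helper): queries SSM for the given region.
# --query and --output are standard AWS CLI global options; with --output text
# the Value string is printed unescaped, ready for extract_image_id.
recommended_ami_id() {
  aws ssm get-parameter \
    --name "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended" \
    --region "$1" \
    --query 'Parameter.Value' \
    --output text | extract_image_id
}
```

Calling `recommended_ami_id eu-central-1` would then print just the AMI ID to paste into the **AMI ID** field.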
2257

23-
When Fusion Snapshots are enabled, the Nextflow Spot reclamation retry setting automatically defaults to `aws.batch.maxSpotAttempts = 5`.
58+
## Incremental snapshots
2459

25-
### EC2 instance selection guidelines
60+
Incremental snapshots are enabled by default on x86_64 instances, capturing only changed memory pages between checkpoints for faster operations and reduced data transfer.
61+
62+
## EC2 instance selection
63+
64+
AWS provides a **guaranteed 120-second reclamation window**. Choose instances that can transfer checkpoint data within this time frame. Checkpoint time is primarily determined by memory usage, though other factors like number of open file descriptors also contribute.
65+
66+
### Key guidelines
67+
68+
- Select instances with **guaranteed** network bandwidth (not "up to" values)
69+
- Maintain a **5:1 ratio** between memory (GiB) and network bandwidth (Gbps)
70+
- Prefer **NVMe storage** instances (`d` suffix: `c6id`, `r6id`, `m6id`)
71+
- Use **x86_64** instances for incremental snapshot support
72+
73+
### Recommended instance types
2674

2775
- Choose EC2 Spot instances with sufficient memory and network bandwidth to dump the cache of task intermediate files to S3 storage before AWS terminates an instance.
2876
- Select instances with guaranteed network bandwidth (not instances with bandwidth "up to" a maximum value).
2977
- Maintain a 5:1 ratio between memory (GiB) and network bandwidth (Gbps).
3078
- Recommended instance families: `c6id`, `r6id`, or `m6id` series instances work optimally with Fusion fast instance storage.
3179

32-
:::info Example
80+
:::info With incremental snapshots
3381
A c6id.8xlarge instance provides 64 GiB memory and 12.5 Gbps guaranteed network bandwidth. This configuration can transfer the entire memory contents to S3 in approximately 70 seconds, well within the 2-minute reclamation window.
3482

3583
Instances with memory:bandwidth ratios over 5:1 may not complete transfers before termination, potentially resulting in task failures.
3684
:::
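
To check a candidate instance against the 5:1 guideline, the ratio and a rough transfer-time floor can be computed from its memory and guaranteed bandwidth. This is an illustrative sketch: the function names are hypothetical, and the time is a line-rate lower bound (it assumes the full bandwidth is used for the whole transfer), so it is expected to be lower than the estimated snapshot times the documentation quotes.

```bash
# mem_bw_ratio MEM_GIB BW_GBPS: memory:bandwidth ratio, as in the
# "Memory:bandwidth ratio" column of the recommended instance table.
mem_bw_ratio() {
  awk -v m="$1" -v b="$2" 'BEGIN { printf "%.2f\n", m / b }'
}

# min_transfer_seconds MEM_GIB BW_GBPS: line-rate lower bound in whole seconds.
# 1 GiB = 2^30 bytes = 8.589934592 gigabits.
min_transfer_seconds() {
  awk -v m="$1" -v b="$2" 'BEGIN { printf "%.0f\n", m * 8.589934592 / b }'
}

# c6id.8xlarge: 64 GiB memory, 12.5 Gbps guaranteed bandwidth
mem_bw_ratio 64 12.5           # prints 5.12
min_transfer_seconds 64 12.5   # prints 44
```

The gap between this bound and the quoted estimate of about 70 seconds is the headroom to keep in mind when an instance sits close to the 5:1 ratio.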
3785

38-
#### Recommended instance types
3986

40-
| Instance type | Cores | Memory (GiB) | Network bandwidth (Gbps) | Memory:bandwidth ratio | Est. Snapshot time|
87+
| Instance type | Cores | Memory (GiB) | Network bandwidth (Gbps) | Memory:bandwidth ratio | Est. snapshot time|
4188
|----------------|-------|--------------|--------------------------|------------------------|-------------------|
4289
| c6id.4xlarge | 16 | 32 | 12.5 | 2.56:1 | ~45 seconds |
4390
| c6id.8xlarge | 32 | 64 | 12.5 | 5.12:1 | ~70 seconds |
@@ -47,42 +94,26 @@ Instances with memory:bandwidth ratios over 5:1 may not complete transfers befo
4794
| r6id.4xlarge | 16 | 128 | 12.5 | 10.24:1 | ~105 seconds |
4895
| m6id.8xlarge | 32 | 128 | 25 | 5.12:1 | ~70 seconds |
4996

50-
It's possible for a single job to request more resources than are available on a single instance. In this case, the job will wait indefinitely and never progress to running. To prevent this from occurring, you should set the maximum resources requested to below the size of a single instance listed above. This can be configured using the `process.resourceLimits` directive in your Nextflow configuration. For example, to limit a single process to fit within a c6id.8xlarge machine, you could set the following:
51-
52-
```groovy
53-
process.resourceLimits = [cpus: 32, memory: '60.GB']
54-
```
5597

56-
Note that if a process requests the maximum CPUs and memory in the table above, it will not be satisfiable by a single instance and therefore fail to be assigned to a machine.
98+
### Resource limits
5799

58-
### (Seqera Enterprise only) Select an Amazon Linux 2023 ECS-optimized AMI
100+
It's possible for a single job to request more resources than are available on a single instance. In this case, the job will wait indefinitely and never progress to running. To prevent this from occurring, you should set the maximum resources requested to below the size of a single instance listed above. This can be configured using the `process.resourceLimits` directive in your Nextflow configuration. For example, to limit a single process to fit within a `c6id.8xlarge` machine, you could set the following:
59101

60-
To obtain sufficient performance, Fusion Snapshots require instances with Amazon Linux 2023 (which ships with Linux Kernel 6.1), with an ECS Container-optimized AMI.
61102

62-
:::note
63-
Selecting a custom Amazon Linux 2023 ECS-optimized AMI is only required for compute environments in Seqera Enterprise deployments. Seqera Cloud AWS Batch compute environments use Amazon Linux 2023 AMIs by default.
64-
:::
103+
```groovy
104+
process.resourceLimits = [cpus: 32, memory: '60.GB']
105+
```
65106

66-
To find the recommended AL2023 ECS-optimized AMI for your region, run the following (replace `eu-central-1` with your AWS region):
107+
## Best practices
67108

68-
```bash
69-
export REGION=eu-central-1
70-
aws ssm get-parameter --name "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended" --region $REGION
71-
```
109+
- **Instance selection**: Prefer instances with 5:1 or lower memory:bandwidth ratio
110+
- **Architecture**: Use x86_64 instances to enable incremental snapshots
72111

73-
The result for the `eu-central-1` region is similar to the following:
112+
## Troubleshooting
74113

75-
```bash
76-
{
77-
"Parameter": {
78-
"Name": "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
79-
"Type": "String",
80-
"Value": "{\"ecs_agent_version\":\"1.88.0\",\"ecs_runtime_version\":\"Docker version 25.0.6\",\"image_id\":\"ami-0281c9a5cd9de63bd\",\"image_name\":\"al2023-ami-ecs-hvm-2023.0.20241115-kernel-6.1-x86_64\",\"image_version\":\"2023.0.20241115\",\"os\":\"Amazon Linux 2023\",\"schema_version\":1,\"source_image_name\":\"al2023-ami-minimal-2023.6.20241111.0-kernel-6.1-x86_64\"}",
81-
"Version": 61,
82-
"LastModifiedDate": "2024-11-18T17:08:46.926000+01:00",
83-
"ARN": "arn:aws:ssm:eu-central-1::parameter/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
84-
"DataType": "text"
85-
}
86-
```
114+
- **Exit code 175**: Dump failed, likely due to timeout. Reduce memory usage or increase bandwidth.
115+
- **Exit code 176**: Restore failed. Check logs and verify checkpoint data integrity.
116+
- **Long checkpoint times**: Review instance network bandwidth and consider x86_64 instances, which support incremental snapshots.
117+
- **State stuck in DUMPING**: Previous checkpoint exceeded reclamation window.
87118

88-
Note the `image_id` in your result (in this example, `ami-0281c9a5cd9de63bd`). Specify this ID in the **AMI ID** field under **Advanced options** when you create your Seqera compute environment.
119+
For detailed troubleshooting, see the [Troubleshooting Guide](troubleshooting.md).
Lines changed: 128 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,128 @@
1+
---
2+
title: Fusion Snapshots Configuration
3+
description: "Advanced configuration options for Fusion Snapshots"
4+
date: "22 Nov 2024"
5+
tags: [fusion, snapshot, configuration, nextflow]
6+
---
7+
8+
Fusion Snapshots are designed with sensible defaults and typically require no additional configuration. The settings below are provided for edge cases and advanced use scenarios.
9+
10+
:::note
11+
You likely don't need to change these settings. Fusion Snapshots work out of the box with optimal defaults for most workloads.
12+
:::
13+
14+
## Spot reclamation retries
15+
16+
Control how many times Nextflow automatically retries a task after Spot instance reclamation.
17+
18+
**Default with Fusion Snapshots enabled**: `5` (automatic retries on Spot reclamation)
19+
**Default without Fusion Snapshots**: `0` (no automatic retries)
20+
21+
:::info
22+
When Fusion Snapshots are enabled, Nextflow automatically sets `maxSpotAttempts = 5` to enable automatic retry on Spot reclamation. This allows the checkpoint to be restored on a new instance after reclamation.
23+
:::
24+
25+
### AWS Batch
26+
27+
```groovy
28+
aws.batch.maxSpotAttempts = 10 // Increase retries beyond the default of 5
29+
// aws.batch.maxSpotAttempts = 0 // Alternative: disable automatic retries
30+
```
31+
32+
### Google Batch
33+
34+
```groovy
35+
google.batch.maxSpotAttempts = 10 // Increase retries beyond the default of 5
36+
// google.batch.maxSpotAttempts = 0 // Alternative: disable automatic retries
37+
```
38+
39+
**When to customize**:
40+
- Increase above 5 if you expect frequent Spot reclamations
41+
- Set to 0 if you want to handle retries only through error strategies
42+
- Most users should keep the default (5)
43+
44+
## Error retry strategy
45+
46+
Configure how Nextflow handles checkpoint failures. This is the recommended approach for retry logic.
47+
48+
```groovy
49+
process {
50+
    maxRetries = 2
51+
    errorStrategy = {
52+
        if (task.exitStatus == 175) {
53+
            return 'retry' // Retry on checkpoint dump failure
54+
        } else {
55+
            return 'terminate'
56+
        }
57+
    }
58+
}
59+
```
60+
61+
**Exit codes**:
62+
- `175`: Checkpoint dump failed
63+
- `176`: Restore failed
64+
65+
**Why this is better**: An error strategy gives you fine-grained control over retry behavior per exit code, whereas `maxSpotAttempts` retries after every Spot reclamation regardless of the cause.
66+
67+
## TCP connection handling
68+
69+
Control how TCP connections are handled during checkpoint operations.
70+
71+
**Default**: `established` (preserve TCP connections)
72+
73+
```groovy
74+
process.containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close'
75+
```
76+
77+
**Options**:
78+
- `established`: Preserve TCP connections (default, for plain TCP)
79+
- `close`: Close all TCP connections during checkpoint
80+
81+
**When to use**: Set to `close` if your application uses SSL/TLS connections (HTTPS, SSH, etc.), as CRIU cannot preserve encrypted connections.
82+
83+
## Debug logging
84+
85+
Enable detailed logging for troubleshooting.
86+
87+
**Default**: `WARN` (warnings and errors only)
88+
89+
```groovy
90+
process.containerOptions = '-e FUSION_SNAPSHOT_LOG_LEVEL=debug'
91+
```
92+
93+
**Log levels**:
94+
- `ERROR`: Only critical errors
95+
- `WARN`: Warnings and errors (default)
96+
- `INFO`: General informational messages
97+
- `DEBUG`: Detailed debug information
98+
99+
**When to use**: Only for troubleshooting checkpoint issues. Debug logs are verbose and may impact performance.
100+
101+
## Resource limits
102+
103+
Prevent tasks from requesting resources that exceed available capacity.
104+
105+
```groovy
106+
// AWS Batch example
107+
process.resourceLimits = [cpus: 32, memory: '60.GB']
108+
109+
// Google Batch example (more conservative for 30s window)
110+
process.resourceLimits = [cpus: 16, memory: '20.GB']
111+
```
112+
113+
**When to use**:
114+
- Prevent jobs from becoming unschedulable
115+
- Ensure tasks can checkpoint within reclamation windows
116+
- Enforce organization-wide resource policies
117+
118+
See [AWS Batch](aws.md#best-practices) or [Google Batch](gcp.md#best-practices) for recommended memory limits.
119+
120+
## Summary
121+
122+
Most users never need to modify these settings. Fusion Snapshots are designed to work optimally with default configuration. Only adjust these settings if:
123+
124+
- You have specific organizational policies
125+
- You're experiencing issues with the default behavior
126+
- You have edge case requirements not covered by defaults
127+
128+
For most troubleshooting scenarios, focus on task memory usage and instance selection rather than configuration changes.

fusion_docs/guide/snapshots/gcp.md

Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,51 @@
1+
---
2+
title: Fusion Snapshots for Google Batch
3+
description: "Fusion Snapshots configuration and best practices for Google Batch"
4+
date: "21 Nov 2024"
5+
tags: [fusion, storage, compute, snapshot, gcp, google, batch]
6+
---
7+
8+
Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on Google Batch preemptible instances. When a preemption occurs, the task is checkpointed and restored on a new instance.
9+
10+
:::warning Short and variable reclamation window
11+
Google Batch provides **up to 30 seconds** of notice before instance termination. This window is not guaranteed and can be shorter, and it is significantly shorter than AWS Batch's guaranteed 120 seconds. Careful instance selection and conservative memory planning are critical for successful checkpoints.
12+
:::
13+
14+
## Requirements
15+
16+
### Seqera Platform compute environment
17+
18+
- **Provider**: Google Batch
19+
- **Work directory**: GCS bucket in the same region as compute resources
20+
- **Enable Fusion Snapshots (beta)**
21+
- **Provisioning model**: Spot
22+
23+
:::tip Configuration
24+
Fusion Snapshots work with sensible defaults (5 automatic retry attempts). For advanced configuration options like changing retry behavior or TCP handling, see the [Configuration Guide](configuration.md).
25+
:::
26+
27+
## Incremental snapshots
28+
29+
Incremental snapshots are enabled by default on x86_64 instances, capturing only changed memory pages between checkpoints. This is particularly beneficial for Google Batch's shorter reclamation window.
30+
31+
### Key guidelines
32+
33+
- Use **x86_64** instances (incremental snapshots enabled by default)
34+
35+
36+
### Resource limits
37+
38+
A single job can request more resources than are available on a single instance, in which case it waits indefinitely and never progresses to running. To prevent this, cap the resources a process can request with the `process.resourceLimits` directive in your Nextflow configuration:
39+
40+
41+
42+
```groovy
43+
process.resourceLimits = [cpus: 16, memory: '20.GB']
44+
```
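
When choosing a memory limit for a short reclamation window, it can help to estimate how much memory could be transferred at all in that time. The sketch below is an assumption-laden upper bound, not a Seqera recommendation: `max_mem_gib` is a hypothetical helper, the 10 Gbps figure is an illustrative bandwidth rather than a value from this documentation, and the calculation assumes the full window and full line rate are available, which Google Batch does not guarantee.

```bash
# max_mem_gib WINDOW_SECONDS BW_GBPS: upper bound on the memory (GiB) that can
# be transferred within the window at line rate.
# 1 GiB = 2^30 bytes = 8.589934592 gigabits.
max_mem_gib() {
  awk -v w="$1" -v b="$2" 'BEGIN { printf "%.1f\n", w * b / 8.589934592 }'
}

# 30-second window at an assumed 10 Gbps
max_mem_gib 30 10   # prints 34.9
```

Because the window can be shorter than 30 seconds and real transfers rarely reach line rate, a limit well below this bound, such as the `20.GB` used above, leaves a safety margin.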
45+
46+
## Troubleshooting
47+
48+
- **Exit code 175**: Dump failed, most likely because the task used more memory than could be dumped before the reclamation window expired.
49+
- **Exit code 176**: Restore failed. Check logs and verify checkpoint data integrity.
50+
51+
For detailed troubleshooting, see the [Troubleshooting Guide](troubleshooting.md).
