Fusion Snapshots enable you to create a checkpoint of a running Nextflow process and restore it on a different machine. Leveraging the Fusion file system, you can move the task between different instances connected to the same S3 bucket.
8
+
Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on AWS Batch Spot instances. When a Spot instance interruption occurs, AWS provides a **guaranteed 120-second warning window** to checkpoint and save the task state before the instance terminates.
9
9
10
-
More specifically, the first use case for this feature is for Seqera Platform users to leverage AWS Spot instances without restarting an entire task from scratch if an instance is terminated. When a Spot instance interruption occurs, the task is restored from the last checkpoint on a new AWS instance, saving time and computational resources.
Fusion Snapshots v1 requires the following [Seqera compute environment](https://docs.seqera.io/platform-cloud/compute-envs/aws-batch) configuration:
12
+
### Seqera Platform compute environment
15
13
16
14
- **Provider**: AWS Batch
17
-
- **Work directory**: An S3 bucket located in the same region as your AWS Batch compute resources
15
+
- **Work directory**: S3 bucket in the same region as compute resources
18
16
- **Enable Fusion Snapshots (beta)**
19
17
- **Config mode**: Batch Forge
20
-
- **Provisioning model**: Spot
21
-
- **Instance types**: See recommended instance sizes below
18
+
- **Provisioning model**: Spot
19
+
- **Instance types**: See recommendations below
20
+
21
+
:::tip Configuration
22
+
Fusion Snapshots work with sensible defaults (5 automatic retry attempts). For advanced configuration options like changing retry behavior or TCP handling, see the [Configuration Guide](configuration.md).
23
+
:::
24
+
25
+
### (Seqera Enterprise only) Select an Amazon Linux 2023 ECS-optimized AMI
26
+
27
+
For sufficient performance, Fusion Snapshots require instances that run an ECS-optimized Amazon Linux 2023 AMI (Amazon Linux 2023 ships with Linux kernel 6.1).
28
+
29
+
:::note
30
+
Selecting a custom Amazon Linux 2023 ECS-optimized AMI is only required for compute environments in Seqera Enterprise deployments. Seqera Cloud AWS Batch compute environments use Amazon Linux 2023 AMIs by default.
31
+
:::
32
+
33
+
To find the recommended AL2023 ECS-optimized AMI for your region, run the following (replace `eu-central-1` with your AWS region):
"Value": "{\"ecs_agent_version\":\"1.88.0\",\"ecs_runtime_version\":\"Docker version 25.0.6\",\"image_id\":\"ami-0281c9a5cd9de63bd\",\"image_name\":\"al2023-ami-ecs-hvm-2023.0.20241115-kernel-6.1-x86_64\",\"image_version\":\"2023.0.20241115\",\"os\":\"Amazon Linux 2023\",\"schema_version\":1,\"source_image_name\":\"al2023-ami-minimal-2023.6.20241111.0-kernel-6.1-x86_64\"}",
Note the `image_id` in your result (in this example, `ami-0281c9a5cd9de63bd`). Specify this ID in the **AMI ID** field under **Advanced options** when you create your Seqera compute environment.
22
57
23
-
When Fusion Snapshots are enabled, the Nextflow Spot reclamation retry setting automatically defaults to `aws.batch.maxSpotAttempts = 5`.
58
+
## Incremental snapshots
24
59
25
-
### EC2 instance selection guidelines
60
+
Incremental snapshots are enabled by default on x86_64 (amd64) instances. They capture only the memory pages that changed between checkpoints, which speeds up checkpoint operations and reduces data transfer.
61
+
62
+
## EC2 instance selection
63
+
64
+
AWS provides a **guaranteed 120-second reclamation window**. Choose instances that can transfer checkpoint data within this time frame. Checkpoint time is primarily determined by memory usage, though other factors like number of open file descriptors also contribute.
- Use **x86_64** instances for incremental snapshot support
72
+
73
+
### Recommended instance types
26
74
27
75
- Choose EC2 Spot instances with sufficient memory and network bandwidth to dump the cache of task intermediate files to S3 storage before AWS terminates an instance.
28
76
- Select instances with guaranteed network bandwidth (not instances with bandwidth "up to" a maximum value).
29
77
- Maintain a 5:1 ratio between memory (GiB) and network bandwidth (Gbps).
30
78
- Recommended instance families: `c6id`, `r6id`, and `m6id`. Instances in these series work optimally with Fusion fast instance storage.
31
79
32
-
:::info Example
80
+
:::info With incremental snapshots
33
81
A c6id.8xlarge instance provides 64 GiB memory and 12.5 Gbps guaranteed network bandwidth. This configuration can transfer the entire memory contents to S3 in approximately 70 seconds, well within the 2-minute reclamation window.
34
82
35
83
Instances with memory:bandwidth ratios over 5:1 may not complete transfers before termination, potentially resulting in task failures.
36
84
:::
37
85
38
-
#### Recommended instance types
39
86
40
-
| Instance type | Cores | Memory (GiB) | Network bandwidth (Gbps) | Memory:bandwidth ratio | Est. Snapshot time|
87
+
| Instance type | Cores | Memory (GiB) | Network bandwidth (Gbps) | Memory:bandwidth ratio | Est. snapshot time|
It's possible for a single job to request more resources than are available on a single instance. In this case, the job will wait indefinitely and never progress to running. To prevent this from occurring, you should set the maximum resources requested to below the size of a single instance listed above. This can be configured using the `process.resourceLimits` directive in your Nextflow configuration. For example, to limit a single process to fit within a c6id.8xlarge machine, you could set the following:
Note that if a process requests the maximum CPUs and memory in the table above, it will not be satisfiable by a single instance and therefore fail to be assigned to a machine.
98
+
### Resource limits
57
99
58
-
### (Seqera Enterprise only) Select an Amazon Linux 2023 ECS-optimized AMI
100
+
It's possible for a single job to request more resources than are available on a single instance. In this case, the job will wait indefinitely and never progress to running. To prevent this from occurring, you should set the maximum resources requested to below the size of a single instance listed above. This can be configured using the `process.resourceLimits` directive in your Nextflow configuration. For example, to limit a single process to fit within a `c6id.8xlarge` machine, you could set the following:
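
One way to write such a limit is sketched below. The values are assumptions rather than figures taken from this page: 32 CPUs matches the published vCPU count of a `c6id.8xlarge`, and the 60 GB memory cap is an illustrative choice that leaves headroom below the instance's 64 GiB.

```groovy
// nextflow.config
// Illustrative values (assumptions): cap each task at 32 CPUs and 60 GB so that
// any single request fits on one c6id.8xlarge (32 vCPUs, 64 GiB of memory).
process.resourceLimits = [cpus: 32, memory: 60.GB]
```

With a limit like this in place, Nextflow reduces any task request that exceeds a limit down to that limit before submitting the job. The `resourceLimits` directive requires Nextflow 24.04.0 or later.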
59
101
60
-
To obtain sufficient performance, Fusion Snapshots require instances with Amazon Linux 2023 (which ships with Linux Kernel 6.1), with an ECS Container-optimized AMI.
61
102
62
-
:::note
63
-
Selecting a custom Amazon Linux 2023 ECS-optimized AMI is only required for compute environments in Seqera Enterprise deployments. Seqera Cloud AWS Batch compute environments use Amazon Linux 2023 AMIs by default.
"Value": "{\"ecs_agent_version\":\"1.88.0\",\"ecs_runtime_version\":\"Docker version 25.0.6\",\"image_id\":\"ami-0281c9a5cd9de63bd\",\"image_name\":\"al2023-ami-ecs-hvm-2023.0.20241115-kernel-6.1-x86_64\",\"image_version\":\"2023.0.20241115\",\"os\":\"Amazon Linux 2023\",\"schema_version\":1,\"source_image_name\":\"al2023-ami-minimal-2023.6.20241111.0-kernel-6.1-x86_64\"}",
- **State stuck in DUMPING**: Previous checkpoint exceeded reclamation window.
87
118
88
-
Note the `image_id`in your result (in this example, `ami-0281c9a5cd9de63bd`). Specify this ID in the **AMI ID** field under **Advanced options** when you create your Seqera compute environment.
119
+
For detailed troubleshooting, see the [Troubleshooting Guide](troubleshooting.md).
description: "Advanced configuration options for Fusion Snapshots"
date: "22 Nov 2024"
tags: [fusion, snapshot, configuration, nextflow]
---

Fusion Snapshots are designed with sensible defaults and typically require no additional configuration. The settings below are provided for edge cases and advanced use scenarios.

:::note
You likely don't need to change these settings. Fusion Snapshots work out of the box with optimal defaults for most workloads.
:::

## Spot reclamation retries

Control how many times Nextflow automatically retries a task after spot instance reclamation.

**Default with Fusion Snapshots enabled**: `5` (automatic retries on spot reclamation)
**Default without Fusion Snapshots**: `0` (no automatic retries)

:::info
When Fusion Snapshots are enabled, Nextflow automatically sets `maxSpotAttempts = 5` to enable automatic retry on spot reclamation. This allows the checkpoint to be restored on a new instance after reclamation.
:::

### AWS Batch

```groovy
aws.batch.maxSpotAttempts = 10 // Increase retries beyond default of 5
```
- Increase above 5 if you expect frequent spot reclamations
- Set to 0 if you want to handle retries only through error strategies
- Most users should keep the default (5)

## Error retry strategy

Configure how Nextflow handles checkpoint failures. This is the recommended approach for retry logic.

```groovy
process {
    maxRetries = 2
    errorStrategy = {
        if (task.exitStatus == 175) {
            return 'retry' // Retry on checkpoint dump failure
        } else {
            return 'terminate'
        }
    }
}
```

**Exit codes**:
- `175`: Checkpoint dump failed
- `176`: Restore failed

**Why this is better**: Using an error strategy gives you fine-grained control over retry behavior per exit code, while `maxSpotAttempts` retries all spot reclamations regardless of the cause.
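
As an illustration of that per-exit-code control, the sketch below retries on both exit codes listed above and terminates on anything else. Whether a restore failure (`176`) is worth retrying is an assumption made for the example, not a recommendation from this guide.

```groovy
process {
    maxRetries = 2
    // Assumption for illustration: treat both Fusion Snapshots exit codes
    // (175 = checkpoint dump failed, 176 = restore failed) as retryable.
    errorStrategy = { task.exitStatus in [175, 176] ? 'retry' : 'terminate' }
}
```

The closure is evaluated for each failed task, so other exit codes still terminate the run as in the original example.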

## TCP connection handling

Control how TCP connections are handled during checkpoint operations.
- Ensure tasks can checkpoint within reclamation windows
- Enforce organization-wide resource policies

See [AWS Batch](aws.md#best-practices) or [Google Batch](gcp.md#best-practices) for recommended memory limits.

## Summary

Most users never need to modify these settings. Fusion Snapshots are designed to work optimally with default configuration. Only adjust these settings if:

- You have specific organizational policies
- You're experiencing issues with the default behavior
- You have edge case requirements not covered by defaults

For most troubleshooting scenarios, focus on task memory usage and instance selection rather than configuration changes.
Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on Google Batch preemptible instances. When a preemption occurs, the task is checkpointed and restored on a new instance.

:::warning Short and variable reclamation window
Google Batch provides **up to 30 seconds** before instance termination (not guaranteed, and possibly less), significantly shorter than AWS Batch's guaranteed 120 seconds. Careful instance selection and conservative memory planning are critical for successful checkpoints.
:::

## Requirements

### Seqera Platform compute environment

- **Provider**: Google Batch
- **Work directory**: GCS bucket in the same region as compute resources
- **Enable Fusion Snapshots (beta)**
- **Provisioning model**: Spot

:::tip Configuration
Fusion Snapshots work with sensible defaults (5 automatic retry attempts). For advanced configuration options like changing retry behavior or TCP handling, see the [Configuration Guide](configuration.md).
:::

## Incremental snapshots

Incremental snapshots are enabled by default on x86_64 instances, capturing only changed memory pages between checkpoints. This is particularly beneficial for Google Batch's shorter reclamation window.

### Key guidelines

- Use **x86_64** instances (incremental snapshots enabled by default)

### Resource limits

It's possible for a single job to request more resources than are available on a single instance. Maximum resource requests can be capped using the `process.resourceLimits` directive in your Nextflow configuration.