docs: Fusion Snapshots incremental dumps and GCP support #929
Merged
34 commits
- `64ed543` docs: Fusion Snapshots incremental dumps and GCP support (fntlnz)
- `631232c` Apply changes from review (christopher-hakkaart)
- `5f922f5` Merge branch 'master' into lf/fusion-snapshots (christopher-hakkaart)
- `4c04118` Make changes to troubleshooting, and reorder sections (christopher-hakkaart)
- `23155a6` Move last sections over (christopher-hakkaart)
- `93a224d` Merge branch 'lf/fusion-snapshots' of github.com:seqeralabs/docs into… (christopher-hakkaart)
- `580567d` Revise headings (christopher-hakkaart)
- `9631d9a` Add connective links (christopher-hakkaart)
- `a55bcfb` Reframe contact section (christopher-hakkaart)
- `248faf0` Fix link (christopher-hakkaart)
- `bdb36df` Fix link (christopher-hakkaart)
- `dc89884` Add revisions (christopher-hakkaart)
- `552a6bb` Merge branch 'master' into lf/fusion-snapshots (christopher-hakkaart)
- `a3d6295` Code exit codes (christopher-hakkaart)
- `fecee51` Merge branch 'lf/fusion-snapshots' of github.com:seqeralabs/docs into… (christopher-hakkaart)
- `1233d2b` Add extra heading (christopher-hakkaart)
- `d807d97` Split admonition into two sections (christopher-hakkaart)
- `ff8bfd7` Add clean up section (christopher-hakkaart)
- `f5dbd0a` Update fusion_docs/guide/snapshots/gcp.md (MichaelTansiniSeqera)
- `05b214a` Update fusion_docs/guide/snapshots/index.md (MichaelTansiniSeqera)
- `b70f909` Update fusion_docs/guide/snapshots/gcp.md (MichaelTansiniSeqera)
- `7995499` Update fusion_docs/guide/snapshots/gcp.md (christopher-hakkaart)
- `2c47ce5` Merge branch 'master' into lf/fusion-snapshots (christopher-hakkaart)
- `b77bc0f` Update fusion_docs/troubleshooting.md (christopher-hakkaart)
- `11ac471` Update fusion_docs/troubleshooting.md (justinegeffen)
- `86d4d2f` Merge branch 'master' into lf/fusion-snapshots (justinegeffen)
- `2a767d3` Update aws.md (justinegeffen)
- `35c302a` Update configuration.md (justinegeffen)
- `1f0e367` Update gcp.md (justinegeffen)
- `3cfc2b0` Revise Fusion Snapshots documentation (justinegeffen)
- `0dc30d7` Update troubleshooting.md (justinegeffen)
- `2adde32` Update fusion_docs/guide/snapshots/gcp.md (justinegeffen)
- `ae59639` Update troubleshooting.md (justinegeffen)
- `912aad7` Apply suggestion from @justinegeffen (justinegeffen)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,100 @@ | ||
| --- | ||
| title: AWS Batch | ||
| description: "Fusion Snapshots configuration and best practices for AWS Batch" | ||
| date created: "2024-11-21" | ||
| last updated: "2025-12-19" | ||
| tags: [fusion, fusion-snapshots, storage, compute, snapshot, aws, batch] | ||
| --- | ||
|
|
||
| Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on AWS Batch Spot instances. When a Spot instance interruption occurs, AWS provides a guaranteed 120-second warning window to checkpoint and save the task state before the instance terminates. | ||
|
|
||
| ## Seqera Platform compute environment requirements | ||
|
|
||
| Fusion Snapshots require the following Seqera Platform compute environment configuration: | ||
|
|
||
| - **Provider:** AWS Batch | ||
| - **Work directory:** S3 bucket in the same region as compute resources | ||
| - **Fusion Snapshots (beta):** Enabled | ||
| - **Config mode:** Batch Forge | ||
| - **Provisioning model:** Spot | ||
| - **AMI:** See [Selecting an AMI](#selecting-an-ami) for details | ||
| - **Instance type:** See [Selecting an EC2 instance](#selecting-an-ec2-instance) for details | ||
|
|
||
| :::tip | ||
| Fusion Snapshots work with sensible defaults (e.g., 5 automatic retry attempts). For configuration options, see [Advanced configuration](./configuration.md). | ||
| ::: | ||
|
|
||
| ### Selecting an AMI | ||
|
|
||
| Fusion Snapshots require instances running Amazon Linux 2023 (which ships with Linux Kernel 6.1) and an ECS container-optimized AMI for optimal performance. | ||
|
|
||
| #### Seqera Cloud | ||
|
|
||
| Seqera Cloud AWS Batch compute environments use an ECS container-optimized AMI by default. No additional AMI configuration is required. | ||
|
|
||
| #### Seqera Enterprise | ||
|
|
||
| Specify an Amazon Linux 2023 ECS-optimized AMI for your region when creating your compute environment. | ||
|
|
||
| To find the recommended AMI: | ||
|
|
||
| 1. Retrieve the recommended ECS-optimized AMI metadata from AWS Systems Manager Parameter Store: | ||
|
|
||
| ```bash | ||
| export REGION=<AWS_REGION> | ||
| aws ssm get-parameter --name "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended" --region "$REGION" | ||
| ``` | ||
|
|
||
| Replace `<AWS_REGION>` with your AWS region (for example, `eu-central-1`). | ||
|
|
||
| The output for the `eu-central-1` region is similar to the following: | ||
|
|
||
| ```json | ||
| { | ||
| "Parameter": { | ||
| "Name": "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended", | ||
| "Type": "String", | ||
| "Value": "{\"ecs_agent_version\":\"1.88.0\",\"ecs_runtime_version\":\"Docker version 25.0.6\",\"image_id\":\"ami-0281c9a5cd9de63bd\",\"image_name\":\"al2023-ami-ecs-hvm-2023.0.20241115-kernel-6.1-x86_64\",\"image_version\":\"2023.0.20241115\",\"os\":\"Amazon Linux 2023\",\"schema_version\":1,\"source_image_name\":\"al2023-ami-minimal-2023.6.20241111.0-kernel-6.1-x86_64\"}", | ||
| "Version": 61, | ||
| "LastModifiedDate": "2024-11-18T17:08:46.926000+01:00", | ||
| "ARN": "arn:aws:ssm:eu-central-1::parameter/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended", | ||
| "DataType": "text" | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| 1. Identify the `image_id` in your output (e.g., `ami-0281c9a5cd9de63bd` in the above example) and set it in the **Advanced options > AMI ID** field when you create your Seqera compute environment. | ||
|
|
||
| ## Selecting an EC2 instance | ||
|
|
||
| AWS provides a guaranteed 120-second reclamation window. Select instance types that can transfer checkpoint data within this timeframe. Checkpoint time is primarily determined by memory usage. Other factors like the number of open file descriptors also affect performance. | ||
|
|
||
| When you select an EC2 instance: | ||
|
|
||
| - Select instances with guaranteed network bandwidth, not "up to" values. | ||
| - Keep the ratio between memory (GiB) and network bandwidth (Gbps) at approximately 5:1 or lower. | ||
| - Prefer NVMe storage instances (those with a `d` suffix: `c6id`, `r6id`, `m6id`). | ||
| - Use `x86_64` instances for [incremental snapshots](./index.md#incremental-snapshots). | ||
|
|
||
| For example, a `c6id.8xlarge` instance provides 64 GiB of memory and 12.5 Gbps of guaranteed network bandwidth. This configuration can transfer the entire memory contents to S3 in approximately 70 seconds. Instances with memory:bandwidth ratios substantially above 5:1 may not complete transfers before termination and risk task failures. | ||
|
|
||
| | Instance type | Cores | Memory (GiB) | Network bandwidth (Gbps) | Memory:bandwidth ratio | Estimated snapshot time | | ||
| |----------------|-------|--------------|--------------------------|------------------------|-------------------------| | ||
| | `c6id.4xlarge` | 16 | 32 | 12.5 | 2.56:1 | ~45 seconds | | ||
| | `c6id.8xlarge` | 32 | 64 | 12.5 | 5.12:1 | ~70 seconds | | ||
| | `r6id.2xlarge` | 8 | 16 | 12.5 | 1.28:1 | ~20 seconds | | ||
| | `m6id.4xlarge` | 16 | 64 | 12.5 | 5.12:1 | ~70 seconds | | ||
| | `c6id.12xlarge`| 48 | 96 | 18.75 | 5.12:1 | ~70 seconds | | ||
| | `r6id.4xlarge` | 16 | 128 | 12.5 | 10.24:1 | ~105 seconds | | ||
| | `m6id.8xlarge` | 32 | 128 | 25 | 5.12:1 | ~70 seconds | | ||
|
|
||
| :::info | ||
| [Incremental snapshots](./index.md#incremental-snapshots) are enabled by default on `x86_64` instances. | ||
| ::: | ||
|
|
||
| ## Resource limits | ||
|
|
||
| A single job can request more resources than are available on a single instance. To prevent this, set resource limits using the `process.resourceLimits` directive in your Nextflow configuration. See [Resource limits](./configuration.md#resource-limits) for more information. | ||
|
|
||
| ## Manual cleanup | ||
|
|
||
| The `/fusion` folder in object storage may need manual cleanup. Administrators should verify Fusion has properly cleaned up and remove the folder if necessary. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,136 @@ | ||
| --- | ||
| title: Advanced configuration | ||
| description: "Advanced configuration options for Fusion Snapshots" | ||
| date created: "2024-11-29" | ||
| last updated: "2025-12-19" | ||
| tags: [fusion, fusion-snapshots, snapshot, configuration, nextflow] | ||
| --- | ||
|
|
||
| Fusion Snapshots work optimally with default configuration for most workloads. You typically do not need to modify these settings unless you have specific organizational policies, experience issues with default behavior, or have edge case requirements. | ||
|
|
||
| :::tip | ||
| For troubleshooting, focus on task memory usage and instance selection before adjusting these advanced configuration options. See [Troubleshooting](../../troubleshooting.md) for more information. | ||
| ::: | ||
|
|
||
| ## Retry handling | ||
|
|
||
| When Spot instances are reclaimed, you can configure how Nextflow retries the tasks. There are two approaches: | ||
|
|
||
| - [Automatic retries with `maxSpotAttempts`](#automatic-retries-with-maxspotattempts) | ||
| - [Fine-grained retries with `errorStrategy`](#fine-grained-retries-with-errorstrategy) | ||
|
|
||
| ### Automatic retries with `maxSpotAttempts` | ||
|
|
||
| The simplest approach uses `maxSpotAttempts` to automatically retry any task that fails due to Spot reclamation, regardless of the specific failure reason. When you enable Fusion Snapshots, Nextflow automatically sets `maxSpotAttempts = 5`. This allows the checkpoint to be restored on a new instance after reclamation up to 5 times. | ||
|
|
||
| **Increase retries** | ||
|
|
||
| If you experience frequent Spot reclamations, increase `maxSpotAttempts` above `5`: | ||
|
|
||
| - AWS Batch: | ||
|
|
||
| ```groovy | ||
| aws.batch.maxSpotAttempts = 10 | ||
| ``` | ||
|
|
||
| - Google Cloud Batch: | ||
|
|
||
| ```groovy | ||
| google.batch.maxSpotAttempts = 10 | ||
| ``` | ||
|
|
||
| **Disable retries** | ||
|
|
||
| To disable automatic retries, set `maxSpotAttempts = 0`: | ||
|
|
||
| - AWS Batch: | ||
|
|
||
| ```groovy | ||
| aws.batch.maxSpotAttempts = 0 | ||
| ``` | ||
|
|
||
| - Google Cloud Batch: | ||
|
|
||
| ```groovy | ||
| google.batch.maxSpotAttempts = 0 | ||
| ``` | ||
|
|
||
| ### Fine-grained retries with `errorStrategy` | ||
|
|
||
| For fine-grained control of retries, configure your Nextflow [`errorStrategy`](https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy) to implement retry logic based on specific checkpoint failure types. This allows you to handle failure scenarios differently (e.g., retry checkpoint dump failures but not restore failures). | ||
|
|
||
| To configure, set `maxSpotAttempts = 0` and add an [`errorStrategy`](https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy) to your process configuration. For example: | ||
|
|
||
| ```groovy | ||
| process { | ||
| maxRetries = 2 | ||
| errorStrategy = { | ||
| if (task.exitStatus == 175) { | ||
| return 'retry' // Retry checkpoint dump failures | ||
| } else { | ||
| return 'terminate' // Don't retry other failures | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| **Exit codes**: | ||
|
|
||
| - `175`: Checkpoint dump failed. The snapshot could not be saved (e.g., insufficient memory, I/O errors). | ||
| - `176`: Checkpoint restore failed. The snapshot could not be restored on the new instance. | ||
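
The two exit codes above can be handled differently within a single `errorStrategy` closure. The following is a minimal sketch rather than a recommended policy. It assumes `maxSpotAttempts = 0` is already set for your provider, and the `maxRetries` value, the attempt threshold, and the decision to fall back to `finish` are illustrative assumptions:

```groovy
process {
    // Illustrative upper bound on retries per task
    maxRetries = 3
    errorStrategy = {
        if (task.exitStatus == 175) {
            // Checkpoint dump failed: retry the task
            return 'retry'
        } else if (task.exitStatus == 176) {
            // Checkpoint restore failed: retry on the first two attempts,
            // then let already-submitted tasks complete and stop the run
            return task.attempt <= 2 ? 'retry' : 'finish'
        }
        // Any other failure: stop the run
        return 'terminate'
    }
}
```

`retry`, `finish`, and `terminate` are standard Nextflow error strategies, and `task.attempt` starts at `1` for the first execution of a task.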
|
|
||
| **Configuration options**: | ||
|
|
||
| See [`errorStrategy`](https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy) for more configuration options. | ||
|
|
||
| ## TCP connection handling | ||
|
|
||
| By default, Fusion Snapshots use `established` mode to preserve TCP connections during checkpoint operations. This works well for plain TCP connections. If your application uses SSL/TLS connections (HTTPS, SSH, etc.), you need to configure TCP close mode because CRIU cannot preserve encrypted connections. | ||
|
|
||
| To close all TCP connections during checkpoint operations, set: | ||
|
|
||
| ```groovy | ||
| process.containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close' | ||
| ``` | ||
|
|
||
| **Options:** | ||
|
|
||
| - `established`: Preserve TCP connections (default). | ||
| - `close`: Close all TCP connections during checkpoint. | ||
|
|
||
| ## Debug logging | ||
|
|
||
| By default, Fusion Snapshots use `WARN` level logging (warnings and errors only). If you are troubleshooting checkpoint issues, you can enable more detailed logging to help diagnose problems. | ||
|
|
||
| To enable debug logging, set: | ||
|
|
||
| ```groovy | ||
| process.containerOptions = '-e FUSION_SNAPSHOT_LOG_LEVEL=debug' | ||
| ``` | ||
|
|
||
| **Log levels**: | ||
|
|
||
| - `ERROR`: Only critical errors | ||
| - `WARN`: Warnings and errors (default) | ||
| - `INFO`: General informational messages | ||
| - `DEBUG`: Detailed debug information | ||
|
|
||
| :::warning | ||
| Use `debug` logging only when troubleshooting. It is verbose and may impact performance. | ||
| ::: | ||
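
The TCP mode and the log level are both set through `process.containerOptions`, and a later assignment to the same setting in a Nextflow configuration file replaces an earlier one. To apply both, pass them in a single option string. The following is a minimal sketch that also limits debug logging to one process. The process name `ALIGN_READS` is hypothetical:

```groovy
process {
    // Default for all processes: close TCP connections, keep the default log level
    containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close'

    // Hypothetical process name: enable debug logging only for the
    // process you are troubleshooting, keeping the TCP setting as well
    withName: 'ALIGN_READS' {
        containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close -e FUSION_SNAPSHOT_LOG_LEVEL=debug'
    }
}
```

Scoping debug logging with a `withName` selector keeps the verbose output, and any performance cost, confined to the task under investigation.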
|
|
||
| ## Resource limits | ||
|
|
||
| By default, tasks can request any amount of resources. If a task requests more resources than are available on a single instance, the job waits indefinitely and never runs. Use the `process.resourceLimits` directive to set maximum requested resources below the capacity of a single instance. | ||
|
|
||
| Setting resource limits ensures tasks can checkpoint successfully and prevents jobs from becoming unschedulable. For example: | ||
|
|
||
| ```groovy | ||
| // AWS Batch example (120-second reclamation window) | ||
| process.resourceLimits = [cpus: 32, memory: '60.GB'] | ||
|
|
||
| // Google Cloud Batch example (up to 30-second reclamation window, so more conservative limits). Use instead of the AWS Batch line, not in addition to it. | ||
| process.resourceLimits = [cpus: 16, memory: '20.GB'] | ||
| ``` | ||
|
|
||
| See [AWS Batch](./aws.md) or [Google Cloud Batch](./gcp.md) for more information about reclamation windows. See [`resourceLimits`](https://www.nextflow.io/docs/latest/reference/process.html#resourcelimits) for more configuration options. |
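
Because both assignments in the example above target the same setting, only one of them can be in effect in a given configuration. One way to keep both is to use Nextflow configuration profiles. The following is a minimal sketch that reuses the illustrative limits from above. The profile names `aws_snapshots` and `gcp_snapshots` are hypothetical:

```groovy
profiles {
    // Hypothetical profile name for AWS Batch (120-second reclamation window)
    aws_snapshots {
        process.resourceLimits = [cpus: 32, memory: '60.GB']
    }

    // Hypothetical profile name for Google Cloud Batch (up to 30-second window)
    gcp_snapshots {
        process.resourceLimits = [cpus: 16, memory: '20.GB']
    }
}
```

Select the matching profile at launch with the `-profile` option, for example `-profile aws_snapshots`.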
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| --- | ||
| title: Google Cloud Batch | ||
| description: "Fusion Snapshots configuration and best practices for Google Cloud Batch" | ||
| date created: "2024-11-29" | ||
| last updated: "2025-12-19" | ||
| tags: [fusion, fusion-snapshots, storage, compute, snapshot, gcp, google, batch] | ||
| --- | ||
|
|
||
| Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on Google Cloud Batch preemptible instances. When a preemption occurs, Google Batch provides up to 30 seconds before instance termination. | ||
|
|
||
| :::note | ||
| When using Google Cloud Batch, Fusion Snapshots are currently only available for Seqera Cloud. | ||
| ::: | ||
|
|
||
| :::warning | ||
| Google Cloud provides [at most 30 seconds](https://cloud.google.com/compute/docs/instances/spot) of notice before instance termination. Careful instance selection and conservative memory planning are critical for successful checkpoints. | ||
| ::: | ||
|
|
||
| ## Seqera Platform compute environment requirements | ||
|
|
||
| Fusion Snapshots require the following Seqera Platform compute environment configuration: | ||
|
|
||
| - **Provider**: Google Batch | ||
| - **Work directory**: GCS bucket in the same region as compute resources | ||
| - **Fusion**: Enabled | ||
| - **Wave**: Enabled | ||
| - **Fusion Snapshots (beta)**: Enabled | ||
|
||
| - **Provisioning model**: Spot | ||
|
|
||
| :::tip Configuration | ||
| Fusion Snapshots work with sensible defaults (5 automatic retry attempts). For configuration options, see [Advanced configuration](./configuration.md). | ||
| ::: | ||
|
|
||
| ## Incremental snapshots | ||
|
|
||
| [Incremental snapshots](./index.md#incremental-snapshots) capture only the memory pages that changed between checkpoints and are enabled by default on `x86_64` instances. This is particularly beneficial given Google Batch's shorter reclamation window, so use `x86_64` instances where possible. | ||
|
||
|
|
||
| ## Resource limits | ||
|
|
||
| A single job can request more resources than are available on a single instance. To prevent this, set resource limits using the `process.resourceLimits` directive in your Nextflow configuration. See [Resource limits](./configuration.md#resource-limits) for more information. | ||
|
|
||
| ## Manual cleanup | ||
|
|
||
| The `/fusion` folder in object storage may need manual cleanup. Administrators should verify Fusion has properly cleaned up and remove the folder if necessary. | ||