diff --git a/fusion_docs/guide/snapshots.md b/fusion_docs/guide/snapshots.md deleted file mode 100644 index bf0f4ce55..000000000 --- a/fusion_docs/guide/snapshots.md +++ /dev/null @@ -1,88 +0,0 @@ ---- -title: Fusion Snapshots for AWS Batch -description: "Overview of the Fusion Snapshots feature for AWS Batch" -date: "23 Aug 2024" -tags: [fusion, storage, compute, file system, snapshot] ---- - -Fusion Snapshots enable you to create a checkpoint of a running Nextflow process and restore it on a different machine. Leveraging the Fusion file system, you can move the task between different instances connected to the same S3 bucket. - -More specifically, the first use case for this feature is for Seqera Platform users to leverage AWS Spot instances without restarting an entire task from scratch if an instance is terminated. When a Spot instance interruption occurs, the task is restored from the last checkpoint on a new AWS instance, saving time and computational resources. - -### Seqera Platform compute environment requirements - -Fusion Snapshots v1 requires the following [Seqera compute environment](https://docs.seqera.io/platform-cloud/compute-envs/aws-batch) configuration: - -- **Provider**: AWS Batch -- **Work directory**: An S3 bucket located in the same region as your AWS Batch compute resources -- **Enable Fusion Snapshots (beta)** -- **Config mode**: Batch Forge -- **Provisioning model**: Spot -- **Instance types**: See recommended instance sizes below - -When Fusion Snapshots are enabled, the Nextflow Spot reclamation retry setting automatically defaults to `aws.batch.maxSpotAttempts = 5`. - -### EC2 instance selection guidelines - -- Choose EC2 Spot instances with sufficient memory and network bandwidth to dump the cache of task intermediate files to S3 storage before AWS terminates an instance. -- Select instances with guaranteed network bandwidth (not instances with bandwidth "up to" a maximum value). -- Maintain a 5:1 ratio between memory (GiB) and network bandwidth (Gbps). -- Recommended instance families: `c6id`, `r6id`, or `m6id` series instances work optimally with Fusion fast instance storage. - -:::info Example -A c6id.8xlarge instance provides 64 GiB memory and 12.5 Gbps guaranteed network bandwidth. This configuration can transfer the entire memory contents to S3 in approximately 70 seconds, well within the 2-minute reclamation window. - -Instances with memory:bandwitdth ratios over 5:1 may not complete transfers before termination, potentially resulting in task failures. -::: - -#### Recommended instance types - -| Instance type | Cores | Memory (GiB) | Network bandwidth (Gbps) | Memory:bandwidth ratio | Est. Snapshot time| -|----------------|-------|--------------|--------------------------|------------------------|-------------------| -| c6id.4xlarge | 16 | 32 | 12.5 | 2.56:1 | ~45 seconds | -| c6id.8xlarge | 32 | 64 | 12.5 | 5.12:1 | ~70 seconds | -| r6id.2xlarge | 8 | 16 | 12.5 | 1.28:1 | ~20 seconds | -| m6id.4xlarge | 16 | 64 | 12.5 | 5.12:1 | ~70 seconds | -| c6id.12xlarge | 48 | 96 | 18.75 | 5.12:1 | ~70 seconds | -| r6id.4xlarge | 16 | 128 | 12.5 | 10.24:1 | ~105 seconds | -| m6id.8xlarge | 32 | 128 | 25 | 5.12:1 | ~70 seconds | - -It's possible for a single job to request more resources than are available on a single instance. In this case, the job will wait indefinitely and never progress to running. To prevent this from occurring, you should set the maximum resources requested to below the size of a single instance listed above. 
This can be configured using the `process.resourceLimits` directive in your Nextflow configuration. For example, to limit a single process to fit within a c6id.8xlarge machine, you could set the following: - -```groovy -process.resourceLimits = [cpus: 32, memory: '60.GB'] -``` - -Note that if a process requests the maximum CPUs and memory in the table above, it will not be satisfiable by a single instance and therefore fail to be assigned to a machine. - -### (Seqera Enterprise only) Select an Amazon Linux 2023 ECS-optimized AMI - -To obtain sufficient performance, Fusion Snapshots require instances with Amazon Linux 2023 (which ships with Linux Kernel 6.1), with an ECS Container-optimized AMI. - -:::note -Selecting a custom Amazon Linux 2023 ECS-optimized AMI is only required for compute environments in Seqera Enterprise deployments. Seqera Cloud AWS Batch compute environments use Amazon Linux 2023 AMIs by default. -::: - -To find the recommended AL2023 ECS-optimized AMI for your region, run the following (replace `eu-central-1` with your AWS region): - -```bash -export REGION=eu-central-1 -aws ssm get-parameter --name "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended" --region $REGION -``` - -The result for the `eu-central-1` region is similar to the following: - -```bash -{ - "Parameter": { - "Name": "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended", - "Type": "String", - "Value": "{\"ecs_agent_version\":\"1.88.0\",\"ecs_runtime_version\":\"Docker version 25.0.6\",\"image_id\":\"ami-0281c9a5cd9de63bd\",\"image_name\":\"al2023-ami-ecs-hvm-2023.0.20241115-kernel-6.1-x86_64\",\"image_version\":\"2023.0.20241115\",\"os\":\"Amazon Linux 2023\",\"schema_version\":1,\"source_image_name\":\"al2023-ami-minimal-2023.6.20241111.0-kernel-6.1-x86_64\"}", - "Version": 61, - "LastModifiedDate": "2024-11-18T17:08:46.926000+01:00", - "ARN": "arn:aws:ssm:eu-central-1::parameter/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended", - "DataType": "text" - } -``` - -Note the `image_id` in your result (in this example, `ami-0281c9a5cd9de63bd`). Specify this ID in the **AMI ID** field under **Advanced options** when you create your Seqera compute environment. diff --git a/fusion_docs/guide/snapshots/aws.md b/fusion_docs/guide/snapshots/aws.md new file mode 100644 index 000000000..6cb6cb723 --- /dev/null +++ b/fusion_docs/guide/snapshots/aws.md @@ -0,0 +1,100 @@ +--- +title: AWS Batch +description: "Fusion Snapshots configuration and best practices for AWS Batch" +date created: "2024-11-21" +last updated: "2025-12-19" +tags: [fusion, fusion-snapshots, storage, compute, snapshot, aws, batch] +--- + +Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on AWS Batch Spot instances. When a Spot instance interruption occurs, AWS provides a guaranteed 120-second warning window to checkpoint and save the task state before the instance terminates. + +## Seqera Platform compute environment requirements + +Fusion Snapshots require the following Seqera Platform compute environment configuration: + +- **Provider:** AWS Batch +- **Work directory:** S3 bucket in the same region as compute resources +- **Fusion Snapshots (beta):** Enabled +- **Config mode:** Batch Forge +- **Provisioning model:** Spot +- **AMI:** See [Selecting an AMI](#selecting-an-ami) for details +- **Instance type:** See [Selecting an EC2 instance](#selecting-an-ec2-instance) for details + +:::tip +Fusion Snapshots work with sensible defaults (e.g., 5 automatic retry attempts). 
For configuration options, see [Advanced configuration](./configuration.md).
+:::
+
+### Selecting an AMI
+
+Fusion Snapshots require instances running Amazon Linux 2023 (which ships with Linux Kernel 6.1) and an ECS container-optimized AMI for optimal performance.
+
+#### Seqera Cloud
+
+Seqera Cloud AWS Batch compute environments use an ECS container-optimized AMI by default. No additional AMI configuration is required.
+
+#### Seqera Enterprise
+
+Specify an Amazon Linux 2023 ECS-optimized AMI for your region when creating your compute environment.
+
+To find the recommended AMI:
+
+1. Retrieve the recommended AMI details from the SSM Parameter Store:
+
+    ```bash
+    export REGION=<region>
+    aws ssm get-parameter --name "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended" --region $REGION
+    ```
+
+    Replace `<region>` with your AWS region (for example, `eu-central-1`).
+
+    The output for the `eu-central-1` region is similar to the following:
+
+    ```json
+    {
+      "Parameter": {
+        "Name": "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
+        "Type": "String",
+        "Value": "{\"ecs_agent_version\":\"1.88.0\",\"ecs_runtime_version\":\"Docker version 25.0.6\",\"image_id\":\"ami-0281c9a5cd9de63bd\",\"image_name\":\"al2023-ami-ecs-hvm-2023.0.20241115-kernel-6.1-x86_64\",\"image_version\":\"2023.0.20241115\",\"os\":\"Amazon Linux 2023\",\"schema_version\":1,\"source_image_name\":\"al2023-ami-minimal-2023.6.20241111.0-kernel-6.1-x86_64\"}",
+        "Version": 61,
+        "LastModifiedDate": "2024-11-18T17:08:46.926000+01:00",
+        "ARN": "arn:aws:ssm:eu-central-1::parameter/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
+        "DataType": "text"
+      }
+    }
+    ```
+
+1. Identify the `image_id` in your output (e.g., `ami-0281c9a5cd9de63bd` in the example above) and set it in the **Advanced options > AMI ID** field when you create your Seqera compute environment.
+
+## Selecting an EC2 instance
+
+AWS provides a guaranteed 120-second reclamation window. Select instance types that can transfer checkpoint data within this timeframe. Checkpoint time is primarily determined by memory usage; other factors, such as the number of open file descriptors, also affect performance.
+
+When you select an EC2 instance:
+
+- Select instances with guaranteed network bandwidth, not "up to" values.
+- Maintain a 5:1 ratio between memory (GiB) and network bandwidth (Gbps).
+- Prefer NVMe storage instances (those with a `d` suffix: `c6id`, `r6id`, `m6id`).
+- Use `x86_64` instances for [incremental snapshots](./index.md#incremental-snapshots).
+
+For example, a `c6id.8xlarge` instance provides 64 GiB memory and 12.5 Gbps guaranteed network bandwidth. This configuration can transfer the entire memory contents to S3 in approximately 70 seconds. Instances with memory:bandwidth ratios over 5:1 may not complete transfers before termination and risk task failures.
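+
+To check the memory and stated network bandwidth of a candidate instance type before committing to it, you can query the EC2 API (a quick sanity check; substitute any instance type):
+
+```bash
+# List memory (MiB) and network performance for an instance type
+aws ec2 describe-instance-types \
+  --instance-types c6id.8xlarge \
+  --query "InstanceTypes[].[InstanceType, MemoryInfo.SizeInMiB, NetworkInfo.NetworkPerformance]" \
+  --output table
+```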
+
+| Instance type | Cores | Memory (GiB) | Network bandwidth (Gbps) | Memory:bandwidth ratio | Estimated snapshot time |
+|----------------|-------|--------------|--------------------------|------------------------|-------------------------|
+| `c6id.4xlarge` | 16 | 32 | 12.5 | 2.56:1 | ~45 seconds |
+| `c6id.8xlarge` | 32 | 64 | 12.5 | 5.12:1 | ~70 seconds |
+| `c6id.2xlarge` | 8 | 16 | 12.5 | 1.28:1 | ~20 seconds |
+| `m6id.4xlarge` | 16 | 64 | 12.5 | 5.12:1 | ~70 seconds |
+| `c6id.12xlarge`| 48 | 96 | 18.75 | 5.12:1 | ~70 seconds |
+| `r6id.4xlarge` | 16 | 128 | 12.5 | 10.24:1 | ~105 seconds |
+| `m6id.8xlarge` | 32 | 128 | 25 | 5.12:1 | ~70 seconds |
+
+:::info
+[Incremental snapshots](./index.md#incremental-snapshots) are enabled by default on `x86_64` instances.
+:::
+
+## Resource limits
+
+A single job can request more resources than are available on any single instance, in which case it waits indefinitely. To prevent this, set resource limits using the `process.resourceLimits` directive in your Nextflow configuration. See [Resource limits](./configuration.md#resource-limits) for more information.
+
+## Manual cleanup
+
+The `/fusion` folder in object storage may need manual cleanup. Administrators should verify that Fusion has cleaned up the folder after runs complete and remove it manually if necessary.
diff --git a/fusion_docs/guide/snapshots/configuration.md b/fusion_docs/guide/snapshots/configuration.md
new file mode 100644
index 000000000..322f7b67a
--- /dev/null
+++ b/fusion_docs/guide/snapshots/configuration.md
@@ -0,0 +1,136 @@
+---
+title: Advanced configuration
+description: "Advanced configuration options for Fusion Snapshots"
+date created: "2024-11-29"
+last updated: "2025-12-19"
+tags: [fusion, fusion-snapshots, snapshot, configuration, nextflow]
+---
+
+Fusion Snapshots work optimally with the default configuration for most workloads. You typically do not need to modify these settings unless you have specific organizational policies, experience issues with the default behavior, or have edge-case requirements.
+
+:::tip
+For troubleshooting, focus on task memory usage and instance selection before adjusting these advanced configuration options. See [Troubleshooting](../../troubleshooting.md) for more information.
+:::
+
+## Retry handling
+
+When Spot instances are reclaimed, you can configure how Nextflow retries the tasks. There are two approaches:
+
+- [Automatic retries with `maxSpotAttempts`](#automatic-retries-with-maxspotattempts)
+- [Fine-grained retries with `errorStrategy`](#fine-grained-retries-with-errorstrategy)
+
+### Automatic retries with `maxSpotAttempts`
+
+The simplest approach uses `maxSpotAttempts` to automatically retry any task that fails due to Spot reclamation, regardless of the specific failure reason. When you enable Fusion Snapshots, Nextflow automatically sets `maxSpotAttempts = 5`. This allows the checkpoint to be restored on a new instance after reclamation up to 5 times.
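+
+This default is equivalent to setting the following explicitly in your Nextflow configuration (AWS Batch shown; the Google Cloud Batch equivalent is `google.batch.maxSpotAttempts`):
+
+```groovy
+// Applied automatically when Fusion Snapshots are enabled
+aws.batch.maxSpotAttempts = 5
+```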
+
+**Increase retries**
+
+If you experience frequent Spot reclamations, increase `maxSpotAttempts` above `5`:
+
+- AWS Batch:
+
+  ```groovy
+  aws.batch.maxSpotAttempts = 10
+  ```
+
+- Google Cloud Batch:
+
+  ```groovy
+  google.batch.maxSpotAttempts = 10
+  ```
+
+**Disable retries**
+
+To disable automatic retries, set `maxSpotAttempts = 0`:
+
+- AWS Batch:
+
+  ```groovy
+  aws.batch.maxSpotAttempts = 0
+  ```
+
+- Google Cloud Batch:
+
+  ```groovy
+  google.batch.maxSpotAttempts = 0
+  ```
+
+### Fine-grained retries with `errorStrategy`
+
+For fine-grained control of retries, configure your Nextflow [`errorStrategy`](https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy) to implement retry logic based on specific checkpoint failure types. This allows you to handle different failure scenarios in different ways, such as treating checkpoint dump failures differently from restore failures.
+
+To configure this, set `maxSpotAttempts = 0` and add an [`errorStrategy`](https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy) to your process configuration. For example:
+
+```groovy
+process {
+    maxRetries = 2
+    errorStrategy = {
+        if (task.exitStatus == 175) {
+            return 'retry' // Retry checkpoint dump failures
+        } else {
+            return 'terminate' // Don't retry other failures
+        }
+    }
+}
+```
+
+**Exit codes**:
+
+- `175`: Checkpoint dump failed — The snapshot could not be saved (e.g., insufficient memory, I/O errors).
+- `176`: Checkpoint restore failed — The snapshot could not be restored on the new instance.
+
+See [`errorStrategy`](https://www.nextflow.io/docs/latest/reference/process.html#errorstrategy) for more configuration options.
+
+## TCP connection handling
+
+By default, Fusion Snapshots use `established` mode to preserve TCP connections during checkpoint operations. This works well for plain TCP connections. If your application uses encrypted connections (HTTPS, SSH, and so on), configure TCP close mode, because CRIU cannot preserve encrypted connections.
+
+To close all TCP connections during checkpoint operations, set:
+
+```groovy
+process.containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close'
+```
+
+**Options:**
+
+- `established`: Preserve TCP connections (default).
+- `close`: Close all TCP connections during checkpoint.
+
+## Debug logging
+
+By default, Fusion Snapshots use `WARN` level logging (warnings and errors only). If you are troubleshooting checkpoint issues, you can enable more detailed logging to help diagnose problems.
+
+To enable debug logging, set:
+
+```groovy
+process.containerOptions = '-e FUSION_SNAPSHOT_LOG_LEVEL=debug'
+```
+
+**Log levels**:
+
+- `ERROR`: Only critical errors
+- `WARN`: Warnings and errors (default)
+- `INFO`: General informational messages
+- `DEBUG`: Detailed debug information
+
+:::warning
+Use `debug` logging only when troubleshooting. It is verbose and may impact performance.
+:::
+
+## Resource limits
+
+By default, tasks can request any amount of resources. If a task requests more resources than are available on a single instance, the job waits indefinitely and never runs. Use the `process.resourceLimits` directive to set the maximum requested resources below the capacity of a single instance.
+
+Setting resource limits ensures tasks can checkpoint successfully and prevents jobs from becoming unschedulable.
For example:
+
+```groovy
+// AWS Batch example (120-second reclamation window)
+process.resourceLimits = [cpus: 32, memory: '60.GB']
+
+// Google Cloud Batch example (up to 30-second reclamation window; more conservative)
+process.resourceLimits = [cpus: 16, memory: '20.GB']
+```
+
+See [AWS Batch](./aws.md) or [Google Cloud Batch](./gcp.md) for more information about reclamation windows. See [`resourceLimits`](https://www.nextflow.io/docs/latest/reference/process.html#resourcelimits) for more configuration options.
diff --git a/fusion_docs/guide/snapshots/gcp.md b/fusion_docs/guide/snapshots/gcp.md
new file mode 100644
index 000000000..f78f5352a
--- /dev/null
+++ b/fusion_docs/guide/snapshots/gcp.md
@@ -0,0 +1,44 @@
+---
+title: Google Cloud Batch
+description: "Fusion Snapshots configuration and best practices for Google Cloud Batch"
+date created: "2024-11-29"
+last updated: "2025-12-19"
+tags: [fusion, fusion-snapshots, storage, compute, snapshot, gcp, google, batch]
+---
+
+Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on Google Cloud Batch preemptible instances. When a preemption occurs, Google Batch provides up to 30 seconds before instance termination.
+
+:::note
+When using Google Cloud Batch, Fusion Snapshots are currently available only on Seqera Cloud.
+:::
+
+:::warning
+Google Cloud [guarantees only up to 30 seconds](https://cloud.google.com/compute/docs/instances/spot) before instance termination. Careful instance selection and conservative memory planning are critical for successful checkpoints.
+:::
+
+## Seqera Platform compute environment requirements
+
+Fusion Snapshots require the following Seqera Platform compute environment configuration:
+
+- **Provider**: Google Batch
+- **Work directory**: GCS bucket in the same region as compute resources
+- **Fusion**: Enabled
+- **Wave**: Enabled
+- **Fusion Snapshots (beta)**: Enabled
+- **Provisioning model**: Spot
+
+:::tip Configuration
+Fusion Snapshots work with sensible defaults (5 automatic retry attempts). For configuration options, see [Advanced configuration](./configuration.md).
+:::
+
+## Incremental snapshots
+
+[Incremental snapshots](./index.md#incremental-snapshots) are enabled by default on `x86_64` instances and capture only the memory pages changed between checkpoints. This is particularly beneficial given Google Batch's shorter reclamation window. Use `x86_64` instances to enable incremental snapshots.
+
+## Resource limits
+
+A single job can request more resources than are available on any single instance, in which case it waits indefinitely. To prevent this, set resource limits using the `process.resourceLimits` directive in your Nextflow configuration. See [Resource limits](./configuration.md#resource-limits) for more information.
+
+## Manual cleanup
+
+The `/fusion` folder in object storage may need manual cleanup. Administrators should verify that Fusion has cleaned up the folder after runs complete and remove it manually if necessary.
diff --git a/fusion_docs/guide/snapshots/index.md b/fusion_docs/guide/snapshots/index.md
new file mode 100644
index 000000000..1db32a300
--- /dev/null
+++ b/fusion_docs/guide/snapshots/index.md
@@ -0,0 +1,57 @@
+---
+title: Fusion Snapshots
+description: "Introduction to Fusion Snapshots checkpoint/restore functionality"
+date created: "2024-11-29"
+last updated: "2025-12-19"
+tags: [fusion, fusion-snapshots, storage, snapshot, checkpoint, restore]
+---
+
+Fusion Snapshots enable checkpoint/restore functionality for Nextflow pipeline processes running on cloud Spot/preemptible instances.
When a cloud provider reclaims an instance, Fusion Snapshots creates a checkpoint of the running process and restores it on a new instance, allowing the process to resume exactly where it left off.
+
+Key benefits of Fusion Snapshots include:
+
+- **Cost savings**: Use Spot instances without risk of lost work.
+- **Time efficiency**: Resume from the interruption point instead of restarting tasks.
+- **Resource optimization**: Avoid recomputing completed work.
+- **Automatic operation**: Your pipelines require no code changes.
+
+## Cloud provider support
+
+Fusion Snapshots are available for the following cloud providers:
+
+- **[AWS Batch with Spot instances](./aws.md)**: 120-second guaranteed reclamation window.
+- **[Google Batch with preemptible instances](./gcp.md)**: Up to 30-second reclamation window.
+
+## Incremental snapshots
+
+Incremental snapshots optimize performance by capturing only changed memory pages between checkpoints. This reduces snapshot time and data transfer. Fusion Snapshots automatically perform incremental snapshots on `x86_64` instances.
+
+Key features of incremental snapshots include:
+
+- **Pre-dumps**: Capture only the memory pages changed since the last checkpoint.
+- **Full dumps**: Capture the complete process state periodically.
+- **Automatic**: Enabled by default, no configuration needed.
+- **Efficient**: Reduces checkpoint time and data transfer.
+
+## How Fusion Snapshots work
+
+Fusion Snapshots use [CRIU](https://criu.org/) (Checkpoint/Restore In Userspace) to capture the complete state of a running process, including:
+
+- Process memory
+- Open files and file descriptors
+- Process tree and relationships
+- Execution state
+
+When the system detects a Spot instance interruption:
+
+1. The system freezes the process and creates a snapshot of its state.
+1. Snapshot data is kept in sync with remote object storage via Fusion.
+1. On a new instance, the process state is downloaded and restored.
+1. The process continues execution from the exact point it was interrupted.
+
+## Get started
+
+To get started with your cloud provider, see:
+
+- [AWS Batch](./aws.md)
+- [Google Cloud Batch](./gcp.md)
diff --git a/fusion_docs/sidebar.json b/fusion_docs/sidebar.json
index d7dd79832..037b2813d 100644
--- a/fusion_docs/sidebar.json
+++ b/fusion_docs/sidebar.json
@@ -12,7 +12,17 @@
      "guide/azure-batch",
      "guide/aws-eks",
      "guide/gcp-gke",
-      "guide/snapshots",
+      {
+        "type": "category",
+        "label": "Fusion Snapshots",
+        "link": {"type": "doc", "id": "guide/snapshots/index"},
+        "collapsed": true,
+        "items": [
+          "guide/snapshots/aws",
+          "guide/snapshots/gcp",
+          "guide/snapshots/configuration"
+        ]
+      },
      {
        "type": "category",
        "label": "Local execution",
diff --git a/fusion_docs/troubleshooting.md b/fusion_docs/troubleshooting.md
index e20c712fa..2f9f0f3c1 100644
--- a/fusion_docs/troubleshooting.md
+++ b/fusion_docs/troubleshooting.md
@@ -1,11 +1,323 @@
---
title: Troubleshooting
+description: "Troubleshooting for Fusion issues"
+date created: "2025-11-29"
+last updated: "2025-12-19"
+tags: [troubleshooting, fusion, fusion-snapshots, configuration]
---
-## Too many open files
+## General
-If you're experiencing an error about too many open files, increase the `ulimit` for the container. Append the following to your Nextflow configuration:
+### Too many open files
+
+**Issue**
+
+Tasks fail with an error about too many open files.
+
+**Cause**
+
+The default file descriptor limit is too low for the container workload.
+
+**Solution**
+
+Increase the `ulimit` for the container.
Append the following to your Nextflow configuration:
+
+```groovy
+process.containerOptions = '--ulimit nofile=1048576:1048576'
+```
+
+## Fusion Snapshots
+
+### Exit code `175`: Checkpoint dump failed
+
+**Issue**
+
+Task fails with exit code `175`, indicating the checkpoint operation did not complete successfully.
+
+**Cause**
+
+1. Checkpoint timeout - The process could not be saved within the reclamation window, typically due to high memory usage; a large number of open file descriptors or a complex process tree can also slow the dump. The reclamation windows are:
+   - AWS Batch: 120 seconds (guaranteed)
+   - Google Batch: Up to 30 seconds (not guaranteed)
+2. Insufficient network bandwidth - Cannot upload checkpoint data fast enough.
+3. Disk space issues - Not enough local storage for checkpoint files.
+
+**Solution**
+
+1. Reduce memory usage:
+
+    - Lower the memory requested by tasks.
+    - Process smaller data chunks.
+    - Set `process.resourceLimits` to enforce limits:
+
+      ```groovy
+      // AWS Batch example
+      process.resourceLimits = [cpus: 32, memory: '60.GB']
+
+      // Google Batch example (more conservative for 30s window)
+      process.resourceLimits = [cpus: 16, memory: '20.GB']
+      ```
+
+2. Increase network bandwidth:
+
+    - Use instance types with higher guaranteed network bandwidth.
+    - Ensure the memory:bandwidth ratio is appropriate (5:1 or better for AWS).
+
+3. Enable incremental snapshots (automatic on `x86_64`):
+
+    - Verify you're using the `x86_64` architecture: `uname -m`
+    - Avoid ARM64 instances if checkpoints are failing.
+
+4. Configure a retry strategy:
+
+    ```groovy
+    process {
+        maxRetries = 2
+        errorStrategy = {
+            if (task.exitStatus == 175) {
+                return 'retry'
+            } else {
+                return 'terminate'
+            }
+        }
+    }
+    ```
+
+See [AWS Batch instance selection](./guide/snapshots/aws.md#selecting-an-ec2-instance) or [Google Batch best practices](./guide/snapshots/gcp.md) for recommended configurations.
+
+### Exit code `176`: Checkpoint restore failed
+
+**Issue**
+
+Task fails with exit code `176` when attempting to restore from a checkpoint.
+
+**Cause**
+
+1. Corrupted checkpoint - The previous checkpoint did not complete properly.
+2. Missing checkpoint files - Checkpoint data is missing or inaccessible in object storage.
+3. State conflict - Attempting to restore while a dump is still in progress.
+4. Environment mismatch - Different environment between checkpoint and restore.
+
+**Solution**
+
+1. Check whether the previous checkpoint completed:
+    - Review logs for "Dumping finished successfully".
+    - If the "Dumping finished successfully" message is missing, the previous checkpoint timed out (exit code `175`).
+
+2. Verify checkpoint data exists:
+    - Check that the `.fusion/dump/` work directory contains checkpoint files.
+    - Ensure that the S3/GCS bucket is accessible.
+    - If the bucket is missing, open a support ticket. See [Getting help](#getting-help) for more information.
+
+3. Configure retry for dump failures first:
+    - Handle exit code `175` with retry. See [Retry handling](./guide/snapshots/configuration.md#retry-handling) for more information.
+
+### Long checkpoint times
+
+**Issue**
+
+Checkpoints take longer than expected, approaching timeout limits.
+
+**Cause**
+
+1. High memory usage - Memory is typically the primary factor affecting checkpoint time.
+2. ARM64 architecture - Only full dumps are available (no incremental snapshots).
+3. Insufficient network bandwidth - Instance bandwidth is too low for the memory size.
+4. Open file descriptors - A large number of open files or a complex process tree (see the check below).
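+
+To check whether open file descriptors are a contributing factor, you can count the descriptors held by the task's main process from inside the running container (a quick check; `<pid>` is the process ID of your tool and is a placeholder):
+
+```bash
+# Count open file descriptors for process <pid>
+ls /proc/<pid>/fd | wc -l
+```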
+ +**Solution** + +1. For AWS Batch (120-second window): + - Use instances with 5:1 or better memory:bandwidth ratio. + - Use `x86_64` instances for incremental snapshot support (`c6id`, `m6id`, `r6id` families). + - Check architecture: `uname -m` + +2. For Google Batch (30-second window): + - Use `x86_64` instances (mandatory for larger workloads). + - Use more conservative memory limits. + - Consider smaller instance types with better ratios. + +3. Review instance specifications: + - Verify guaranteed network bandwidth (not "up to" values). + - Prefer NVMe storage instances on AWS (instances with `d` suffix). + +See [Selecting an EC2 instance](./guide/snapshots/aws.md#selecting-an-ec2-instance) for detailed recommendations. + +### Frequent checkpoint failures + +**Issue** + +Checkpoints consistently fail across multiple tasks. + +**Cause** + +1. Task too large for reclamation window - Memory usage exceeds what can be checkpointed in time (more common on Google Batch with 30-second window). +2. Network congestion or throttling - Bandwidth lower than instance specifications. +3. ARM64 architecture limitations - Full dumps only, requiring much more time and bandwidth. + +**Solution** + +1. Split large tasks: + - Break into smaller, checkpointable units. + - Process data in chunks. + +2. Switch to `x86_64` instances: + - Essential for Google Batch. + - Recommended for AWS Batch tasks > 40 GiB. + +3. Adjust memory limits: + ```groovy + // For AWS Batch + process.resourceLimits = [cpus: 32, memory: '60.GB'] + + // For Google Batch (more conservative) + process.resourceLimits = [cpus: 16, memory: '20.GB'] + ``` + +### SSL/TLS connection errors after restore + +**Issue** + +Applications fail after restore with connection errors, especially HTTPS connections. + +**Cause** + +CRIU cannot preserve encrypted TCP connections (SSL/TLS). + +**Solution** + +Configure TCP close mode to drop connections during checkpoint: + +```groovy +process.containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close' +``` + +Applications will need to re-establish connections after restore. See [TCP connection handling](./guide/snapshots/configuration.md#tcp-connection-handling) for more information. + +### Debugging workflow + +To diagnose checkpoint problems: + +1. Check the exit code to identify the failure type: + + - **Exit code `175`**: Checkpoint dump failed - The snapshot could not be saved. + - **Exit code `176`**: Checkpoint restore failed - The snapshot could not be restored. + - **Other exit codes**: Likely an application error, not snapshot-related. + +1. Review task logs: + + - Check `.command.log` in the task work directory for Fusion Snapshots messages (prefixed with timestamps). + + :::tip + Enable `debug` logging for more details. + + ```groovy + process.containerOptions = '-e FUSION_SNAPSHOT_LOG_LEVEL=debug' + ``` + ::: + +1. Inspect your checkpoint data: + + 1. Open the `.fusion/dump/` folder: + + ```console + .fusion/dump/ + ├── 1/ # First dump + │ ├── pre_*.log # Pre-dump log (if incremental) + │ └── + ├── 2/ # Second dump + │ ├── pre_*.log + │ └── + ├── 3/ # Third dump (full) + │ ├── dump_*.log # Full dump log + │ ├── restore_*.log # Restore log (if restored) + │ └── + └── dump_metadata # Metadata tracking all dumps + ``` + + 1. 
For incremental dumps (PRE type), check for success markers at the end of the `pre_*.log` file:
+
+      ```console
+      (66.525687) page-pipe: Killing page pipe
+      (66.563939) irmap: Running irmap pre-dump
+      (66.610871) Writing stats
+      (66.658902) Pre-dumping finished successfully
+      ```
+
+   1. For full dumps (FULL type), check for success markers at the end of the `dump_*.log` file:
+
+      ```console
+      (25.867099) Unseizing 90 into 2
+      (27.160829) Writing stats
+      (27.197458) Dumping finished successfully
+      ```
+
+   1. If the log ends abruptly without a success message, check the last timestamp:
+
+      ```console
+      (121.37535) Dumping path for 329 fd via self 353 [/path/to/file.tmp]
+      (121.65146) 90 fdinfo 330: pos: 0x4380000 flags: 100000/0
+      # Log truncated - instance was reclaimed before dump completed
+      ```
+
+      - AWS Batch: Timestamps near 120 seconds indicate the instance terminated during the dump.
+      - Google Batch: Timestamps near 30 seconds indicate the instance terminated during the dump.
+
+      **Cause**: Task memory too large or bandwidth too low for the reclamation window.
+
+   1. For restore operations, check for a success marker at the end of the `restore_*.log` file:
+
+      ```console
+      (145.81974) Running pre-resume scripts
+      (145.81994) Restore finished successfully. Tasks resumed.
+      (145.82001) Writing stats
+      ```
+
+1. Verify that your environment is properly configured:
+
+   - Instance type has sufficient network bandwidth.
+   - Memory usage is within safe limits for your cloud provider.
+   - Architecture is `x86_64` (not ARM64) if experiencing issues.
+   - Fusion Snapshots are enabled in your compute environment.
+
+1. Test with different instance types. If you are uncertain about the cause:
+
+   - Run the same task on instance types with better disk IOPS and bandwidth guarantees and verify whether Fusion Snapshots work there.
+   - Decrease memory usage to a manageable amount.
+
+### Getting help
+
+When contacting Seqera support about Fusion Snapshots issues, provide the following information to help diagnose the problem:
+
+1. **Task information**:
+
+   - Nextflow version
+   - Cloud provider (AWS Batch or Google Cloud Batch)
+   - Instance type used
+   - Memory and CPU requested
+   - Linux kernel version
+
+2. **Error details**:
+
+   - Exit code (especially `175` or `176` for snapshot failures)
+   - Task logs from the work directory (`.command.log`)
+   - Fusion Snapshots logs (if available)
+   - Timestamp of failure
+
+3. **Configuration**:
+
+   - Compute environment settings in Platform
+   - Nextflow config related to Fusion Snapshots (`fusion.snapshots.*` settings)
+   - Architecture (`x86_64` or ARM64)
+
+4. **Dump data** (if available):
+
+   Diagnostic data from snapshot operations can help identify the root cause:
+
+   - Preferred: Complete `.fusion/dump/` directory from the task work directory.
+   - Minimum: The `dump_metadata` file and all `*.log` files from numbered dump folders.
+
+   If the directory is too large to share, prioritize the metadata and log files over the full checkpoint data.
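+
+For example, to package the preferred diagnostic data from a task work directory into a single archive before opening a ticket (a minimal sketch; the work directory path is a placeholder to adjust for your run):
+
+```bash
+# Bundle the Fusion Snapshots dump directory for Seqera support
+cd /path/to/task/workdir   # hypothetical task work directory
+tar -czf fusion-dump.tar.gz .fusion/dump/
+```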