seqeralabs
diff --git a/fusion_docs/guide/snapshots.md b/fusion_docs/guide/snapshots/aws.md
fusion_docs/guide/snapshots.md renamed to fusion_docs/guide/snapshots/aws.md
Lines changed: 77 additions & 46 deletions
diff --git a/fusion_docs/guide/snapshots/configuration.md b/fusion_docs/guide/snapshots/configuration.md
Lines changed: 128 additions & 0 deletions
diff --git a/fusion_docs/guide/snapshots/gcp.md b/fusion_docs/guide/snapshots/gcp.md
Lines changed: 51 additions & 0 deletions
@@ -1,43 +1,90 @@
 ---
 title: Fusion Snapshots for AWS Batch
-description: "Overview of the Fusion Snapshots feature for AWS Batch"
-date: "23 Aug 2024"
-tags: [fusion, storage, compute, file system, snapshot]
+description: "Fusion Snapshots configuration and best practices for AWS Batch"
+date: "21 Nov 2024"
+tags: [fusion, storage, compute, snapshot, aws, batch]
 ---
 
-Fusion Snapshots enable you to create a checkpoint of a running Nextflow process and restore it on a different machine. Leveraging the Fusion file system, you can move the task between different instances connected to the same S3 bucket.
+Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on AWS Batch Spot instances. When a Spot instance interruption occurs, AWS provides a **guaranteed 120-second warning window** to checkpoint and save the task state before the instance terminates.
 
-More specifically, the first use case for this feature is for Seqera Platform users to leverage AWS Spot instances without restarting an entire task from scratch if an instance is terminated. When a Spot instance interruption occurs, the task is restored from the last checkpoint on a new AWS instance, saving time and computational resources.
+## Requirements
 
-### Seqera Platform compute environment requirements
-
-Fusion Snapshots v1 requires the following [Seqera compute environment](https://docs.seqera.io/platform-cloud/compute-envs/aws-batch) configuration:
+### Seqera Platform compute environment
 
 - **Provider**: AWS Batch
-- **Work directory**: An S3 bucket located in the same region as your AWS Batch compute resources
+- **Work directory**: S3 bucket in the same region as compute resources
 - **Enable Fusion Snapshots (beta)**
 - **Config mode**: Batch Forge
-- **Provisioning model**: Spot 
-- **Instance types**: See recommended instance sizes below
+- **Provisioning model**: Spot
+- **Instance types**: See recommendations below
+
+:::tip Configuration
+Fusion Snapshots work with sensible defaults (5 automatic retry attempts). For advanced configuration options like changing retry behavior or TCP handling, see the [Configuration Guide](configuration.md).
+:::
+
+### (Seqera Enterprise only) Select an Amazon Linux 2023 ECS-optimized AMI
+
+To obtain sufficient performance, Fusion Snapshots require instances with Amazon Linux 2023 (which ships with Linux Kernel 6.1), with an ECS Container-optimized AMI.
+
+:::note
+Selecting a custom Amazon Linux 2023 ECS-optimized AMI is only required for compute environments in Seqera Enterprise deployments. Seqera Cloud AWS Batch compute environments use Amazon Linux 2023 AMIs by default.
+:::
+
+To find the recommended AL2023 ECS-optimized AMI for your region, run the following (replace `eu-central-1` with your AWS region):
+
+
+```bash
+export REGION=eu-central-1
+aws ssm get-parameter --name "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended" --region "$REGION"
+```
+
+The result for the `eu-central-1` region is similar to the following:
+
+```json
+{
+    "Parameter": {
+        "Name": "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
+        "Type": "String",
+        "Value": "{\"ecs_agent_version\":\"1.88.0\",\"ecs_runtime_version\":\"Docker version 25.0.6\",\"image_id\":\"ami-0281c9a5cd9de63bd\",\"image_name\":\"al2023-ami-ecs-hvm-2023.0.20241115-kernel-6.1-x86_64\",\"image_version\":\"2023.0.20241115\",\"os\":\"Amazon Linux 2023\",\"schema_version\":1,\"source_image_name\":\"al2023-ami-minimal-2023.6.20241111.0-kernel-6.1-x86_64\"}",
+        "Version": 61,
+        "LastModifiedDate": "2024-11-18T17:08:46.926000+01:00",
+        "ARN": "arn:aws:ssm:eu-central-1::parameter/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
+        "DataType": "text"
+    }
+}
+```
+
+Note the `image_id` in your result (in this example, `ami-0281c9a5cd9de63bd`). Specify this ID in the **AMI ID** field under **Advanced options** when you create your Seqera compute environment.
 
-When Fusion Snapshots are enabled, the Nextflow Spot reclamation retry setting automatically defaults to `aws.batch.maxSpotAttempts = 5`. 
+## Incremental snapshots
 
-### EC2 instance selection guidelines
+Incremental snapshots are enabled by default on x86_64 instances. They capture only the memory pages that changed between checkpoints, which speeds up checkpoint operations and reduces data transfer.
+
+## EC2 instance selection
+
+AWS provides a **guaranteed 120-second reclamation window**. Choose instances that can transfer checkpoint data within this time frame. Checkpoint time is primarily determined by memory usage, though other factors like number of open file descriptors also contribute.
+
+### Key guidelines
+
+- Select instances with **guaranteed** network bandwidth (not "up to" values)
+- Keep the ratio of memory (GiB) to network bandwidth (Gbps) at **5:1 or lower**
+- Prefer **NVMe storage** instances (`d` suffix: `c6id`, `r6id`, `m6id`)
+- Use **x86_64** instances for incremental snapshot support
+
+### Recommended instance types
 
 - Choose EC2 Spot instances with sufficient memory and network bandwidth to dump the cache of task intermediate files to S3 storage before AWS terminates an instance.
 - Select instances with guaranteed network bandwidth (not instances with bandwidth "up to" a maximum value).
 - Maintain a 5:1 ratio between memory (GiB) and network bandwidth (Gbps).
 - Recommended instance families: `c6id`, `r6id`, or `m6id` series instances work optimally with Fusion fast instance storage.
 
-:::info Example
+:::info With incremental snapshots
 A c6id.8xlarge instance provides 64 GiB memory and 12.5 Gbps guaranteed network bandwidth. This configuration can transfer the entire memory contents to S3 in approximately 70 seconds, well within the 2-minute reclamation window.
 
 Instances with memory:bandwidth ratios over 5:1 may not complete transfers before termination, potentially resulting in task failures.
 :::
 
-#### Recommended instance types
 
-| Instance type  | Cores | Memory (GiB) | Network bandwidth (Gbps) | Memory:bandwidth ratio | Est. Snapshot time|
+| Instance type  | Cores | Memory (GiB) | Network bandwidth (Gbps) | Memory:bandwidth ratio | Est. snapshot time|
 |----------------|-------|--------------|--------------------------|------------------------|-------------------|
 | c6id.4xlarge   | 16    | 32           | 12.5                     | 2.56:1                 | ~45 seconds       |
 | c6id.8xlarge   | 32    | 64           | 12.5                     | 5.12:1                 | ~70 seconds       |
@@ -47,42 +94,26 @@ Instances with memory:bandwitdth ratios over 5:1 may not complete transfers befo
 | r6id.4xlarge   | 16    | 128          | 12.5                     | 10.24:1                | ~105 seconds      |
 | m6id.8xlarge   | 32    | 128          | 25                       | 5.12:1                 | ~70 seconds       |
 
-It's possible for a single job to request more resources than are available on a single instance. In this case, the job will wait indefinitely and never progress to running. To prevent this from occurring, you should set the maximum resources requested to below the size of a single instance listed above. This can be configured using the `process.resourceLimits` directive in your Nextflow configuration. For example, to limit a single process to fit within a c6id.8xlarge machine, you could set the following:
-
-```groovy
-process.resourceLimits = [cpus: 32, memory: '60.GB']
-```
 
-Note that if a process requests the maximum CPUs and memory in the table above, it will not be satisfiable by a single instance and therefore fail to be assigned to a machine.
+### Resource limits
 
-### (Seqera Enterprise only) Select an Amazon Linux 2023 ECS-optimized AMI 
+A single job can request more resources than any one instance provides. When that happens, the job waits indefinitely and never starts running. To prevent this, cap the resources a process can request below the size of a single instance listed above, using the `process.resourceLimits` directive in your Nextflow configuration. For example, to keep a single process within a `c6id.8xlarge` machine, set the following:
 
-To obtain sufficient performance, Fusion Snapshots require instances with Amazon Linux 2023 (which ships with Linux Kernel 6.1), with an ECS Container-optimized AMI. 
 
-:::note
-Selecting a custom Amazon Linux 2023 ECS-optimized AMI is only required for compute environments in Seqera Enterprise deployments. Seqera Cloud AWS Batch compute environments use Amazon Linux 2023 AMIs by default. 
-:::
+```groovy
+process.resourceLimits = [cpus: 32, memory: '60.GB']
+```
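+
+As a rough cross-check, the 5:1 guideline translates into a memory ceiling of about five times the guaranteed bandwidth in Gbps. The sketch below works this through for two rows of the table above. The `m6id.8xlarge` limit is an illustrative assumption rather than a tested recommendation, and the two assignments are alternatives: set only the one that matches your instance type.
+
+```groovy
+// c6id.8xlarge: 12.5 Gbps x 5 = 62.5 GiB ceiling, so a 60 GB limit fits under it
+process.resourceLimits = [cpus: 32, memory: 60.GB]
+
+// m6id.8xlarge: 25 Gbps x 5 = 125 GiB ceiling (assumed 120 GB limit leaves some headroom)
+// process.resourceLimits = [cpus: 32, memory: 120.GB]
+```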
 
-To find the recommended AL2023 ECS-optimized AMI for your region, run the following (replace `eu-central-1` with your AWS region):
+## Best practices
 
-```bash 
-export REGION=eu-central-1
-aws ssm get-parameter --name "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended" --region $REGION
-```
+- **Instance selection**: Prefer instances with 5:1 or lower memory:bandwidth ratio
+- **Architecture**: Use x86_64 instances to enable incremental snapshots
 
-The result for the `eu-central-1` region is similar to the following:
+## Troubleshooting
 
-```bash 
-{
-    "Parameter": {
-        "Name": "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
-        "Type": "String",
-        "Value": "{\"ecs_agent_version\":\"1.88.0\",\"ecs_runtime_version\":\"Docker version 25.0.6\",\"image_id\":\"ami-0281c9a5cd9de63bd\",\"image_name\":\"al2023-ami-ecs-hvm-2023.0.20241115-kernel-6.1-x86_64\",\"image_version\":\"2023.0.20241115\",\"os\":\"Amazon Linux 2023\",\"schema_version\":1,\"source_image_name\":\"al2023-ami-minimal-2023.6.20241111.0-kernel-6.1-x86_64\"}",
-        "Version": 61,
-        "LastModifiedDate": "2024-11-18T17:08:46.926000+01:00",
-        "ARN": "arn:aws:ssm:eu-central-1::parameter/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
-        "DataType": "text"
-    }
-```
+- **Exit code 175**: Dump failed, likely due to timeout. Reduce memory usage or increase bandwidth.
+- **Exit code 176**: Restore failed. Check logs and verify checkpoint data integrity.
+- **Long checkpoint times**: Review instance network bandwidth and consider x86_64 instances for incremental snapshots.
+- **State stuck in DUMPING**: Previous checkpoint exceeded reclamation window.
 
-Note the `image_id` in your result (in this example, `ami-0281c9a5cd9de63bd`). Specify this ID in the **AMI ID** field under **Advanced options** when you create your Seqera compute environment. 
+For detailed troubleshooting, see the [Troubleshooting Guide](troubleshooting.md).
@@ -0,0 +1,128 @@
+---
+title: Fusion Snapshots Configuration
+description: "Advanced configuration options for Fusion Snapshots"
+date: "22 Nov 2024"
+tags: [fusion, snapshot, configuration, nextflow]
+---
+
+Fusion Snapshots are designed with sensible defaults and typically require no additional configuration. The settings below are provided for edge cases and advanced use scenarios.
+
+:::note
+You likely don't need to change these settings. Fusion Snapshots work out of the box with optimal defaults for most workloads.
+:::
+
+## Spot reclamation retries
+
+Control how many times Nextflow automatically retries a task after Spot instance reclamation.
+
+- **Default with Fusion Snapshots enabled**: `5` (automatic retries on Spot reclamation)
+- **Default without Fusion Snapshots**: `0` (no automatic retries)
+
+:::info
+When Fusion Snapshots are enabled, Nextflow automatically sets `maxSpotAttempts = 5`, so that a task interrupted by Spot reclamation is retried and its checkpoint is restored on a new instance.
+:::
+
+### AWS Batch
+
+```groovy
+// Set only one of the following; a later assignment overrides an earlier one
+aws.batch.maxSpotAttempts = 10     // Increase retries beyond default of 5
+// aws.batch.maxSpotAttempts = 0   // Disable automatic retries
+```
+
+### Google Batch
+
+```groovy
+// Set only one of the following; a later assignment overrides an earlier one
+google.batch.maxSpotAttempts = 10     // Increase retries beyond default of 5
+// google.batch.maxSpotAttempts = 0   // Disable automatic retries
+```
+
+**When to customize**:
+- Increase above 5 if you expect frequent Spot reclamations
+- Set to 0 if you want to handle retries only through error strategies
+- Most users should keep the default (5)
+
+## Error retry strategy
+
+Configure how Nextflow handles checkpoint failures. This is the recommended approach for retry logic.
+
+```groovy
+process {
+    maxRetries = 2
+    errorStrategy = {
+        if (task.exitStatus == 175) {
+            return 'retry'  // Retry on checkpoint dump failure
+        } else {
+            return 'terminate'
+        }
+    }
+}
+```
+
+**Exit codes**:
+- `175`: Checkpoint dump failed
+- `176`: Restore failed
+
+**Why this is better**: An error strategy gives you fine-grained control over retry behavior per exit code, while `maxSpotAttempts` retries all Spot reclamations regardless of the cause.
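+
+If restore failures should be retried as well, the closure can match both exit codes listed above. This is a minimal sketch, assuming that a failed restore (`176`) is worth retrying in your pipeline; that choice is an assumption, not a documented recommendation.
+
+```groovy
+process {
+    maxRetries = 2
+    // Retry on dump (175) or restore (176) failure; terminate on any other exit code
+    errorStrategy = { task.exitStatus in [175, 176] ? 'retry' : 'terminate' }
+}
+```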
+
+## TCP connection handling
+
+Control how TCP connections are handled during checkpoint operations.
+
+**Default**: `established` (preserve TCP connections)
+
+```groovy
+process.containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close'
+```
+
+**Options**:
+- `established`: Preserve TCP connections (default, for plain TCP)
+- `close`: Close all TCP connections during checkpoint
+
+**When to use**: Set to `close` if your application uses encrypted connections such as TLS (HTTPS) or SSH, because CRIU cannot preserve encrypted connections.
+
+## Debug logging
+
+Enable detailed logging for troubleshooting.
+
+**Default**: `WARN` (warnings and errors only)
+
+```groovy
+process.containerOptions = '-e FUSION_SNAPSHOT_LOG_LEVEL=debug'
+```
+
+**Log levels**:
+- `ERROR`: Only critical errors
+- `WARN`: Warnings and errors (default)
+- `INFO`: General informational messages
+- `DEBUG`: Detailed debug information
+
+**When to use**: Only for troubleshooting checkpoint issues. Debug logs are verbose and may impact performance.
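+
+Because `process.containerOptions` holds a single string, a second assignment replaces the first. To apply the TCP mode and the log level together, both variables can be passed in one value. The sketch below reuses the variable names exactly as they appear in this guide:
+
+```groovy
+// One assignment carrying both environment variables
+process.containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close -e FUSION_SNAPSHOT_LOG_LEVEL=debug'
+```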
+
+## Resource limits
+
+Prevent tasks from requesting resources that exceed available capacity.
+
+```groovy
+// AWS Batch example
+process.resourceLimits = [cpus: 32, memory: '60.GB']
+
+// Google Batch example (more conservative for 30s window)
+process.resourceLimits = [cpus: 16, memory: '20.GB']
+```
+
+**When to use**:
+- Prevent jobs from becoming unschedulable
+- Ensure tasks can checkpoint within reclamation windows
+- Enforce organization-wide resource policies
+
+See [AWS Batch](aws.md#resource-limits) or [Google Batch](gcp.md#resource-limits) for recommended memory limits.
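+
+A later assignment to `process.resourceLimits` replaces an earlier one, so the AWS and Google examples in this section cannot both be active in the same configuration. One way to keep per-provider values side by side is to use Nextflow configuration profiles. The sketch below assumes hypothetical profile names (`aws_spot`, `gcp_spot`) and reuses the limits and default retry count from this guide:
+
+```groovy
+profiles {
+    // Hypothetical profile for AWS Batch (120-second reclamation window)
+    aws_spot {
+        aws.batch.maxSpotAttempts = 5
+        process.resourceLimits = [cpus: 32, memory: 60.GB]
+    }
+    // Hypothetical profile for Google Batch (up to 30-second window)
+    gcp_spot {
+        google.batch.maxSpotAttempts = 5
+        process.resourceLimits = [cpus: 16, memory: 20.GB]
+    }
+}
+```
+
+Select a profile at launch with the `-profile` option, for example `-profile aws_spot`.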
+
+## Summary
+
+Most users never need to modify these settings. Fusion Snapshots are designed to work optimally with default configuration. Only adjust these settings if:
+
+- You have specific organizational policies
+- You're experiencing issues with the default behavior
+- You have edge case requirements not covered by defaults
+
+For most troubleshooting scenarios, focus on task memory usage and instance selection rather than configuration changes.
@@ -0,0 +1,51 @@
+---
+title: Fusion Snapshots for Google Batch
+description: "Fusion Snapshots configuration and best practices for Google Batch"
+date: "21 Nov 2024"
+tags: [fusion, storage, compute, snapshot, gcp, google, batch]
+---
+
+Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on Google Batch preemptible instances. When a preemption occurs, the task is checkpointed and restored on a new instance.
+
+:::warning Short and variable reclamation window
+Google Batch provides **up to 30 seconds** before instance termination. This window is not guaranteed and can be shorter, and it is significantly shorter than AWS Batch's guaranteed 120 seconds. Careful instance selection and conservative memory planning are critical for successful checkpoints.
+:::
+
+## Requirements
+
+### Seqera Platform compute environment
+
+- **Provider**: Google Batch
+- **Work directory**: GCS bucket in the same region as compute resources
+- **Enable Fusion Snapshots (beta)**
+- **Provisioning model**: Spot
+
+:::tip Configuration
+Fusion Snapshots work with sensible defaults (5 automatic retry attempts). For advanced configuration options like changing retry behavior or TCP handling, see the [Configuration Guide](configuration.md).
+:::
+
+## Incremental snapshots
+
+Incremental snapshots are enabled by default on x86_64 instances, capturing only changed memory pages between checkpoints. This is particularly beneficial for Google Batch's shorter reclamation window.
+
+### Key guidelines
+
+- Use **x86_64** instances (incremental snapshots enabled by default)
+
+
+### Resource limits
+
+A single job can request more resources than a single instance provides, in which case it cannot be scheduled. To prevent this, cap the resources a process can request with the `process.resourceLimits` directive in your Nextflow configuration. For example:
+
+
+
+```groovy
+process.resourceLimits = [cpus: 16, memory: '20.GB']
+```
+
+## Troubleshooting
+
+- **Exit code 175**: Dump failed, most likely because of a timeout: the task used more memory than could be dumped within the reclamation window.
+- **Exit code 176**: Restore failed. Check logs and verify checkpoint data integrity.
+
+For detailed troubleshooting, see the [Troubleshooting Guide](troubleshooting.md).