Make snapshot default spot #11

Open · wants to merge 3 commits into base: main
49 changes: 35 additions & 14 deletions 02_setup_compute/README.md

This directory contains YAML configuration files for the creation of two compute environments:

- `aws_fusion_nvme.yml`: This compute environment is designed to run on Amazon Web Services (AWS) Batch and uses Fusion V2 on SPOT instances with 6th-generation Intel instance types with NVMe storage and the Fusion Snapshots feature activated. Fusion Snapshots is a new Fusion feature that lets you snapshot and restore your machine when a spot interruption occurs.
- `aws_plain_s3.yml`: This compute environment is designed to run on Amazon Web Services (AWS) Batch and uses plain AWS Batch with S3 storage.

These YAML files provide best practice configurations for utilizing these two storage types in AWS Batch compute environments. The Fusion V2 configuration is tailored for high-performance workloads leveraging NVMe storage, while the plain S3 configuration offers a standard setup for comparison and workflows that don't require the advanced features of Fusion V2.
- You have an S3 bucket for the Nextflow work directory.
- You have reviewed and updated the environment variables in [env.sh](../01_setup_environment/env.sh) to match your specific AWS setup.

### Using existing manual AWS queues in your compute environments

#### Setting manual queues during CE creation with seqerakit

If you are not standing up your compute queues with Batch Forge but using a manual setup approach, you will need to modify your YAML configurations: change `config-mode: forge` to `config-mode: manual` and add the following lines, pointing to your specific queues, to the YAML files.

```yaml
head-queue: "myheadqueue-head"
compute-queue: "mycomputequeue-work"
```

Please note that with manual queues, the resource labels must already be attached to your queues; setting them on the Seqera Platform during CE creation will not work.
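
Putting this together, a minimal sketch of a manual-mode compute environment definition might look like the following (assuming your head and compute queues already exist in AWS Batch; the queue names are the placeholders from above, the remaining fields reuse the variables described later in this README, and Fusion-related options from the Forge example below can be added as needed):

```yaml
compute-envs:
  - type: aws-batch
    config-mode: manual                    # instead of: forge
    name: "${COMPUTE_ENV_PREFIX}_fusion_snapshots"
    workspace: "$ORGANIZATION_NAME/$WORKSPACE_NAME"
    credentials: "$AWS_CREDENTIALS"
    region: "$AWS_REGION"
    work-dir: "$AWS_WORK_DIR"
    head-queue: "myheadqueue-head"         # your existing head queue
    compute-queue: "mycomputequeue-work"   # your existing compute queue
    wait: "AVAILABLE"
    overwrite: False
```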

#### Manually setting the launch template for Fusion

If you are not using Batch Forge to set up your queues, you will also have to manually set the launch template for your instances in your Fusion queues. To do this, add the provided [Fusion launch template](./fusion_launch_template.txt) to your AWS Batch account, then clone your existing AWS compute environment and, during the instance configuration step, choose the Fusion launch template you created.

### YAML format description

#### 1. Environment Variables in the YAML

Using these variables allows easy customization of the compute environment configuration.

#### 2. Fusion V2 Compute Environment

Fusion Snapshots is a new Fusion feature that lets you snapshot and restore your machine when a spot interruption occurs. If we inspect the contents of [`aws_fusion_snapshots.yml`](./compute-envs/aws_fusion_snapshots.yml) as an example, we can see the overall structure is as follows:

```yaml
compute-envs:
  - type: aws-batch
    config-mode: forge
    name: "${COMPUTE_ENV_PREFIX}_fusion_snapshots"
    workspace: "$ORGANIZATION_NAME/$WORKSPACE_NAME"
    credentials: "$AWS_CREDENTIALS"
    region: "$AWS_REGION"
    work-dir: "$AWS_WORK_DIR"
    wave: True
    fusion-v2: True
    fast-storage: True
    snapshots: True
    no-ebs-auto-scale: True
    provisioning-model: "SPOT"
    instance-types: "c6id.4xlarge,c6id.8xlarge,r6id.2xlarge,m6id.4xlarge,c6id.12xlarge,r6id.4xlarge,m6id.8xlarge"
    max-cpus: 1000
    allow-buckets: "$AWS_COMPUTE_ENV_ALLOWED_BUCKETS"
    labels: "storage=fusionv2,project=benchmarking"
    wait: "AVAILABLE"
    overwrite: False
```

> **Reviewer comment (Collaborator)** on `snapshots: True`: neato burrito
<details>
<summary>Click to expand: YAML format explanation</summary>

Note that it is very similar to the Fusion V2 compute environment, but with the following differences:

> **Reviewer comment (Collaborator):** Does this make sense anymore if we're ONLY using Fusion snapshots?

- `provisioning-model` is set to `SPOT` to enable the use of spot instances.
- `snapshots` is set to `True` to allow Fusion to automatically restore a job interrupted by a spot reclamation.
- `instance-types` is set to a restrictive set of types that have sufficient memory and bandwidth to snapshot the machine within the time limit imposed by AWS during a spot reclamation event.

</details>
Note: When setting `snapshots: True`, Fusion, Wave, and fast instance storage are enabled by default for the CE. We have set them to `True` explicitly here for documentation purposes and consistency.

> **Reviewer comment (Collaborator):** Is that true or a front end thing? I would rephrase it to be fast storage and Wave are required for Fusion.

#### Pre-configured Options in the YAML

We've pre-configured several options to optimize your Fusion snapshots compute environment:

| Option | Value | Purpose |
|--------|-------|---------|
| `wave` | `True` | Enables Wave, required for Fusion in containerized workloads |
| `fusion-v2` | `True` | Enables Fusion V2 |
| `fast-storage` | `True` | Enables fast instance storage with Fusion V2 for optimal performance |
| `snapshots` | `True` | Enables automatic snapshot creation and restoration for spot instance interruptions |
| `no-ebs-auto-scale` | `True` | Disables EBS auto-expandable disks (incompatible with Fusion V2) |
| `provisioning-model` | `"SPOT"` | Selects the cost-effective spot pricing model |
| `instance-types` | `"c6id.4xlarge,c6id.8xlarge,`<br>`r6id.2xlarge,m6id.4xlarge,`<br>`c6id.12xlarge,r6id.4xlarge,`<br>`m6id.8xlarge"` | Selects instance types with a small enough memory footprint and a fast enough network to snapshot the machine within the time limit imposed by AWS during a spot reclamation event |
| `max-cpus` | `1000` | Sets the maximum number of CPUs for this compute environment |

> **Reviewer suggestion (Collaborator):** shorten the `instance-types` purpose to "Selects instance types with small memory and fast network to snapshot within AWS's time limit during spot reclamation."

These options ensure your Fusion V2 compute environment is optimized for compatibility with the snapshot feature.

> **Reviewer suggestion (Collaborator):** simplify to "These options ensure your Fusion V2 compute environment is optimized."

#### 3. Plain S3 Compute Environment

Similarly, if we inspect the contents of [`aws_plain_s3.yml`](./compute-envs/aws_plain_s3.yml) as an example, we can see the overall structure is as follows:
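
The YAML body itself is collapsed in this diff view. As an illustrative sketch only (not the file's verbatim contents, and with a hypothetical `name` value), its structure mirrors the Fusion example above minus the Fusion-specific flags:

```yaml
compute-envs:
  - type: aws-batch
    config-mode: forge
    name: "${COMPUTE_ENV_PREFIX}_plain_s3"   # hypothetical name
    workspace: "$ORGANIZATION_NAME/$WORKSPACE_NAME"
    credentials: "$AWS_CREDENTIALS"
    region: "$AWS_REGION"
    work-dir: "$AWS_WORK_DIR"
    provisioning-model: "SPOT"
    max-cpus: 1000
    allow-buckets: "$AWS_COMPUTE_ENV_ALLOWED_BUCKETS"
    labels: "storage=plains3,project=benchmarking"
    wait: "AVAILABLE"
    overwrite: False
```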

We will additionally use process-level labels for further granularity.
To add labels to your compute environment:

1. In the YAML file, locate the `labels` field.
2. Add your desired labels as a comma-separated list of key-value pairs. We have pre-populated this with the `storage=fusion|plains3` and `project=benchmarking` labels for better organization. If you have a pre-existing label, you can use it here as well; for example, if you have previously used the `project` label and it is activated in AWS, you could use `project=fusion_poc_plainS3CE` and `project=fusion_poc_fusionCE` to distinguish the two compute environments, as sketched below.
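
For instance, with hypothetical values reusing an already-activated `project` tag key:

```yaml
# In the plain S3 compute environment YAML:
labels: "project=fusion_poc_plainS3CE"

# In the Fusion compute environment YAML:
labels: "project=fusion_poc_fusionCE"
```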

### Networking
If your compute environments require a custom networking setup using a custom VPC, subnets, and security groups, these can be added as additional YAML fields, as sketched below.
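
For example, fields like the following could be appended to a compute environment entry (the option names mirror the `tw compute-envs add aws-batch` CLI flags; the IDs are placeholders):

```yaml
    vpc-id: "vpc-0123456789abcdef0"
    subnets: "subnet-aaaa1111,subnet-bbbb2222"
    security-groups: "sg-0123456789abcdef0"
```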
1 change: 1 addition & 0 deletions 02_setup_compute/compute-envs/aws_fusion_nvme.yml

```yaml
    wave: True
    fusion-v2: True
    fast-storage: True
    snapshots: True
    no-ebs-auto-scale: True
    provisioning-model: "SPOT"
    instance-types: "c6id,m6id,r6id"
```

83 changes: 83 additions & 0 deletions 02_setup_compute/fusion_launch_template.txt

```text
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"

--//
Content-Type: text/cloud-config; charset="us-ascii"

#cloud-config
write_files:
  - path: /root/tower-forge.sh
    permissions: '0744'
    owner: root
    content: |
      #!/usr/bin/env bash
      ## Stop the ECS agent if running and log all output to the console
      systemctl stop ecs
      exec > >(tee /var/log/tower-forge.log|logger -t BatchForge -s 2>/dev/console) 2>&1
      ## Install dependencies and the CloudWatch agent
      yum install -q -y jq sed wget unzip nvme-cli lvm2
      curl -s https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm -o amazon-cloudwatch-agent.rpm
      rpm -U ./amazon-cloudwatch-agent.rpm
      rm -f ./amazon-cloudwatch-agent.rpm
      curl -s https://nf-xpack.seqera.io/amazon-cloudwatch-agent/config-v0.4.json \
        | sed 's/$FORGE_ID/ambry-example/g' \
        > /opt/aws/amazon-cloudwatch-agent/bin/config.json
      /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
        -a fetch-config \
        -m ec2 \
        -s \
        -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json
      ## Format the NVMe instance-store disks (combined via LVM if more than one)
      ## and mount them at /scratch/fusion
      mkdir -p /scratch/fusion
      NVME_DISKS=($(nvme list | grep 'Amazon EC2 NVMe Instance Storage' | awk '{ print $1 }'))
      NUM_DISKS=${#NVME_DISKS[@]}
      if (( NUM_DISKS > 0 )); then
        if (( NUM_DISKS == 1 )); then
          mkfs -t xfs ${NVME_DISKS[0]}
          mount ${NVME_DISKS[0]} /scratch/fusion
        else
          pvcreate ${NVME_DISKS[@]}
          vgcreate scratch_fusion ${NVME_DISKS[@]}
          lvcreate -l 100%FREE -n volume scratch_fusion
          mkfs -t xfs /dev/mapper/scratch_fusion-volume
          mount /dev/mapper/scratch_fusion-volume /scratch/fusion
        fi
      fi
      chmod a+w /scratch/fusion
      ## Tune ECS agent behaviour: cache images, enable spot draining,
      ## and extend container timeouts
      mkdir -p /etc/ecs
      echo ECS_IMAGE_PULL_BEHAVIOR=once >> /etc/ecs/ecs.config
      echo ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE=true >> /etc/ecs/ecs.config
      echo ECS_ENABLE_SPOT_INSTANCE_DRAINING=true >> /etc/ecs/ecs.config
      echo ECS_CONTAINER_CREATE_TIMEOUT=10m >> /etc/ecs/ecs.config
      echo ECS_CONTAINER_START_TIMEOUT=10m >> /etc/ecs/ecs.config
      echo ECS_CONTAINER_STOP_TIMEOUT=10m >> /etc/ecs/ecs.config
      echo ECS_MANIFEST_PULL_TIMEOUT=10m >> /etc/ecs/ecs.config
      systemctl stop docker
      ## Install AWS CLI v2
      curl "https://awscli.amazonaws.com/awscli-exe-linux-$(arch).zip" -o "awscliv2.zip"
      unzip -q awscliv2.zip
      sudo ./aws/install
      ## Expand the EBS boot volume to a 100 GB gp3 volume and grow the root filesystem
      TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
      INSTANCEID=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
      X_ZONE=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -fs http://169.254.169.254/latest/meta-data/placement/availability-zone)
      AWS_DEFAULT_REGION=$(echo "$X_ZONE" | sed 's/[a-z]$//')
      VOLUMEID=$(aws --region $AWS_DEFAULT_REGION ec2 describe-instances --instance-id $INSTANCEID | jq -r .Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId)
      aws --region $AWS_DEFAULT_REGION ec2 modify-volume --volume-id $VOLUMEID --size 100 --volume-type gp3 --throughput 325
      ## Wait with exponential backoff for the volume modification to take effect
      i=1; until [ "$(aws --region $AWS_DEFAULT_REGION ec2 describe-volumes-modifications --volume-id $VOLUMEID --filters Name=modification-state,Values="optimizing","completed" | jq '.VolumesModifications | length')" == "1" ] || [ $i -eq 256 ]; do
        sleep $i
        i=$(( i * 2 ))
      done
      if [ $i -eq 256 ]; then
        echo "ERROR expanding EBS boot disk size"
        aws --region $AWS_DEFAULT_REGION ec2 describe-volumes-modifications --volume-id $VOLUMEID
      fi
      growpart /dev/xvda 1
      xfs_growfs -d /
      systemctl start docker
      systemctl enable --now --no-block ecs
      ## Raise kernel dirty-page writeback thresholds for sustained I/O
      echo "1258291200" > /proc/sys/vm/dirty_bytes
      echo "629145600" > /proc/sys/vm/dirty_background_bytes

runcmd:
  - bash /root/tower-forge.sh

--//--
```

1 change: 1 addition & 0 deletions 03_setup_pipelines/README.md
- You have set up a Fusion V2 and plain S3 compute environment in the Seqera Platform in the [previous section](../02_setup_compute/README.md).
- You have created an S3 bucket for saving the workflow outputs.
- For effective use of resource labels, you have set up Split Cost Allocation tracking in your AWS account and activated the tags as mentioned in [this guide](../docs/assets/aws-split-cost-allocation-guide.md).
- **Exception**: If you cannot activate the resource labels we suggest here but can use existing resource labels, make sure you have set individual, unique resource labels for both the plain S3 and Fusion compute environments (see [02_setup_compute](../02_setup_compute/README.md#Appendix) for details).
- If using private repositories, you have added your GitHub (or other VCS provider) credentials to the Seqera Platform workspace.
- You have reviewed and updated the environment variables in [env.sh](../01_setup_environment/env.sh) to match your specific Platform setup.

11 changes: 0 additions & 11 deletions 03_setup_pipelines/pipelines/nextflow.config

```groovy
process {
    resourceLabels = {[
        uniqueRunId: System.getenv("TOWER_WORKFLOW_ID"),
        pipelineProcess: task.process.toString(),
        pipelineTag: task.tag.toString(),
        pipelineCPUs: task.cpus.toString(),
        pipelineMemory: task.memory.toString(),
        pipelineTaskAttempt: task.attempt.toString(),
        pipelineContainer: task.container.toString(),
        taskHash: task.hash.toString(),
        pipelineUser: workflow.userName.toString(),
        pipelineRunName: workflow.runName.toString(),
        pipelineSessionId: workflow.sessionId.toString(),
        pipelineResume: workflow.resume.toString(),
        pipelineRevision: workflow.revision.toString(),
        pipelineCommitId: workflow.commitId.toString(),
        pipelineRepository: workflow.repository.toString(),
        pipelineName: workflow.manifest.name.toString()
    ]}
}
```

5 changes: 4 additions & 1 deletion 05_generate_report/README.md

The YAML configurations utilize environment variables defined in the `env.sh` file.

Besides these environment variables, there are a few Nextflow parameters that need to be configured based on your setup. Go directly into `./pipelines/nextflow.config` and modify the following variables:

1) If you are an enterprise customer, please change `seqera_api_endpoint` to your Seqera Platform deployment URL. The person who set up your Enterprise deployment will know this address.

2) Set `benchmark_aws_cur_report` to the AWS CUR report containing the cost information for your runs. You can provide the direct S3 path to this file if your credentials in Seqera Platform have access to it. Otherwise, please upload the parquet report to an S3 bucket accessible by the AWS credentials associated with your compute environment.

> **Exception**: If you cannot use the resource labels we suggested, leave `benchmark_aws_cur_report` set to `null` and compile the report without task-level costs. The cost comparison will be done at the pipeline level via your Cost Explorer access.

> **Note**: If you are using a Seqera Platform Enterprise instance that is secured with a private CA SSL certificate not recognized by the default Java certificate authorities, you will need to amend the params section in the [nf-aggregate.yml](../launch/nf-aggregate-launch.yml) file before running the above seqerakit command, to specify a custom cacerts store path through `--java_truststore_path` and, optionally, a password with the `--java_truststore_password` pipeline parameter, as sketched below. This certificate will be used to achieve connectivity with your Seqera Platform instance through the API and CLI.
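
As a sketch, the amended `params` section of the launch YAML might look like this (the truststore path and password are placeholders; adapt the layout to your existing file):

```yaml
    params:
      java_truststore_path: "/path/to/custom/cacerts"
      java_truststore_password: "changeit"   # optional
```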

### 4. Add the samplesheet to Seqera Platform
To add the samplesheet to Seqera Platform, run the following command:
2 changes: 1 addition & 1 deletion 05_generate_report/pipelines/nextflow.config

```groovy
params {
    seqera_api_endpoint = 'https://api.cloud.seqera.io'
    generate_benchmark_report = true
    benchmark_aws_cur_report = null
    remove_failed_tasks = false
}
```
1 change: 0 additions & 1 deletion 05_generate_report/pre-run.txt

This file was deleted.
