# Make snapshot default spot #11
This directory contains YAML configuration files for the creation of two compute environments:

- `aws_fusion_nvme.yml`: This compute environment is designed to run on Amazon Web Services (AWS) Batch and uses Fusion V2 on SPOT instances with 6th-generation Intel instance types with NVMe storage and the Fusion snapshot feature activated. Fusion snapshots is a feature that allows Fusion to snapshot and restore your machine when a spot interruption occurs.
- `aws_plain_s3.yml`: This compute environment is designed to run on Amazon Web Services (AWS) Batch and uses plain AWS Batch with S3 storage.

These YAML files provide best practice configurations for utilizing these two storage types in AWS Batch compute environments. The Fusion V2 configuration is tailored for high-performance workloads leveraging NVMe storage, while the plain S3 configuration offers a standard setup for comparison and for workflows that don't require the advanced features of Fusion V2.
- You have an S3 bucket for the Nextflow work directory.
- You have reviewed and updated the environment variables in [env.sh](../01_setup_environment/env.sh) to match your specific AWS setup.

### Using existing manual AWS queues in your compute environments

#### Setting manual queues during CE creation with seqerakit

If you are not standing up your compute queues with Batch Forge but use a manual setup approach, you will need to modify your YAML configurations: change `config-mode: forge` to `config-mode: manual` and add the following lines, pointing to your specific queues, to the YAML files.
```yaml
head-queue: "myheadqueue-head"
compute-queue: "mycomputequeue-work"
```

Please note that with manual queues, the resource labels must already be attached to your queues; setting them on the Seqera Platform during CE creation will not work.
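Putting the pieces above together, a manual-mode compute environment definition might look roughly like the following sketch. The queue names and the `_fusion_manual` suffix are placeholders for illustration, not tested configuration:

```yaml
compute-envs:
  - type: aws-batch
    config-mode: manual                  # instead of "forge"
    name: "${COMPUTE_ENV_PREFIX}_fusion_manual"   # hypothetical name
    workspace: "$ORGANIZATION_NAME/$WORKSPACE_NAME"
    credentials: "$AWS_CREDENTIALS"
    region: "$AWS_REGION"
    work-dir: "$AWS_WORK_DIR"
    head-queue: "myheadqueue-head"       # your existing head queue
    compute-queue: "mycomputequeue-work" # your existing compute queue
    wait: "AVAILABLE"
```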
#### Manually setting the launch template for Fusion

If you are not using Batch Forge to set up your queues, you will also have to manually set the launch template for the instances in your Fusion queues. To do this, add the [Fusion launch template](./fusion_launch_template.txt) we provide to your AWS Batch account, then clone your existing AWS compute environment and, during the Instance configuration step, choose the Fusion launch template you created.

### YAML format description

#### 1. Environment Variables in the YAML
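As a concrete illustration of how these variables are expanded, the sketch below exports hypothetical values (the variable names come from the YAML files in this directory; the values are made up) and shows the strings that result:

```shell
# Hypothetical values for the variables referenced in the YAML files;
# in the real setup they come from 01_setup_environment/env.sh.
export COMPUTE_ENV_PREFIX="benchmark"
export ORGANIZATION_NAME="my-org"
export WORKSPACE_NAME="benchmarking"

# "${COMPUTE_ENV_PREFIX}_fusion_snapshots" expands to:
echo "${COMPUTE_ENV_PREFIX}_fusion_snapshots"   # benchmark_fusion_snapshots

# "$ORGANIZATION_NAME/$WORKSPACE_NAME" expands to:
echo "$ORGANIZATION_NAME/$WORKSPACE_NAME"       # my-org/benchmarking
```

Note the braces: without them, `$COMPUTE_ENV_PREFIX_fusion_nvme` would be parsed as one long (undefined) variable name, which is why the compute environment name uses the explicit `${COMPUTE_ENV_PREFIX}` form.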
#### 2. Fusion V2 Compute Environment

Fusion snapshots is a feature that allows Fusion to snapshot and restore your machine when a spot interruption occurs. If we inspect the contents of [`aws_fusion_snapshots.yml`](./compute-envs/aws_fusion_snapshots.yml) as an example, we can see the overall structure is as follows:
```yaml
compute-envs:
  - type: aws-batch
    config-mode: forge
    name: "${COMPUTE_ENV_PREFIX}_fusion_snapshots"
    workspace: "$ORGANIZATION_NAME/$WORKSPACE_NAME"
    credentials: "$AWS_CREDENTIALS"
    region: "$AWS_REGION"
    work-dir: "$AWS_WORK_DIR"
    wave: True
    fusion-v2: True
    fast-storage: True
    snapshots: True
    no-ebs-auto-scale: True
    provisioning-model: "SPOT"
    instance-types: "c6id.4xlarge,c6id.8xlarge,r6id.2xlarge,m6id.4xlarge,c6id.12xlarge,r6id.4xlarge,m6id.8xlarge"
    max-cpus: 1000
    allow-buckets: "$AWS_COMPUTE_ENV_ALLOWED_BUCKETS"
    labels: "storage=fusionv2,project=benchmarking"
    wait: "AVAILABLE"
    overwrite: False
```
<details>
<summary>Click to expand: YAML format explanation</summary>

The top-level block `compute-envs` mirrors the `tw compute-envs` command. The `type` and `config-mode` options are seqerakit-specific. The nested options in the YAML correspond to options available for the Seqera Platform CLI command. For example, running `tw compute-envs add aws-batch forge --help` shows options like `--name`, `--workspace`, `--credentials`, etc., which are provided to the `tw compute-envs` command via this YAML definition.

Note that this configuration is very similar to the Fusion V2 compute environment, but with the following differences:

- `provisioning-model` is set to `SPOT` to enable the use of spot instances.
- `snapshots` is set to `True` to allow Fusion to automatically restore a job if it is interrupted by spot reclamation.
- `instance-types` is set to a restrictive set of types that have sufficient memory and network bandwidth to snapshot the machine within the time limit imposed by AWS during a spot reclamation event.

</details>

Note: When setting `snapshots: True`, Fusion, Wave, and fast instance storage are enabled by default for the CE. We have set them to `True` here explicitly for documentation purposes and consistency.
#### Pre-configured Options in the YAML

We've pre-configured several options to optimize your Fusion snapshots compute environment:

| Option | Value | Purpose |
|--------|-------|---------|
| `wave` | `True` | Enables Wave, required for Fusion in containerized workloads |
| `fusion-v2` | `True` | Enables Fusion V2 |
| `fast-storage` | `True` | Enables fast instance storage with Fusion V2 for optimal performance |
| `snapshots` | `True` | Enables automatic snapshot creation and restoration for spot instance interruptions |
| `no-ebs-auto-scale` | `True` | Disables EBS auto-expandable disks (incompatible with Fusion V2) |
| `provisioning-model` | `"SPOT"` | Selects the cost-effective spot pricing model |
| `instance-types` | `"c6id.4xlarge,c6id.8xlarge,`<br>`r6id.2xlarge,m6id.4xlarge,`<br>`c6id.12xlarge,r6id.4xlarge,`<br>`m6id.8xlarge"` | Selects instance types with a small enough memory footprint and fast enough network to snapshot the machine within the time limit imposed by AWS during a spot reclamation event |
| `max-cpus` | `1000` | Sets the maximum number of CPUs for this compute environment |

These options ensure your compute environment is optimized for compatibility with the snapshot feature.
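The instance-type restriction comes down to simple arithmetic: AWS gives roughly a two-minute interruption notice for spot instances, so the instance's memory must fit through its network link within that window. A back-of-envelope sketch, where the memory and bandwidth figures are purely illustrative rather than exact instance specifications:

```shell
# Rough estimate: seconds needed to move mem_gib GiB of state over a
# net_gbps Gbps link (1 GiB ~ 8 Gbit; integer arithmetic, ignores overhead).
snapshot_seconds() {
  mem_gib=$1
  net_gbps=$2
  echo $(( mem_gib * 8 / net_gbps ))
}

snapshot_seconds 128 12   # ~85 s: plausibly fits a ~120 s spot notice
snapshot_seconds 512 12   # ~341 s: far too large to snapshot in time
```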

#### 3. Plain S3 Compute Environment

Similarly, if we inspect the contents of [`aws_plain_s3.yml`](./compute-envs/aws_plain_s3.yml) as an example, we can see the overall structure is as follows:
To add labels to your compute environment:

1. In the YAML file, locate the `labels` field.
2. Add your desired labels as a comma-separated list of key-value pairs. We have pre-populated this with the `storage=fusion|plains3` and `project=benchmarking` labels for better organization. If you have a pre-existing label, you can use it here as well. For example, if you have previously used the `project` label and it is activated in AWS, you could use `project=fusion_poc_plainS3CE` and `project=fusion_poc_fusionCE` to distinguish the two compute environments.
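Reusing an already-activated `project` label could then look like this in the two YAML files (hypothetical values, following the example above):

```yaml
# In aws_plain_s3.yml:
labels: "storage=plains3,project=fusion_poc_plainS3CE"

# In the Fusion YAML:
labels: "storage=fusionv2,project=fusion_poc_fusionCE"
```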
### Networking

If your compute environments require a custom networking setup using a custom VPC, subnets, and security groups, these can be added as additional YAML fields.

---

**fusion_launch_template.txt** (the [Fusion launch template](./fusion_launch_template.txt) referenced above):

```
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"

--//
Content-Type: text/cloud-config; charset="us-ascii"

#cloud-config
write_files:
  - path: /root/tower-forge.sh
    permissions: 0744
    owner: root
    content: |
      #!/usr/bin/env bash
      ## Stop the ECS agent if running
      systemctl stop ecs
      exec > >(tee /var/log/tower-forge.log|logger -t BatchForge -s 2>/dev/console) 2>&1
      ## Install required tooling
      yum install -q -y jq sed wget unzip nvme-cli lvm2
      ## Install and configure the CloudWatch agent
      curl -s https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm -o amazon-cloudwatch-agent.rpm
      rpm -U ./amazon-cloudwatch-agent.rpm
      rm -f ./amazon-cloudwatch-agent.rpm
      curl -s https://nf-xpack.seqera.io/amazon-cloudwatch-agent/config-v0.4.json \
        | sed 's/$FORGE_ID/ambry-example/g' \
        > /opt/aws/amazon-cloudwatch-agent/bin/config.json
      /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
        -a fetch-config \
        -m ec2 \
        -s \
        -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json
      ## Format the NVMe instance storage and mount it at /scratch/fusion:
      ## a single disk is formatted directly, multiple disks are pooled via LVM
      mkdir -p /scratch/fusion
      NVME_DISKS=($(nvme list | grep 'Amazon EC2 NVMe Instance Storage' | awk '{ print $1 }'))
      NUM_DISKS=$${#NVME_DISKS[@]}
      if (( NUM_DISKS > 0 )); then
        if (( NUM_DISKS == 1 )); then
          mkfs -t xfs $${NVME_DISKS[0]}
          mount $${NVME_DISKS[0]} /scratch/fusion
        else
          pvcreate $${NVME_DISKS[@]}
          vgcreate scratch_fusion $${NVME_DISKS[@]}
          lvcreate -l 100%FREE -n volume scratch_fusion
          mkfs -t xfs /dev/mapper/scratch_fusion-volume
          mount /dev/mapper/scratch_fusion-volume /scratch/fusion
        fi
      fi
      chmod a+w /scratch/fusion
      ## ECS agent configuration
      mkdir -p /etc/ecs
      echo ECS_IMAGE_PULL_BEHAVIOR=once >> /etc/ecs/ecs.config
      echo ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE=true >> /etc/ecs/ecs.config
      echo ECS_ENABLE_SPOT_INSTANCE_DRAINING=true >> /etc/ecs/ecs.config
      echo ECS_CONTAINER_CREATE_TIMEOUT=10m >> /etc/ecs/ecs.config
      echo ECS_CONTAINER_START_TIMEOUT=10m >> /etc/ecs/ecs.config
      echo ECS_CONTAINER_STOP_TIMEOUT=10m >> /etc/ecs/ecs.config
      echo ECS_MANIFEST_PULL_TIMEOUT=10m >> /etc/ecs/ecs.config
      systemctl stop docker
      ## Install AWS CLI v2
      curl "https://awscli.amazonaws.com/awscli-exe-linux-$(arch).zip" -o "awscliv2.zip"
      unzip -q awscliv2.zip
      sudo ./aws/install
      ## Expand the EBS boot volume to a 100 GB gp3 volume
      TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
      INSTANCEID=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
      X_ZONE=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -fs http://169.254.169.254/latest/meta-data/placement/availability-zone)
      AWS_DEFAULT_REGION="`echo \"$X_ZONE\" | sed 's/[a-z]$//'`"
      VOLUMEID=$(aws --region $AWS_DEFAULT_REGION ec2 describe-instances --instance-id $INSTANCEID | jq -r .Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId)
      aws --region $AWS_DEFAULT_REGION ec2 modify-volume --volume-id $VOLUMEID --size 100 --volume-type gp3 --throughput 325
      ## Wait (with exponential backoff) for the volume modification to finish
      i=1; until [ "$(aws --region $AWS_DEFAULT_REGION ec2 describe-volumes-modifications --volume-id $VOLUMEID --filters Name=modification-state,Values="optimizing","completed" | jq '.VolumesModifications | length')" == "1" ] || [ $i -eq 256 ]; do
        sleep $i
        i=$(( i * 2 ))
      done
      if [ $i -eq 256 ]; then
        echo "ERROR expanding EBS boot disk size"
        aws --region $AWS_DEFAULT_REGION ec2 describe-volumes-modifications --volume-id $VOLUMEID
      fi
      growpart /dev/xvda 1
      xfs_growfs -d /
      systemctl start docker
      systemctl enable --now --no-block ecs
      ## Raise dirty-page thresholds for heavy write workloads
      echo "1258291200" > /proc/sys/vm/dirty_bytes
      echo "629145600" > /proc/sys/vm/dirty_background_bytes

runcmd:
  - bash /root/tower-forge.sh

--//--
```
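The scratch-storage branch of the launch template can be sketched in isolation: one NVMe disk is formatted and mounted directly, while several disks are first pooled into a single LVM volume. In this sketch the destructive mkfs/mount/LVM commands are replaced with echoes so the decision logic can be followed without real devices:

```shell
# Simplified sketch of the launch template's scratch-storage decision;
# destructive commands are replaced with echo for illustration.
choose_scratch_layout() {
  num=$#
  if [ "$num" -eq 0 ]; then
    echo "no-instance-storage"
  elif [ "$num" -eq 1 ]; then
    echo "mkfs+mount $1 -> /scratch/fusion"
  else
    echo "lvm-pool $num disks -> /scratch/fusion"
  fi
}

choose_scratch_layout /dev/nvme1n1                  # single-disk path
choose_scratch_layout /dev/nvme1n1 /dev/nvme2n1     # LVM pooling path
```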