Ansible-based playbooks for the deployment and orchestration of the Conda Compute Cluster.
The Conda Compute Cluster (CCC) has been developed by ViCoS UL, FRI to enable deep learning researchers to migrate seamlessly between different GPU servers when working on specific projects. The main features of the Conda Compute Cluster are:
- Running multiple docker containers on different hosts simultaneously.
- Seamless transition from one host to another.
- SSH access to containers through reverse proxy (FRP proxy).
- Designed for running Conda environments on NVIDIA GPUs for deep learning research.
Containers are based on Conda Compute Containers that enable seamless transition from one host to another due to:
- Home folder mounted on common shared storage.
- Modification of non-home files is forbidden.
- Users can modify certain properties from within the container:
  - can modify the container image (must be based on vicos/ccc-base:latest)
  - can modify apt packages, repositories and sources installed at container boot
  - can modify on which hosts to deploy containers
- Pre-installed Miniconda on /home/USER/Conda.
Cluster management is done through a single Ansible script and enables deployment of the following features:
- Automatic deployment of containers upon change of config.
- Using NFS with FS-cache for shared storage.
- Management of local disk with ZFS.
- Hardware monitoring and management:
  - automatic management of system FANs based on GPU temperature when using SUPERMICRO servers (Superfans GPU Controller)
- monitoring of GPU and CPU reported as Prometheus metrics
- monitoring of GPU usage for automatic reservation using patroller
Two playbooks are available that deploy Conda Compute Cluster and Containers:
- cluster-deploy.yml: deployment of cluster infrastructure (network, docker, FRP client, ZFS, NFS, FS-Cache, HW monitoring, GPU fan controllers, etc.)
- containers-deploy.yml: deployment of compute containers based on Conda Compute Container (CCC) images
Run the following command to deploy the infrastructure:
ansible-playbook cluster-deploy.yml -i <path-to-inventory> \
--vault-password-file <path-to-secret> -e vars_file=<path-to-secret-vars-dir> \
-e machines=<node-or-group-pattern> \
-e only_roles=<list of roles>

You can specify the cluster definition in the supplied inventory folder; see sample-inventory for an example. Tasks are deployed on the nodes defined by -e machines=<node-or-group-pattern>.
By default all roles are executed in the order specified below. Deployment can be limited to specific roles by supplying -e only_roles=<list of roles>, where the value is a comma-separated list of role names:
- netplan: network interface definition using netplan
- docker: docker with pre-defined docker networks, repository logins and portainer agent for GUI management
- frp-client: FRP client for access to containers through the proxy server
- zfs: ZFS pools for local storage
- cachefilesd: FS-Cache for caching of the NFS storage into local scratch disks
- nfs-storage: NFS storage for shared storage (needed for a shared /home/user over all compute nodes)
- superfan-gpu: superfans GPU controller for regulating SYSTEM FANs based on GPU temperature
- monitoring-agent: HW monitoring for providing Prometheus metrics of CPU and GPUs
- compute-container-nightwatch: CCC nightwatch for providing automatic updates of the compute container upon changes to the Ansible config or user-supplied config
- patroller: GPU Patroller for an automatic GPU reservation system based on https://github.com/vicoslab/patroller
- sshd-hostkey: not an actual role but a minor task to deploy ssh-daemon keys for CCC containers
An example of how to provide the cluster configuration is in the sample-inventory folder, which includes:
- hosts definitions: your-cluster.yml with ccc-cluster as the main group of your cluster nodes
- cluster settings: group_vars/ccc-cluster/cluster-vars.yml
- cluster secrets: vault_vars/cluster-secrets.yml (requires --vault-password-file to unlock)
- host-specific settings: sample-inventory/host_vars
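For illustration, a minimal hosts definition in your-cluster.yml could look like the sketch below; the host names and addresses are invented and only the ccc-cluster group name follows the description above:

```yaml
# your-cluster.yml -- hypothetical minimal hosts definition
all:
  children:
    ccc-cluster:              # main group of your cluster nodes
      hosts:
        gpu-node-01:          # invented host names/addresses, replace with your own
          ansible_host: 192.168.1.11
        gpu-node-02:
          ansible_host: 192.168.1.12
```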
Cluster-wide settings contain the principal configuration of the whole cluster and are sectioned into settings for individual roles. These settings are used by both the cluster-deploy.yml and containers-deploy.yml playbooks.
Cluster secrets are stored in a separate vault_vars folder and should not be present in group_vars, so that containers-deploy.yml can be run without the vault secret. Secrets can instead be loaded for cluster deployment using -e vars_file=<path-to-secret-vars-dir>, which loads the vars only for the cluster-deploy.yml playbook.
Run the following command to deploy compute containers:
ansible-playbook containers-deploy.yml -i <path-to-inventory> \
-e machines=<node-or-group-pattern> \
-e containers=<list of STACK_NAME> \
-e users=<list of USER_EMAIL>

By default all containers are deployed!!
To limit the deployment to only specific containers, two additional filters can be used. For both filters, the provided value must be a comma-separated list in string format:
- -e containers=<list of STACK_NAME>: filters based on the containers' STACK_NAME value
- -e users=<list of USER_EMAIL>: filters based on the containers' USER_EMAIL value
The list of containers for deployment and the list of users need to be set in the inventory configuration:
- yaml variable deployment_containers: list of containers for deployment (e.g., see group_vars/ccc-cluster/user-containers.yml)
- yaml variable deployment_users: list of users for deployment (e.g., see group_vars/ccc-cluster/user-list.yml)
- yaml variable deployment_types: list of user types (e.g., see group_vars/ccc-cluster/user-list.yml)
Examples of these configurations are in the sample-inventory folder, which includes:
- list of containers for deployment as the deployment_containers var in group_vars/ccc-cluster/user-containers.yml
- list of users for deployment as the deployment_users var in group_vars/ccc-cluster/user-list.yml
- list of user types as the deployment_types var in group_vars/ccc-cluster/user-list.yml
Each container for deployment must be provided in the deployment_containers variable as a list of dictionaries with the following keys for each container:
- STACK_NAME: name of the compute container
- CONTAINER_IMAGE: container image that will be deployed (e.g., "registry.vicos.si/ccc-juypter:ubuntu18.04-cuda10.1")
- USER_EMAIL: user's email
- INSTALL_PACKAGES: additional apt packages that are installed at startup (registry.vicos.si/ccc-base:<...> images do not provide sudo access by default!!)
- INSTALL_REPOSITORY_KEYS: comma-separated list of links to fingerprint keys for installed repository sources (added using apt-key add)
- INSTALL_REPOSITORY_SOURCES: comma-separated list of repository sources (deb ... sources or ppa links that can be added using add-apt-repository)
- SHM_SIZE: shared memory settings
- FRP_PORTS: dict() with TCP and HTTP keys with info of the ports forwarded to the FRP server
  - TCP: a list of TCP ports as string values
  - HTTP: a list of HTTP ports as dict() objects with port, subdomain, pass (optional), health_check (optional) and subdomain_hostname_prefix (optional - bool) keys
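For illustration, a single deployment_containers entry might look like the following sketch; the key names come from the list above, while all values (and the exact format of the package/repository strings) are assumptions:

```yaml
# Hypothetical entry in group_vars/ccc-cluster/user-containers.yml (values are invented)
deployment_containers:
  - STACK_NAME: "jane-base"
    CONTAINER_IMAGE: "registry.vicos.si/ccc-base:latest"
    USER_EMAIL: "jane.doe@example.com"
    INSTALL_PACKAGES: "htop tmux"           # extra apt packages installed at startup (assumed format)
    INSTALL_REPOSITORY_KEYS: ""             # comma-separated links to repository fingerprint keys
    INSTALL_REPOSITORY_SOURCES: ""          # comma-separated deb/ppa sources
    SHM_SIZE: "8gb"
    FRP_PORTS:
      TCP: ["22"]                           # TCP ports as string values
      HTTP:
        - port: 8888
          subdomain: "jane-notebook"
```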
User information can be centralized in a separate file for quick reuse. Containers and users are matched based on emails. The following user information must be present within the deployment_users[<USER_EMAIL>] dictionary:
- USER_FULLNAME: user's first and last name
- USER_MENTOR: user's mentor (optional)
- USER_NAME: username for the OS
- USER_PUBKEY: SSH public key for access to the compute container
- USER_TYPE: user group/type that restricts network, nodes and GPU devices (groups/types are defined in the deployment_types key)
- ADDITIONAL_DEVICE_GROUPS: allowed additional device groups besides the ones defined by USER_TYPE
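A matching user entry in deployment_users might then look like the sketch below; the key names follow the list above and everything else is an assumption:

```yaml
# Hypothetical entry in group_vars/ccc-cluster/user-list.yml (values are invented)
deployment_users:
  "jane.doe@example.com":
    USER_FULLNAME: "Jane Doe"
    USER_MENTOR: "John Smith"                          # optional
    USER_NAME: "jane"
    USER_PUBKEY: "ssh-ed25519 AAAAC3Nza...example jane@laptop"
    USER_TYPE: "student"                               # must match a type defined in deployment_types
    ADDITIONAL_DEVICE_GROUPS: ""
```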
- setting docker repository login from config
- encrypted data for authentication settings
- can deploy compute-containers only to specific node groups (student or lab nodes) or to a specific node
- can control the deployment of compute-containers through config
- support for NVIDIA GPU driver installation
- performance-tuned NFS mount settings with FS-cache
- custom ZFS storage mounting
- IPMI FAN controller using NVIDIA GPU temperatures (designed for Supermicro servers)
- centralized storage of users (with their names, emails and PUBKEYs) in a single file
- loading of SSH pubkeys from GitHub
- Prometheus export for monitoring of the HW (for CPU and GPU - GPU utilization, temperature, etc.)
- users can provide custom settings inside the containers by editing the ~/.containers/<STACK_NAME>.yml file (a hypothetical example is sketched after this list)
- compute-container-nightwatch that monitors ~/.containers/<STACK_NAME>.yml files and redeploys them using ansible-pull
- constraining to specific GPUs based on device groups and user group
- redirection of container logging output to the user can be enabled
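As a rough illustration of the user-supplied config mentioned above, a ~/.containers/<STACK_NAME>.yml file could look like the sketch below; the exact schema of this file is not documented here, so the key names are only assumed to mirror the deployment_containers keys:

```yaml
# ~/.containers/<STACK_NAME>.yml -- hypothetical user override picked up by compute-container-nightwatch
CONTAINER_IMAGE: "registry.vicos.si/ccc-base:latest"   # must be based on vicos/ccc-base:latest
INSTALL_PACKAGES: "graphviz ffmpeg"                     # extra apt packages installed at container boot
INSTALL_REPOSITORY_KEYS: ""
INSTALL_REPOSITORY_SOURCES: ""
```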