
Conversation

@OldaKodym commented Jun 17, 2025

A self-installing script with a command-line interface for live backups of a local ClearML server instance.
It lets you create and restore ClearML snapshots. It supports backing up Elasticsearch, MongoDB, Redis, and fileserver data without shutting down the server, and it also allows scheduling backups using cron jobs.

Simply run it with --help and go from there.
[Screenshot: the script's --help output]
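
To give an idea of what "live" means here, a MongoDB and Redis dump can be taken while the containers are running, along these lines (simplified sketch, not the script's exact commands; container names are the clearml-server docker-compose defaults, paths are placeholders):

import os
import subprocess

os.makedirs("backup", exist_ok=True)

# Dump MongoDB while it is serving requests; mongodump reads a consistent
# view of the collections without stopping the database.
subprocess.run(
    ["docker", "exec", "clearml-mongo",
     "mongodump", "--archive=/tmp/mongo.archive"],
    check=True,
)
subprocess.run(
    ["docker", "cp", "clearml-mongo:/tmp/mongo.archive", "backup/"],
    check=True,
)

# Ask Redis for a background RDB snapshot; a real script should poll
# LASTSAVE until the save completes before copying the dump out.
subprocess.run(
    ["docker", "exec", "clearml-redis", "redis-cli", "BGSAVE"],
    check=True,
)
subprocess.run(
    ["docker", "cp", "clearml-redis:/data/dump.rdb", "backup/"],
    check=True,
)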

I tested it on a dummy server instance as well as on a multi-TB instance while tasks were running, and successfully restored the snapshots in both cases. Still, there may be some loose ends or edge cases left to handle, so I'd be glad for any feedback.

@jkhenning (Member) commented

Hi @OldaKodym,

This looks really cool! How did you test it?

@OldaKodym commented Jun 22, 2025

Hello @jkhenning,

It was a pretty manual process: I created a dummy task with a random scalar chart, a debug sample image, and a text-file artifact, and kept an infinite task running alongside it just to check that nothing breaks. I let that run, ran the snapshot creation, killed the server, and deleted everything in /opt/clearml. Then I reinitialized the server, copied back the backed-up config and docker-compose.yaml, brought up the new instance's docker network, and ran the snapshot restoration. Scalars, debug samples, and artifacts all appeared with no apparent issues.

Something along these lines:

import time
from clearml import Task
import numpy as np
from PIL import Image
import tempfile

def main():
    # Create a new ClearML task
    task = Task.init(project_name="testing_project", task_name="testing_task")

    # Log 100 random scalars
    for i in range(100):
        value = np.random.rand()
        task.get_logger().report_scalar(title="random_testing_scalar", series="test_series", value=value, iteration=i)

    # Create a random image and log it
    random_image = (np.random.rand(128, 128, 3) * 255).astype(np.uint8)
    img = Image.fromarray(random_image)
    task.get_logger().report_image(title="random_testing_image", series="sample", iteration=0, image=img)

    # Create a random artifact file and upload it

    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp_file:
        tmp_file.write(b"This is a random artifact file.\n")
        tmp_file_path = tmp_file.name

    task.upload_artifact(name="testing_artifact", artifact_object=tmp_file_path)

    # Close the first task before starting another one in the same process
    task.close()

    # Create a long-running task
    task = Task.init(project_name="testing_project", task_name="testing_task_long")

    # Log random scalars every 10 seconds
    i = 0
    while True:
        value = np.random.rand()
        task.get_logger().report_scalar(title="random_testing_scalar", series="test_series", value=value, iteration=i)
        i += 1
        time.sleep(10)

if __name__ == "__main__":
    main()
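
For what it's worth, the post-restore check could also be scripted rather than eyeballed in the UI; a minimal sketch with the ClearML SDK, reusing the names from the script above:

from clearml import Task

# Fetch the restored task by name and verify that its logged data survived.
task = Task.get_task(project_name="testing_project", task_name="testing_task")

scalars = task.get_reported_scalars()  # {title: {series: {"x": [...], "y": [...]}}}
assert "random_testing_scalar" in scalars

# Artifacts are exposed as a dict keyed by the name used at upload time.
assert "testing_artifact" in task.artifacts
print("restored scalars:", list(scalars))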

I repeated a similar process on our live server, although I skipped the fileserver to save some time, since that part is just rsyncing files back and forth anyway. Our ES instance holds ~80 GB, and I let a testing task run again to check that the shards don't get corrupted by continuous logging while the snapshot is being created. Again, after deletion and restoration, everything seems to be in order.
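
On the shard-corruption point: Elasticsearch snapshots are point-in-time and incremental, so indexing can continue while one is being taken; documents written afterwards simply land in the next snapshot. A rough sketch of driving this through the snapshot API (repository name, location, and host/port are placeholders, and the script's internals may differ):

import requests

ES = "http://localhost:9200"  # wherever the ES container is reachable

# Register a filesystem snapshot repository; the location must be listed
# under path.repo in elasticsearch.yml and mounted into the container.
requests.put(
    f"{ES}/_snapshot/clearml_backup",
    json={"type": "fs", "settings": {"location": "/mnt/es_backups"}},
).raise_for_status()

# Take a consistent point-in-time snapshot while logging continues.
requests.put(
    f"{ES}/_snapshot/clearml_backup/snapshot_1",
    params={"wait_for_completion": "true"},
).raise_for_status()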
