refactor(worker): use revision_exists from huggingface_hub to check branch existence #3203

Open. Wants to merge 5 commits into main (base branch).
12 changes: 6 additions & 6 deletions services/worker/src/worker/utils.py
@@ -13,10 +13,9 @@
 from urllib.parse import quote

 import PIL
-import requests
 from datasets import Dataset, DatasetInfo, DownloadConfig, Features, IterableDataset, load_dataset
 from datasets.utils.file_utils import SINGLE_FILE_COMPRESSION_EXTENSION_TO_PROTOCOL
-from huggingface_hub import HfFileSystem, HfFileSystemFile
+from huggingface_hub import HfFileSystem, HfFileSystemFile, revision_exists
 from huggingface_hub.errors import RepositoryNotFoundError
 from huggingface_hub.hf_api import HfApi
 from libcommon.constants import CONFIG_SPLIT_NAMES_KIND, MAX_COLUMN_NAME_LENGTH
@@ -176,11 +175,12 @@ def retry_on_arrow_invalid_open_file(

 def create_branch(dataset: str, target_revision: str, hf_api: HfApi, committer_hf_api: HfApi) -> None:
     try:
-        refs = retry(on=[requests.exceptions.ConnectionError], sleeps=LIST_REPO_REFS_RETRY_SLEEPS)(
Collaborator:

Do we still need LIST_REPO_REFS_RETRY_SLEEPS in the code base? If not, we should remove it.

Contributor Author:

> Do we still need LIST_REPO_REFS_RETRY_SLEEPS in the code base? If not, we should remove it.

I don't think we need it. The list_repo_refs retry logic is no longer used after switching to revision_exists, so I've removed LIST_REPO_REFS_RETRY_SLEEPS.

-            hf_api.list_repo_refs
-        )(repo_id=dataset, repo_type=DATASET_TYPE)
-        if all(ref.ref != target_revision for ref in refs.converts):
+        # Check if the target revision (branch) already exists
+        if not revision_exists(dataset, target_revision):
+            # If not, get the repo's initial commit (list_repo_commits returns newest first, so [-1] is the oldest)
             initial_commit = hf_api.list_repo_commits(repo_id=dataset, repo_type=DATASET_TYPE)[-1].commit_id
+
+            # Create a new branch at the initial commit
             committer_hf_api.create_branch(
                 repo_id=dataset, branch=target_revision, repo_type=DATASET_TYPE, revision=initial_commit, exist_ok=True
             )
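
For readers following the change, here is a minimal sketch of the same check-then-create flow written against the HfApi method rather than the module-level function. It is not the PR's code. The function name create_branch_if_missing and the value of DATASET_TYPE are assumptions made for illustration. The sketch passes repo_type explicitly because HfApi.revision_exists treats an omitted repo_type as a model repo, and it calls the method on the hf_api instance so that the same client configuration is used for the check as for the commit listing.

```python
from typing import Any

# Assumed value; the diff only shows the constant's name.
DATASET_TYPE = "dataset"


def create_branch_if_missing(dataset: str, target_revision: str, hf_api: Any, committer_hf_api: Any) -> bool:
    """Create target_revision at the repo's oldest commit unless it already exists.

    hf_api and committer_hf_api are expected to behave like huggingface_hub.HfApi
    instances. Returns True if a branch creation was requested, False otherwise.
    """
    # repo_type is passed explicitly: when omitted, the Hub client assumes a model repo.
    if hf_api.revision_exists(repo_id=dataset, revision=target_revision, repo_type=DATASET_TYPE):
        return False

    # list_repo_commits returns commits newest first, so [-1] is the initial commit.
    initial_commit = hf_api.list_repo_commits(repo_id=dataset, repo_type=DATASET_TYPE)[-1].commit_id

    # exist_ok=True keeps the call safe if another worker created the branch in between.
    committer_hf_api.create_branch(
        repo_id=dataset,
        branch=target_revision,
        repo_type=DATASET_TYPE,
        revision=initial_commit,
        exist_ok=True,
    )
    return True
```

Unlike the removed list_repo_refs call, this sketch does not retry on connection errors; whether that retry is still wanted is a separate decision for the PR.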