Commit 697472c

exp: retry removing temporary refs to avoid FileLocked errors
Per-ref rwlocks are already in place, but they do not prevent conflicts when refs are stored in packed-refs, because multiple processes can modify that single file at the same time. This is triggering `FileLocked` errors in #10673. Wrap temporary ref operations (set, get, and remove) in a retry loop (10 attempts, 0.1s delay) to mitigate these race conditions.
1 parent a68dda4 commit 697472c

1 file changed: +15 -3 lines changed

dvc/repo/experiments/executor/base.py

Lines changed: 15 additions & 3 deletions
@@ -9,6 +9,7 @@
 from itertools import chain
 from typing import TYPE_CHECKING, Any, Callable, NamedTuple, Optional, Union
 
+import funcy
 from funcy import nullcontext
 from scmrepo.exceptions import SCMError
 
@@ -815,11 +816,22 @@ def _copy_path(src, dst):
 
     @contextmanager
     def set_temp_refs(self, scm: "Git", temp_dict: dict[str, str]):
+        # Retry ref set, get, and remove operations to handle transient issues during
+        # concurrent Git access.
+        # Dulwich deletes parent directories of refs if they happen to be empty after
+        # removing a ref, which can interfere with `set_ref` in other processes.
+        # `remove_ref` may also fail with a `FileLocked` error when refs are packed,
+        # since multiple processes might attempt to write to the same file.
+        retry = funcy.retry(10, errors=Exception, timeout=0.1)
+        set_ref = retry(scm.set_ref)
+        get_ref = retry(scm.get_ref)
+        remove_ref = retry(scm.remove_ref)
+
         try:
             for ref, rev in temp_dict.items():
-                scm.set_ref(ref, rev)
+                set_ref(ref, rev)
             yield
         finally:
             for ref in temp_dict:
-                if scm.get_ref(ref):
-                    scm.remove_ref(ref)
+                if get_ref(ref):
+                    remove_ref(ref)
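
The retry behavior introduced by this change can be illustrated with a small standalone sketch. It is not DVC's code: the `retry` helper below is a stdlib-only approximation of what `funcy.retry(10, errors=Exception, timeout=0.1)` does (call again after a failure, sleep between attempts, re-raise after the last one), and `FlakyScm`, the local `FileLocked` class, and the ref name are hypothetical stand-ins invented for illustration.

```python
import time
from contextlib import contextmanager
from functools import wraps


def retry(tries, errors=Exception, timeout=0.0):
    """Approximation of a retry decorator: up to `tries` calls, `timeout` seconds apart."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(tries):
                try:
                    return func(*args, **kwargs)
                except errors:
                    # Give up and re-raise once the final attempt has failed.
                    if attempt == tries - 1:
                        raise
                    time.sleep(timeout)

        return wrapper

    return decorator


class FileLocked(Exception):
    """Hypothetical stand-in for the lock error named in the commit message."""


class FlakyScm:
    """Hypothetical in-memory SCM whose remove_ref fails `failures` times first."""

    def __init__(self, failures):
        self.refs = {}
        self.failures = failures
        self.remove_calls = 0

    def set_ref(self, ref, rev):
        self.refs[ref] = rev

    def get_ref(self, ref):
        return self.refs.get(ref)

    def remove_ref(self, ref):
        self.remove_calls += 1
        if self.failures > 0:
            self.failures -= 1
            raise FileLocked(ref)
        del self.refs[ref]


@contextmanager
def set_temp_refs(scm, temp_dict, tries=10, timeout=0.1):
    # Same shape as the patched method: wrap each ref operation, then use the wrappers.
    wrap = retry(tries, errors=Exception, timeout=timeout)
    set_ref = wrap(scm.set_ref)
    get_ref = wrap(scm.get_ref)
    remove_ref = wrap(scm.remove_ref)
    try:
        for ref, rev in temp_dict.items():
            set_ref(ref, rev)
        yield
    finally:
        for ref in temp_dict:
            if get_ref(ref):
                remove_ref(ref)


# Usage: two simulated lock failures are absorbed by the retry wrapper.
scm = FlakyScm(failures=2)
with set_temp_refs(scm, {"refs/tmp/example": "abc123"}, timeout=0.0):
    assert scm.get_ref("refs/tmp/example") == "abc123"
assert scm.refs == {}
```

Because the wrapper catches `Exception`, as the diff does, any failure is retried, not only lock contention; a persistent error still surfaces after the final attempt, just later than it would have without the retries.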
