Open
Labels: A: experiments, bug, help wanted, triage
Bug Report
DVC EXP workers dying
Running multiple workers results in failed experiments and no logs
Description
Running dvc queue start with -j greater than 1 causes some experiments to fail that should not fail, and the failed experiments have no logs. In addition, the exp-worker sometimes dies along with the failed experiments.
Reproduce
Example:
params.yaml
value: 1
dvc.yaml
stages:
  experiment_candles:
    cmd: sleep 5 ; echo DONE
    params:
      - params.yaml:
1. git init
2. dvc init
3. Copy dvc.yaml
4. Copy params.yaml
5. git add *.yaml
6. git commit -m "initial commit"
7. Queue experiments
   dvc exp run \
     --queue \
     --set-param "value=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99"
8. Start 20 jobs
   dvc queue start -j 20
9. Check for failed jobs
   dvc queue status | grep Failed
10. Check logs of failed jobs
    dvc queue logs ...
Note: it doesn't always fail, so you may have to repeat from step 7.
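Because the failure is intermittent, the steps above can be wrapped in a small script so they are easy to repeat. The following is only a sketch: the function names (build_values, setup_repo, run_once, repro) are hypothetical helpers, not part of DVC, and the wait loop assumes that unfinished tasks appear as "Queued" or "Running" in the dvc queue status output. Nothing runs until repro is called in an empty directory.

```shell
#!/bin/sh
# Sketch of the reproduction steps as functions. All function names are
# hypothetical helpers for this report, not DVC commands.

# Print "0,1,...,N" for use with --set-param (POSIX sh, no seq needed).
build_values() {
    i=0
    out=""
    while [ "$i" -le "$1" ]; do
        out="${out:+$out,}$i"
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

# Steps 1-6: create the repository with the two files from the report.
setup_repo() {
    git init
    dvc init
    printf 'value: 1\n' > params.yaml
    cat > dvc.yaml <<'EOF'
stages:
  experiment_candles:
    cmd: sleep 5 ; echo DONE
    params:
      - params.yaml:
EOF
    git add *.yaml
    git commit -m "initial commit"
}

# Steps 7-9: queue 100 experiments, start 20 workers, wait, list failures.
# Returns grep's exit status: 0 if at least one "Failed" line was printed.
run_once() {
    dvc exp run --queue --set-param "value=$(build_values 99)"
    dvc queue start -j 20
    # Assumption: unfinished tasks are listed as "Queued" or "Running".
    while dvc queue status | grep -Eq 'Queued|Running'; do
        sleep 5
    done
    dvc queue status | grep Failed
}

# Repeat from step 7 up to $1 times (default 5) until a failure shows up.
repro() {
    attempts="${1:-5}"
    setup_repo
    n=0
    while [ "$n" -lt "$attempts" ]; do
        run_once && return 0
        n=$((n + 1))
    done
    echo "no failed experiments after $attempts attempts" >&2
    return 1
}
```

After sourcing the file, running repro 10 would retry up to ten times. The logs for any task listed as Failed can then be inspected with dvc queue logs, as in step 10.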
Output sample

Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 3.59.0 (brew)
--------------------------
Platform: Python 3.13.1 on macOS-15.2-arm64-arm-64bit-Mach-O
Subprojects:
    dvc_data = 3.16.7
    dvc_objects = 5.1.0
    dvc_render = 1.0.2
    dvc_task = 0.40.2
    scmrepo = 3.3.9
Supports:
    azure (adlfs = 2024.12.0, knack = 0.12.0, azure-identity = 1.19.0),
    gdrive (pydrive2 = 1.21.3),
    gs (gcsfs = 2024.12.0),
    hdfs (fsspec = 2024.12.0, pyarrow = 18.1.0),
    http (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
    https (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
    oss (ossfs = 2023.12.0),
    s3 (s3fs = 2024.12.0, boto3 = 1.35.93),
    ssh (sshfs = 2024.9.0),
    webdav (webdav4 = 0.10.0),
    webdavs (webdav4 = 0.10.0),
    webhdfs (fsspec = 2024.12.0)
Config:
    Global: /Users/pichurri/Library/Application Support/dvc
    System: /Users/pichurri/homebrew/share/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s3s1
Repo: dvc, git
Repo.site_cache_dir: /Users/pichurri/homebrew/var/cache/dvc/repo/7b5c17002f7a7963a4dc1afee2b961e2