Description
What happened + What you expected to happen
As part of a .cmd script on Windows, we start Ray with CALL ray start --head --num-cpus 64. Afterwards, the script spins up a service that uses Ray and stays alive indefinitely.
We noticed that, after moving to Ray, resource usage on the machine increased massively. Most obviously, Ray seems to leak file handles at an alarming rate. We've reproduced this on a few Windows machines, all of which happen to be running 2.33.0. Ray is also leaking memory, albeit more slowly, and so far only one machine has been taken down by this.
If it helps, we observed file handle leaks on Windows 10 Enterprise version 22H2, and file handle + memory leaks on Microsoft Windows Server 2019 Standard 10.0.17763. It's entirely possible that the Windows 10 Enterprise machines leak memory too, but those machines run Ray with far fewer cores.
Details
Fortunately, we have extremely detailed system logs from one of the machines that exhibits these issues. I'm very happy to give as much information as required to help solve the issue, but our data is sensitive, so I'll have to clean everything first.
The following data was extracted from the Windows Server machine. It covers a single Ray worker, of which there were 64 on that machine.
The dip in resource usage on September 1st at around 2pm came from a full restart of Ray, required after the system ran out of memory.
For context, our usage of Ray was extremely light during this period. I've included the working_set_peak plot primarily because it shows quite nicely when we're actually using Ray. The little bump on Aug 27th was light usage between midnight and around 5am. Ray was then nearly completely unused until about 6:30am on Aug 31st.
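In case it helps anyone reproduce this without a full logging stack, here is a minimal sketch of how per-process handle counts and working set could be sampled. It is not what produced the data above. It assumes the third-party psutil package is installed, and it assumes that any process with "ray" in its command line belongs to Ray, which may over- or under-match on a given machine. The names sample_ray_processes, growth_per_hour, and watch are made up for this example.

```python
import time


def sample_ray_processes():
    """Return {pid: (open_handles, working_set_bytes)} for Ray-looking processes.

    Assumption: a process whose command line contains "ray" is part of Ray.
    """
    import psutil  # third-party (pip install psutil); imported lazily

    samples = {}
    for proc in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or []).lower()
        if "ray" not in cmdline:
            continue
        try:
            handles = proc.num_handles()          # only available on Windows
            working_set = proc.memory_info().rss  # working set size on Windows
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # process exited or is not readable by this user
        samples[proc.info["pid"]] = (handles, working_set)
    return samples


def growth_per_hour(series):
    """Average growth per hour between the first and last (timestamp_s, value)."""
    (t0, v0), (t1, v1) = series[0], series[-1]
    if t1 <= t0:
        raise ValueError("need two samples with increasing timestamps")
    return (v1 - v0) * 3600.0 / (t1 - t0)


def watch(interval_s=60.0):
    """Print handle count and average handle growth per PID until interrupted."""
    history = {}
    while True:
        now = time.time()
        for pid, (handles, _working_set) in sample_ray_processes().items():
            series = history.setdefault(pid, [])
            series.append((now, handles))
            if len(series) > 1:
                rate = growth_per_hour(series)
                print(f"pid={pid} handles={handles} {rate:+.1f} handles/h")
        time.sleep(interval_s)
```

Calling watch() right after ray start --head --num-cpus 64 and leaving the cluster idle should show whether handle counts climb without any workload. A steadily positive rate on an idle worker would be consistent with the leak described here, though a short run can be dominated by start-up noise.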
Versions / Dependencies
2.33.0
Reproduction script
ray start --head --num-cpus 64
Issue Severity
High: It blocks me from completing my task.