[Core] Enormous file handler and memory leaks on windows #47459

@RBrearton

What happened + What you expected to happen

As part of a .cmd script on Windows, we start Ray with CALL ray start --head --num-cpus 64. The script then spins up a service that uses Ray and stays alive indefinitely.

We noticed that, after moving to Ray, resource usage on the machine increased massively. The most obvious symptom is that Ray seems to leak file handles at an alarming rate; we've reproduced this on several Windows machines, all of which happen to be running 2.33.0. Ray is also leaking memory, albeit more slowly, and so far only one machine has been taken down by this.

If it helps, we observed file handle leaks on Windows 10 Enterprise version 22H2, and both file handle and memory leaks on Microsoft Windows Server 2019 Standard 10.0.17763. It's entirely possible that the Windows 10 Enterprise machines leak memory too, but those machines run Ray with far fewer cores, so a slower leak may simply not have shown up yet.

Details

Fortunately, we have extremely detailed system logs on one of the machines that exhibits these issues. I'm very happy to give as much information as required to help solve the issue, but our data is sensitive, so I'll have to clean everything first.

The following data was extracted from the Windows Server machine. It covers a single Ray worker, of which there were 64 on that machine.

The dip in resource usage on September 1st at around 2pm came from a full restart of Ray, which was required after the system ran out of memory.

For context, our usage of Ray was extremely light during this period. I've included the working_set_peak plot primarily because it shows quite nicely when we were actually using Ray. The little bump on Aug 27th was light usage between midnight and around 5am. Ray was then almost completely unused until about 6:30am on Aug 31st.

[Four images: resource usage plots for a single Ray worker, including working_set_peak]
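To put a number on growth like that described above, one option is to fit a straight line through periodic samples of a counter (handle count or memory) and read off the slope. The following is a minimal sketch under that assumption, not something taken from our logging setup; `growth_per_hour` is a hypothetical helper name and the sample values in the usage note are made up.

```python
def growth_per_hour(samples):
    """Least-squares slope of (unix_seconds, value) samples, in units per hour.

    `samples` is a list of (timestamp_in_seconds, counter_value) pairs, for
    example handle counts or resident memory in bytes for one process.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a slope")

    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n

    # Ordinary least squares: slope = cov(t, v) / var(t).
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")

    return num / den * 3600.0  # per-second slope converted to per-hour
```

For instance, `growth_per_hour([(0, 100), (3600, 160), (7200, 220)])` fits a line through three hourly samples. A slope that stays clearly positive while the cluster is idle would be consistent with a leak rather than with normal load.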

Versions / Dependencies

Ray 2.33.0

Reproduction script

ray start --head --num-cpus 64
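For anyone trying to reproduce this, the command above only starts Ray; the leak shows up when the processes are watched over time. Below is a minimal sketch of how handle counts and memory could be sampled, assuming the third-party psutil package is installed and that the PIDs of the Ray processes have been looked up separately (for example in Task Manager). `sample_process` and `monitor` are hypothetical helper names, and this is not how the plots in this issue were produced.

```python
import time


def sample_process(proc):
    """Return (pid, handle_count, rss_bytes) for a psutil-like process object."""
    # psutil exposes num_handles() only on Windows; on POSIX systems the
    # closest equivalent is num_fds(), used here as a fallback.
    if hasattr(proc, "num_handles"):
        handles = proc.num_handles()
    else:
        handles = proc.num_fds()
    return proc.pid, handles, proc.memory_info().rss


def monitor(pids, interval_s=60.0, rounds=10):
    """Print CSV lines (timestamp, pid, handles, rss) for the given PIDs."""
    import psutil  # third-party dependency, assumed installed (pip install psutil)

    procs = [psutil.Process(pid) for pid in pids]
    for _ in range(rounds):
        now = time.time()
        for proc in procs:
            pid, handles, rss = sample_process(proc)
            print(f"{now:.0f},{pid},{handles},{rss}")
        time.sleep(interval_s)
```

Calling `monitor([...])` with the PIDs of a few Ray processes right after `ray start --head --num-cpus 64`, and leaving the cluster idle, should show whether the handle count keeps climbing without any work being submitted.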

Issue Severity

High: It blocks me from completing my task.

Labels

P1 (issue that should be fixed within a few weeks), bug (something that is supposed to be working, but isn't), core (issues that should be addressed in Ray Core), core-object-store, windows
