[Ray component: Core] Add way to fix problems with ray::IDLE workers taking up too many resources #27499
Comments
(Possibly if there is a state where ray::IDLE workers are actually waiting to pass data or something else, they should be marked ray::Almost_IDLE or something, so that the user can figure out what is happening from looking at the output of ps.)
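For reference, a rough way to see what those workers are actually using at any moment is to filter ps output for the retitled worker processes. This is just a sketch, assuming a Linux node where the workers show up in ps as ray::IDLE, as in the outputs discussed in this thread:

```python
import subprocess

# List the ray::IDLE workers visible in ps along with their resident memory,
# so you can see how much they are currently holding.
ps = subprocess.run(["ps", "-eo", "pid,rss,args"], capture_output=True, text=True)
for line in ps.stdout.splitlines():
    if "ray::IDLE" in line:
        pid, rss_kb, cmd = line.split(None, 2)
        print(f"pid={pid} rss={int(rss_kb) // 1024} MiB cmd={cmd}")
```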
We cannot put a hard cap on the number of workers since this can potentially deadlock if you have a deeply nested application, so I'm afraid this feature is unlikely to be supported. The recommended way to impose a soft cap is to use ...

That said, this is definitely abnormal:
Can you please provide a reproduction script so we can dig into this further?
Hm, I don't have a simple reproduction script, but roughly what we are doing is calling ray.init in one Python program, which then runs multiple ray.remote tasks, and inside those remote tasks a subprocess call launches another Python program that does its own ray.init and then calls its own ray.remote tasks. I think that nesting is required to get to levels like 90. That said, if you want to see ray::IDLE workers that never get cleaned up, the reproduction scripts in #28071 work. Here they are in one listing.

Create environment:

```bash
conda create --name ray2
conda activate ray2
conda install --name ray2 pip
pip install ray==2.0.*
```

ray_start.sh:

```bash
#!/bin/bash
source /home/fred/miniconda3/etc/profile.d/conda.sh
conda activate ray2
NUM_CPUS=$1
HEAD_ADDRESS=$2
ray start --verbose --address=$HEAD_ADDRESS --num-cpus $NUM_CPUS --min-worker-port 10002 --max-worker-port $((10002+8*$NUM_CPUS))
```

ssh to one of the nodes to run on, and run this Python script (listen_range.py), which holds the worker ports open:

```python
import socket
import time

l = []
for PORT in range(10002, 10010+1):
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(('', PORT))
        s.listen(1)
        l.append(s)
    except OSError:
        print("unable to get ", PORT)
if len(l) > 0:
    input("Press enter to release ports:")
```

Script to start Ray ($PBS_NODEFILE lists the nodes to run on):

```bash
ray start --head --num-cpus=1 --port=0 --min-worker-port 10002 --max-worker-port 10010
# Adjust this based on the output of ray start
export RAY_ADDRESS='10.159.5.15:65512'
for NODE in `cat $PBS_NODEFILE | uniq`; do
  if echo $NODE | grep -q `hostname`; then
    echo skipping $NODE;
  else
    echo $NODE
    ssh $NODE /home/fred/ray_start.sh 1 $RAY_ADDRESS
  fi
done
```

Python script to test Ray:

```python
import ray
import time
import gc

ray.init(address='auto')
start = time.time()

@ray.remote
def f(x):
    a = [2**2**2**2**2 for x in range(100000)]
    return x * x

print("starting remote functions")
futures = [f.remote(i) for i in range(80)]
print("remotes started")
left = len(futures)
outputs = [None]*left
while left > 0:
    for i in range(len(futures)):
        if futures[i] is not None:
            try:
                f = futures[i]
                runReturn = ray.get(f, timeout=1e-10)
                print(i, runReturn)
                left += -1
                outputs[i] = runReturn
                futures[i] = None
                gc.collect()
                print("referrers", len(gc.get_referrers(f)))
                del f
                gc.collect()
            except ray.exceptions.GetTimeoutError:
                pass
print(outputs)
end = time.time()
print("time: ", end - start)
```

Basically, listen_range.py on a node blocks that node from creating workers, and eventually you end up with ray::IDLE on lots of nodes and no progress being made. I suspect that this simplified reproduction script shows at least part of the problem we are encountering that is causing lots of ray::IDLE workers.
And if you want me to try to replicate the 90 idle Ray workers, I can try to make a small replication script that does that. I suspect it is just the above with an outer Python program that calls ray.init and then calls subprocess on an inner program that runs the computational work.
Um, correct me if I am wrong, but doesn't this mean that Ray can potentially deadlock with a deeply nested application, since any physical machine has some actual hard limit on the number of workers? It sounds like Ray may need some way to error out in that case. (This may be difficult to fix...)
Each time you call ...

The best practice here is to try to keep the whole application under the same driver. In the ray.remote calls, is it possible to call the other tasks directly instead of through a subprocess that calls ray.init()? This should reduce the total number of workers needed to run the job.
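A minimal sketch of what calling the other tasks directly could look like (the function names and chunking here are made up for illustration):

```python
import ray

ray.init()

@ray.remote
def inner_work(x):
    # The work that the inner program's own ray.remote tasks used to do.
    return x * x

@ray.remote
def outer_task(chunk):
    # Instead of subprocess-ing "python inner_program.py ..." and calling a
    # second ray.init() in that child, submit the inner work as ordinary
    # nested tasks on the same cluster, under the same driver.
    return ray.get([inner_work.remote(x) for x in chunk])

chunks = [list(range(i, i + 4)) for i in range(0, 16, 4)]
print(ray.get([outer_task.remote(chunk) for chunk in chunks]))
```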
It wouldn't deadlock, but yes, Ray could potentially crash if you had an extremely deeply nested application. You may bump into this if you have a recursive program that nests to depth ~1000 or more, which in practice we haven't seen very often. Ray's scheduler will try to avoid this situation by prioritizing finishing a nested branch instead of starting a new one.
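For illustration, a "deeply nested application" in this sense looks roughly like the following hypothetical recursion:

```python
import ray

ray.init()

@ray.remote
def nested(depth):
    # Each level submits a child task and blocks on it, so every level of the
    # chain keeps its worker process alive until all levels below it finish.
    if depth == 0:
        return 0
    return ray.get(nested.remote(depth - 1)) + 1

# A recursion depth around ~1000 means on the order of a thousand
# simultaneously blocked workers, which is where a node can run out of
# processes or memory.
print(ray.get(nested.remote(10)))
```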
I agree in principle that it would be nice to eliminate the subprocess call. We discuss that roughly every 6 months, but it is non-trivial. And for what it is worth, the subprocesses do exit, and to get to 90, they probably have to not be cleaned up after exiting. We also set it up so that there should be approximately one task per CPU. That said, the replication in #27499 (comment) does not use a subprocess call, and it has ray::IDLE workers that take substantially longer than 1 minute to clean up.
@joshua-cogliati-inl are you okay closing this issue? We cannot provide the requested feature to hard-cap workers, but feel free to open a new issue or rename this one to something else.
Hm, would "ray workers not being cleaned up if node blocked" be an acceptable renaming? (And if there is another way to do the fault injection quickly besides min/max worker ports I would be happy to try a replication script with that instead.) |
(If #28071 was renamed to "[Core] Ray may hang if workers fail to start", I would be fine with closing this one, because this one is about preventing Ray from failing if the workers use too much memory.)
Okay, the actual problem (as the Stack Overflow question with no good answers describes: https://stackoverflow.com/questions/60098624/what-is-rayidle-and-why-are-some-of-the-workers-running-out-of-memory/63231293) is that Ray can often end up in situations where ray::IDLE workers are taking up a lot of resources. I think this may require fixing various bugs (#28071 and #28199, to name two) and possibly improving the ability of users to diagnose why a ray::IDLE worker is being created. If the Stack Overflow question had a simple answer like: run ...
Alternative reproducer to get long-lived ray::IDLE workers:

```python
import ray
import time
import gc
import threading

ray.init(address='auto')
start = time.time()

class MemUser:
    def __init__(self, v):
        self.v = v
        # [0]*15*1024 takes about 10KB
        self.extra = [0]*15*1024*3000
    def __str__(self):
        return f"MemUser({self.v})"

@ray.remote
def f(x):
    a = [2**2**2**2**2 for x in range(100000)]
    return MemUser(x * x)

print("starting remote functions")
futures = [f.remote(i) for i in range(19)]
print("remotes started")
left = len(futures)
needed_to_finish = len(futures)
outputs = [None]*left
lowest_running = 0
while lowest_running < needed_to_finish:
    lowest_running = len(futures)
    for i in range(len(futures)):
        if futures[i] is not None:
            if i < lowest_running:
                lowest_running = i
            try:
                fi = futures[i]
                runReturn = ray.get(fi, timeout=1e-10)
                print(i, runReturn, left, len(gc.get_referrers(fi)))
                left += -1
                outputs[i] = runReturn
                futures[i] = None
                del fi
            except ray.exceptions.GetTimeoutError:
                pass
print(outputs)
print(lowest_running)
end = time.time()
print("time: ", end - start)
```

Basically, this returns a lot of data from each ray remote call. And ray status thinks everything is fine:
So ray status goes from:

to

without the ray.get ever returning a value, throwing an exception, or writing anything related to this to *.err in the logs directories.
@stephanie-wang Is this worth a separate issue?
Neat, we got a stack trace with 35 MB return values after waiting long enough (note that it ran for something like 15 minutes before the "failed to connect to GCS within 60 seconds" error happened):
Added new issue at: #28855
For what it is worth, I created a test where I start Ray with process A, and then start handing processes B, C, ... the address of the Ray server. B, C, ... each run some Ray work and then finish, and I still have tons of ray::IDLE processes from B, C, ... So I think I have ray::IDLE workers from processes that finished an hour ago but still haven't been collected.
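Roughly, that test has the following shape (this is only a sketch: the real test uses separate script files and passes the explicit cluster address to the child drivers, while here the children just rely on address="auto" because they run on the same node):

```python
import subprocess
import ray

# "Process A": start a Ray cluster that stays up for the whole test.
ray.init()

# "Processes B, C, ...": short-lived drivers that connect, run a bit of work,
# and then exit.
child_driver = """
import ray

ray.init(address="auto")

@ray.remote
def square(x):
    return x * x

print(ray.get([square.remote(i) for i in range(8)]))
"""

for _ in range(3):
    subprocess.run(["python", "-c", child_driver], check=True)

# After the child drivers have exited, ps on the node still shows ray::IDLE
# workers left over from them, which is the behavior described above.
```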
+1. I still use Ray 1.11.1; after testing, it is the latest version in which I cannot reproduce this issue.
@stephanie-wang In reply to the Stack Overflow comment, is there any way I can debug ray::IDLE workers that persist after the process exits? (Is there something I should call before exiting a process to tell Ray to drop any lingering ray::IDLE workers? Is there a way to tell a ray remote that I am done with it, besides deleting it?)
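For concreteness, the kind of explicit cleanup call I am asking about is something like ray.shutdown() at the end of the driver; here is a sketch (I don't know whether this actually releases lingering ray::IDLE workers, which is the question):

```python
import ray

ray.init(address="auto")

@ray.remote
def work(x):
    return x * x

try:
    print(ray.get([work.remote(i) for i in range(8)]))
finally:
    # Explicitly disconnect this driver before the process exits, rather than
    # relying on interpreter teardown to do it.
    ray.shutdown()
```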
Hi, I'm a bot from the Ray team :) To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months. If there is no further activity within 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
For what it is worth, the only way I have found to decrease the amount of resources that ray::IDLE workers use has been to switch to Dask :(
We fixed various issues related to the process leak. We will try to reproduce with the script above, and close the issue if it is not happening again.
@scv119 have you had a chance to take a look at this yet?
@LilDojd thanks for the detailed report!
IDLE processes are supposed to exist while you are running your script, but not after you terminate your job.
If your job is already finished and you don't use detached actors, it is mostly okay. I think we actually already do this. It is risky to do it while you are running your script because the idle processes can still own some important metadata. I have a couple more questions here;
Hmm, actually having idle processes alive while running a script is not unexpected behavior (it is intended, actually). I think the main question here is
Description

Ray has --num-cpus to limit the number of CPUs used on a node, but this does not limit the number of ray::IDLE workers that exist. For example, after running ray start --address=10.159.8.149:53454 --num-cpus 4 I have seen over 90 ray::IDLE workers created. Each of these workers uses CPU and memory, which results in significant resource use.

On Stack Overflow, use of ray.init(local_mode=True) was suggested (https://stackoverflow.com/a/63231293/18954005), but that basically removes parallelism. An alternative workaround is to use --min-worker-port and --max-worker-port to restrict the number of workers, but if ports are already used by some other process, that can cause fewer workers to be created than desired.

Use case

I would like to be able to limit the resources used by ray::IDLE workers.