ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set

#68

by mahi22muki - opened Jul 25, 2023

Jul 25, 2023

I am trying to run the script across 2 servers (each with 4 GPUs × 2) using mpirun with Horovod, and I am facing this error. Rank, world size and local size are not being detected automatically, and MASTER_ADDR and MASTER_PORT are not being picked up either.

Please help me resolve this error. I set:
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
I have tried different approaches, but nothing worked.
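For reference, here is a minimal sketch of one possible workaround. It assumes the job is launched with Open MPI's `mpirun` (which exports `OMPI_COMM_WORLD_RANK`, `OMPI_COMM_WORLD_SIZE`, `OMPI_COMM_WORLD_LOCAL_RANK` and `OMPI_COMM_WORLD_LOCAL_SIZE` to every process) and that the script calls `torch.distributed.init_process_group` with `init_method="env://"`. By default that rendezvous reads `RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT`, none of which `mpirun` sets, so the sketch copies the MPI values across under the expected names. The helper names `torch_env_from_openmpi` and `apply_to_process_env` are hypothetical. Also note that `MASTER_ADDR='localhost'` can only work when every rank runs on one machine; across two servers it has to be an address of the rank-0 node that both servers can reach.

```python
import os
from typing import Dict, Mapping


def torch_env_from_openmpi(
    env: Mapping[str, str],
    master_addr: str,
    master_port: int = 29500,
) -> Dict[str, str]:
    """Translate Open MPI's per-process variables (hypothetical helper)
    into the names torch.distributed's env:// rendezvous reads.

    Assumption: the process was started by Open MPI's mpirun, which sets
    OMPI_COMM_WORLD_*. master_addr must be the host/IP of the rank-0 node,
    reachable from every node ("localhost" only works on a single machine).
    """
    try:
        rank = env["OMPI_COMM_WORLD_RANK"]
        world_size = env["OMPI_COMM_WORLD_SIZE"]
        local_rank = env["OMPI_COMM_WORLD_LOCAL_RANK"]
        local_size = env["OMPI_COMM_WORLD_LOCAL_SIZE"]
    except KeyError as missing:
        # Not launched by Open MPI (or a different MPI implementation).
        raise RuntimeError(
            f"{missing} is not set; was this process started by Open MPI's mpirun?"
        ) from None
    return {
        "RANK": rank,                   # global rank across both servers
        "WORLD_SIZE": world_size,       # total number of processes
        "LOCAL_RANK": local_rank,       # rank within this server (GPU index)
        "LOCAL_WORLD_SIZE": local_size, # processes on this server
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


def apply_to_process_env(master_addr: str, master_port: int = 29500) -> None:
    """Export the translated variables into this process's environment,
    overwriting any earlier value such as MASTER_ADDR='localhost'."""
    os.environ.update(torch_env_from_openmpi(os.environ, master_addr, master_port))
```

Usage would then be `apply_to_process_env("node0.example")` (a placeholder hostname) near the top of the script, before `torch.distributed.init_process_group(backend="nccl", init_method="env://")`. This only covers Open MPI; other MPI implementations export different variable names.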

RylanSchaeffer

Apr 27, 2024

Did you ever find a solution?
