Timeout in multi-machine training with PyTorch?
The following error occurs during multi-machine training with PyTorch:
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:136] Timed out waiting 1800000ms for send operation to complete
I increased the timeout to 3 days, but the same error still occurs.
How can I deal with this? Thanks~
import datetime

import torch.distributed as dist

dist.init_process_group(
    backend=args.dist_backend,
    init_method=args.dist_url,
    world_size=args.world_size, rank=args.rank,
    timeout=datetime.timedelta(days=3)
)
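For reference, here is a minimal self-contained sketch of how I initialize the process group with the longer timeout; the env:// rendezvous and environment variables below are only illustrative stand-ins for the args.dist_url, args.rank, etc. used in my actual script:

import datetime
import os

import torch.distributed as dist


def init_distributed():
    # Illustrative only: in my real script these values come from args / the launcher.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    dist.init_process_group(
        backend="gloo",                      # args.dist_backend in my script
        init_method="env://",                # args.dist_url in my script
        world_size=world_size,
        rank=rank,
        timeout=datetime.timedelta(days=3),  # raised from the default 30 minutes
    )


if __name__ == "__main__":
    init_distributed()
    # ... training loop goes here ...
    dist.destroy_process_group()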
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow