Horovod Timeline and MPI Tracing in Azure Machine Learning Workspace (MPI Configuration)

All,
I am trying to train a distributed model using Horovod on Azure Machine Learning Service as shown below.

from azureml.core.runconfig import MpiConfiguration
from azureml.train.dnn import TensorFlow

# script_folder, script_params, compute_target_gpu_4 and exp are defined earlier.
estimator = TensorFlow(source_directory=script_folder,
                       entry_script='train_script.py',
                       script_params=script_params,
                       compute_target=compute_target_gpu_4,
                       conda_packages=['scikit-learn'],
                       node_count=2,
                       distributed_training=MpiConfiguration(),
                       framework_version='1.13',
                       use_gpu=True)
run = exp.submit(estimator)

  • How do I enable the Horovod timeline?
  • How do I enable more detailed MPI tracing to see the communication between the nodes?

Thanks.



Solution 1:[1]

The following sample uses the TensorFlow Estimator class in the SDK, with distributed_training set to Mpi().

[Image not available]

Another sample uses Horovod to train a GenSen sentence similarity model: https://github.com/microsoft/nlp-recipes/blob/46c0658b79208763e97ae3171e9728560fe37171/examples/sentence_similarity/gensen_train.py
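
On the timeline question itself, Horovod reads the HOROVOD_TIMELINE environment variable to decide where to write its timeline file, so one possible approach is to set that variable for the training processes. The sketch below only builds the dictionary of variables. The helper name build_tracing_env and the output path are hypothetical, and passing the result through the estimator's environment_variables argument is an assumption that has not been verified against this exact setup. NCCL_DEBUG only affects NCCL logging on GPU runs; MPI-level verbosity depends on which MPI implementation the image ships, so it is not covered here.

```python
def build_tracing_env(timeline_path="./outputs/horovod_timeline.json",
                      mark_cycles=False,
                      nccl_debug=True):
    """Build environment variables that switch on Horovod/NCCL diagnostics.

    timeline_path is a hypothetical location; Horovod writes the timeline
    file to whatever path HOROVOD_TIMELINE points to.
    """
    env = {"HOROVOD_TIMELINE": timeline_path}
    if mark_cycles:
        # Adds markers for Horovod's internal cycles to the timeline.
        env["HOROVOD_TIMELINE_MARK_CYCLES"] = "1"
    if nccl_debug:
        # Makes NCCL log its communicator setup and transport choices.
        env["NCCL_DEBUG"] = "INFO"
    return env


tracing_env = build_tracing_env(mark_cycles=True)

# Assumed wiring, not verified against this setup:
# estimator = TensorFlow(..., environment_variables=tracing_env)
```

The resulting JSON file can be opened in Chrome's chrome://tracing viewer once the run has finished.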

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1