Azure Machine Learning pipeline: How to retry upon failure?

So I've got an Azure Machine Learning pipeline here that consists of a number of PythonScriptStep tasks - pretty basic really.

Some of these script steps fail intermittently due to network issues or somesuch - really nothing unexpected. The solution here is always to simply trigger a rerun of the failed experiment in the browser interface of Azure Machine Learning studio.

Despite my best efforts I haven't been able to figure out how to set a retry parameter on the script step objects, the pipeline object, or any other Azure ML-related object. This is a common pattern in pipelines of any sort: if a task fails once, retry it a couple of times before deciding it has actually failed.

Does anyone have pointers for me please?

Edit: One helpful user suggested an external solution to this which requires an Azure Logic App that listens to ML pipeline events and re-triggers failed pipelines via an HTTP request. While this solution may work for some it just takes you down another rabbit hole of setting up, debugging, and maintaining another external component. I'm looking for a simple "retry upon task failure" option that (IMO) must be baked into the Azure ML pipeline framework and is hopefully just poorly documented.



Solution 1:[1]

I assume that if a script fails, you want to rerun the entire pipeline. In that case, it is pretty simple with Logic Apps. What you need is the following:

  1. You need to make a PipelineEndpoint for your pipeline so it can be triggered by something outside Azure ML.
  2. You need to set up a Logic App to listen for failed runs. See the following: https://medium.com/geekculture/notifications-on-azure-machine-learning-pipelines-with-logic-apps-5d5df11d3126. Instead of posting a message to Microsoft Teams as in that example, you invoke your pipeline through its endpoint.
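
For step 1, here is a minimal sketch of publishing the pipeline behind an endpoint. It assumes the v1 Python SDK (`azureml-pipeline-core`) and its `PipelineEndpoint.publish` call; the helper name `publish_retry_endpoint` and the `endpoint_cls` argument are my own additions for illustration and are not part of the SDK.

```python
def publish_retry_endpoint(workspace, pipeline, name, description, endpoint_cls=None):
    """Publish `pipeline` behind a PipelineEndpoint and return its REST URL.

    `endpoint_cls` is a hypothetical hook (not part of Azure ML) that lets
    you swap in a stand-in class when the SDK is not installed.
    """
    if endpoint_cls is None:
        # Assumption: Azure ML SDK v1 is installed (azureml-pipeline-core).
        # Imported lazily so the module loads without the SDK present.
        from azureml.pipeline.core import PipelineEndpoint
        endpoint_cls = PipelineEndpoint

    endpoint = endpoint_cls.publish(
        workspace=workspace,
        name=name,
        pipeline=pipeline,
        description=description,
    )
    # The REST URL that something outside Azure ML (e.g. the Logic App) calls.
    return endpoint.endpoint
```

You would call it with the `Workspace` and `Pipeline` objects you already have and paste the returned URL into the Logic App's HTTP action.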

Solution 2:[2]

(this would ideally be a comment but it exceeded the word limit)

@user787267's answer above helped me set up the retry pipeline, so I thought I'd add a few more details that might help someone else set this up.

How to set up the HTTP action

Method: POST 
URI: The pipeline endpoint that you configured 
Headers: `Key`: Content-Type -- `Value`: application/json 
Body: 
{ 
  "ExperimentName": "my_experiment_name", 
  "ParameterAssignments": { 
        "param1": "value1", 
        "param2": "value2" }, 
  "RunSource": "SDK" 
} 
Authentication Type: Managed Identity
Managed Identity: System-assigned managed identity

You can set up the managed identity by going to the logic app's page and then clicking on the Identity tab. After that just follow the steps. You'll need to give the managed identity permissions over the space in which your ML instance lives.
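
If you want to test the same call outside the Logic App, the HTTP action above can be approximated in plain Python. This is only a sketch: it assumes you already have an Azure AD bearer token from somewhere (the Logic App gets this for you through the managed identity), and the function names `build_trigger_request` and `trigger_pipeline` are made up for this example.

```python
import json
import urllib.request


def build_trigger_request(endpoint_url, token, experiment_name, parameters):
    """Build the POST request that the Logic App's HTTP action sends."""
    body = {
        "ExperimentName": experiment_name,
        "ParameterAssignments": parameters,
        "RunSource": "SDK",
    }
    headers = {
        "Content-Type": "application/json",
        # Assumption: the caller supplies a valid bearer token.
        "Authorization": "Bearer " + token,
    }
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def trigger_pipeline(endpoint_url, token, experiment_name, parameters):
    """Send the request and return the decoded JSON response."""
    request = build_trigger_request(endpoint_url, token, experiment_name, parameters)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

`build_trigger_request` is kept separate so you can inspect the method, headers, and body before anything is sent to the endpoint.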

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 user787267
Solution 2 David Clarance