'Py4JJavaError in an Azure Databricks notebook pipeline

I have a curious issue, when launching a databricks notebook from a caller notebook through dbutils.notebook.run (I am working in Azure Databricks).

Error

One interesting thing I noticed is that when manually launching the inner notebook, everything goes smoothly.

I am also positive that at least one run had been successful even when called by the outer notebook in the exact same conditions. It is likely it never worked when called from outside, see explaination of the issue below.

What is weird is that when I get to view the inner notebook run, I have a pandas related exception (KeyError: "None of [Index(['address'], dtype='object')] are in the [columns]"). But I really don't think that it is related to my code as, like mentioned above, the code works when the inner notebook is run directly. For what it helps, the inner notebook has some heavy pandas computation.

The full visible java stack in the outer notebook is:

Py4JJavaError: An error occurred while calling o1141._run.
: com.databricks.WorkflowException: com.databricks.NotebookExecutionException: FAILED
    at com.databricks.workflow.WorkflowDriver.run(WorkflowDriver.scala:71)
    at com.databricks.dbutils_v1.impl.NotebookUtilsImpl.run(NotebookUtilsImpl.scala:122)
    at com.databricks.dbutils_v1.impl.NotebookUtilsImpl._run(NotebookUtilsImpl.scala:89)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.databricks.NotebookExecutionException: FAILED
    at com.databricks.workflow.WorkflowDriver.run0(WorkflowDriver.scala:117)
    at com.databricks.workflow.WorkflowDriver.run(WorkflowDriver.scala:66)
    ... 13 more

Any help would be welcome, thanks!



Solution 1:[1]

Thanks to @AlexOtt, I identified the origin of my issue.

The main takeaway I would like to share is to double check job parameters passing between the notebooks (and especially the "type cast" that happen with the standard way of passing arguments)

In my specific case, I wanted to pass an integer to the inner notebook but it was converted to string in the process, and was incorrectly taken into account afterwards.

In the outer notebook:

# set up the parameter dict
jobs_params = {
    ...
    'max_accounts': 0,  # set to 0 to parse all the accounts sent    
}

# call the inner notebook
dbutils.notebook.run("./01_JSON_Processing", 1800, jobs_params)

In the inner notebook:

arg_list = [
    ...
    'max_accounts',
]

v = dict()

for arg in arg_list:
    dbutils.widgets.text(arg, "", "")
    v[arg] = dbutils.widgets.get(arg)

Checking the type of v['max_accounts'] showed that it had been converted to a string in the process (and further computation resulted in the KeyError exception).

I did not identify the issue as when debugging the inner notebook, I just copy/pasted the job_params values in the inner notebook, but this did not reproduce the casting of max_accounts as a string in the process.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Pierre Massé