Spark SQL - org.apache.spark.sql.AnalysisException

The error described below occurs when I run a Spark job on Databricks the second time (less often the first). The SQL query simply performs a CREATE TABLE AS SELECT from a temp view registered from a DataFrame.
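A rough sketch of the kind of job in question (the DataFrame, view, and table names here are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object ExampleJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()

    // Register a temp view from a DataFrame
    val df = spark.range(10).toDF("id")
    df.createOrReplaceTempView("source_view")

    // CREATE TABLE AS SELECT from the registered temp view; current_date()
    // is the built-in SQL function the analyzer fails to resolve on re-runs
    spark.sql(
      """CREATE TABLE IF NOT EXISTS example_table AS
        |SELECT id, current_date() AS load_date
        |FROM source_view""".stripMargin)
  }
}
```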

My first idea was to call spark.catalog.clearCache() at the end of the job (it didn't help). I also found a post on the Databricks forum about using object ... extends App (Scala) instead of a main method (that didn't help either).

P.S. current_date() is a built-in function and should be provided automatically (that is the expectation).

Spark 2.4.4, Scala 2.11, Databricks Runtime 6.2

```
org.apache.spark.sql.AnalysisException: Undefined function: 'current_date'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 21 pos 4
    at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1318)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1318)
    at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1317)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1309)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:279)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:279)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:76)
```


Solution 1:[1]

Solution: ensure Spark is initialized every time the job is executed.

TL;DR: I had a similar issue, and the object extends App suggestion pointed me in the right direction. In my case I was creating the Spark session outside of main but inside the object body. When the job ran the first time, the cluster/driver loaded the JAR and initialized the spark variable; once the job finished successfully, the JAR was kept in memory but the link to spark was lost for some reason, and subsequent executions did not re-initialize Spark because the JAR was already loaded and the initialization, living outside main, never ran again. I think this is not an issue for Databricks jobs that create a cluster or start one before execution (those behave like the first-time case); it only affects clusters that are already up and running, because JARs are loaded either at cluster start-up or at job execution. So I moved the Spark session creation, i.e. SparkSession.builder()...getOrCreate(), into main, so that the session is re-initialized every time the job is called, as sketched below.
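A minimal sketch of that change (the object name and job logic are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object MyJob {
  // Before: session built in the object body. On a cluster that is already
  // running, the JAR stays loaded between runs, so this val is never
  // re-initialized on subsequent executions.
  // val spark = SparkSession.builder().getOrCreate()

  def main(args: Array[String]): Unit = {
    // After: build (or re-acquire) the session inside main so it is
    // initialized on every job invocation.
    val spark = SparkSession.builder().getOrCreate()

    // ... job logic using spark ...
  }
}
```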

Solution 2:[2]

current_date() is the built-in function and it should be provided automatically (expected)

This expectation is wrong: you have to import the functions.

For Scala:

import org.apache.spark.sql.functions._

where the current_date function is available.
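For example, a minimal sketch of using it through the DataFrame API (the DataFrame here is hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()

// current_date() from functions._ applied as a DataFrame column
val df = spark.range(3).withColumn("load_date", current_date())
df.show()
```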

For PySpark:

from pyspark.sql import functions as F

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1
[2] Solution 2: Ram Ghadiyaram