PySpark Delta Lake metastore

Using "spark.sql.warehouse.dir" in the same jupyter session (no databricks) works. But after a kernel restart in jupyter the catalog db and tables arent't recognized anymore. Isn't it possible to have session independence using the metastore logic with delta-lake outside databricks (I know the possibility using path)?

Thanks, Christian


from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tmp")
    # Delta Lake package, SQL extension, and catalog implementation.
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    # Location where managed tables are stored.
    .config("spark.sql.warehouse.dir", "/home/user/data")
    .getOrCreate()
)


# Write a managed Delta table into the warehouse dir, then query and inspect the catalog.
df = spark.range(100)
df.write.format("delta").mode("overwrite").saveAsTable("rnd")
spark.sql("SELECT * FROM rnd").show()
spark.catalog.listDatabases()
spark.catalog.listTables()
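
For reference, path-based access keeps working after a kernel restart because it bypasses the catalog entirely; the path below is an assumption based on the configured warehouse dir and the table name.

# Read the Delta table directly by path -- no catalog lookup involved.
# Path is an assumption: spark.sql.warehouse.dir + table name.
df = spark.read.format("delta").load("/home/user/data/rnd")
df.show()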



Solution 1:[1]

Databricks uses a Hive metastore. If you want the same workflow as with Databricks, setting up a Hive metastore instance is your best option. You can do it locally if you want or need to; I checked, and it works.
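
A minimal local sketch of that, reusing the Delta configs from the question: simply enabling Hive support makes Spark use its default embedded Derby metastore (a metastore_db directory in the current working directory), so the catalog survives kernel restarts as long as you start Spark from the same directory.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("local hive metastore")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .config("spark.sql.warehouse.dir", "/home/user/data")
    # Default embedded Derby metastore: creates ./metastore_db in the working directory.
    .enableHiveSupport()
    .getOrCreate()
)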

Other than that, you could try playing around with spark.catalog to dump information about those tables and databases into a file and then reload it each time you start a new session, but I wouldn't recommend that.
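
If you wanted to try it anyway, a rough sketch could look like the following; the dump file path is illustrative, and it assumes every table is a Delta table stored under the warehouse directory.

import json

# Dump table names from the current session's catalog (illustrative file path).
tables = [t.name for t in spark.catalog.listTables()]
with open("/home/user/data/_catalog_dump.json", "w") as f:
    json.dump(tables, f)

# In a later session: re-register each table against its Delta files by path.
with open("/home/user/data/_catalog_dump.json") as f:
    for name in json.load(f):
        spark.sql(
            f"CREATE TABLE IF NOT EXISTS {name} USING DELTA "
            f"LOCATION '/home/user/data/{name}'"
        )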

Anyway, to my knowledge there's no centralized Delta metastore other than Hive.

UPDATE

Spark has a metastore built in, and it defaults to Derby. There's a simple way to define a Hive metastore database that is managed internally by Spark. With this in place you don't have to set up Hadoop and Hive; all you need is a database. You can point it at some shared database instance, or set it up locally. My choice was MySQL running in Docker, but it's up to you. Below is my example of creating a new SparkSession with a MySQL instance as the metastore, though it would make more sense to keep these connection configs in spark-defaults.conf, or better yet in some secure location.

from pyspark.sql import SparkSession

# Hive metastore backed by MySQL; Spark manages the metastore through these JDO connection settings.
spark = SparkSession.builder.appName("metastore test") \
    .config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:mysql://localhost:3306/metadata_db") \
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "user") \
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "password") \
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "com.mysql.cj.jdbc.Driver") \
    .config("spark.sql.warehouse.dir", "/path/to/warehouse") \
    .enableHiveSupport().getOrCreate()
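
The same keys can go into spark-defaults.conf instead of the builder; the values below are placeholders.

spark.hadoop.javax.jdo.option.ConnectionURL         jdbc:mysql://localhost:3306/metadata_db
spark.hadoop.javax.jdo.option.ConnectionUserName    user
spark.hadoop.javax.jdo.option.ConnectionPassword    password
spark.hadoop.javax.jdo.option.ConnectionDriverName  com.mysql.cj.jdbc.Driver
spark.sql.warehouse.dir                             /path/to/warehouse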

NOTES

Just make sure you've created a database user that can be accessed from Spark and that you have the JDBC driver for your database available to Spark.

I was also missing the Hive schema in MySQL, so I executed the schema creation manually using the script that can be found here: https://github.com/apache/hive/tree/master/metastore/scripts/upgrade/mysql

Here's some context on Spark's ExternalCatalog that I came across: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-ExternalCatalog.html

Of course, you can modify this builder to add Delta support.
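
For example, merging the Delta settings from the question into this builder might look like the sketch below. The MySQL connector coordinates are an assumption; depending on your setup you may need to put the driver jar on the driver classpath (e.g. via spark.jars or spark.driver.extraClassPath) instead of spark.jars.packages.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("metastore plus delta")
    # Delta Lake settings from the question; MySQL connector coordinates are an assumption.
    .config(
        "spark.jars.packages",
        "io.delta:delta-core_2.12:1.0.0,mysql:mysql-connector-java:8.0.28",
    )
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    # Hive metastore settings from the builder above.
    .config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:mysql://localhost:3306/metadata_db")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "user")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "password")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "com.mysql.cj.jdbc.Driver")
    .config("spark.sql.warehouse.dir", "/path/to/warehouse")
    .enableHiveSupport()
    .getOrCreate()
)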

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
