PySpark Py4JJavaError while creating the Delta table
Here is the PySpark code, which is running in a Jupyter notebook.
import pyspark
from delta import *
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog",
"org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
Error:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x30cb5b99) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x30cb5b99
Configuration:
- delta-spark=1.1.0
- pyspark=3.2.0
- Java version: openjdk 17.0.1 2021-10-19; OpenJDK Runtime Environment Homebrew (build 17.0.1+1); OpenJDK 64-Bit Server VM Homebrew (build 17.0.1+1, mixed mode, sharing)
.bash_profile:
export HADOOP_HOME=/opt/hadoop-2.8.0
export SPARK_HOME=/opt/spark-3.2.0-bin-hadoop3.2
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
Please help me resolve the error. Thanks in advance.
Solution 1:
Note: My English is poor, so I use deepl.com to translate from my native language, and I try to rely on code as much as possible.
Installing PySpark 3.1 or 3.2 with pip or conda can run into some problems, for reasons I haven't been able to pin down; the workaround is as follows.
- Install Python 3.8 and Java 8.
It is recommended to use Anaconda or Miniconda to install Python 3.8 and Jupyter Notebook, and also to install JDK 8 (https://www.oracle.com/java/technologies/downloads/#java8), as the Spark 3.2.1 documentation says: "Spark runs on Java 8/11, Scala 2.12/2.13, Python 3.6+ and R 3.5+. Python 3.6 support is deprecated as of Spark 3.2.0. Java 8 prior to version 8u201 support is deprecated as of Spark 3.2.0. For the Scala API, Spark 3.2.1 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x). For Python 3.9, Arrow optimization and pandas UDFs might not work due to the supported Python versions in Apache Arrow. Please refer to the latest Python Compatibility page. For Java 11, -Dio.netty.tryReflectionSetAccessible=true is required additionally for Apache Arrow library. This prevents java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available when Apache Arrow uses Netty internally."
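As a quick check from Python (a minimal sketch; it assumes the java binary on your PATH is the one Spark will launch), you can print the Java version the driver will see:

import subprocess

# Spark 3.2 expects Java 8 or 11; "java -version" prints its output to stderr
subprocess.run(["java", "-version"])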
Install py4j and pyarrow:
conda install py4j pyarrow
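To confirm that both packages are importable from the same environment the notebook uses, a minimal sketch:

import py4j
import pyarrow

# The exact versions will vary; they only need to satisfy the PySpark 3.2 requirements
print(py4j.__version__, pyarrow.__version__)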
If you already have pyspark installed via pip or conda, uninstall it first. Download the Spark archive from https://spark.apache.org/downloads.html and unzip it, then copy the pyspark folder from the python directory inside the Spark folder into the conda package directory (e.g., C:\Users\wjh\miniconda3\Lib\site-packages).
Set the environment variables: define SPARK_HOME to point at the unzipped Spark folder and add '%SPARK_HOME%\bin' to the system Path variable.
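You can verify from the notebook that the variables are actually visible to the Python process (a small sketch; the values printed will be your own paths):

import os

# SPARK_HOME should point at the unzipped Spark folder; PATH should include its bin directory
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PATH"))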
Finally, I'm not familiar with Linux, but the problem with your .bash_profile may be the value of PYSPARK_PYTHON. There is no need to set PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS unless you want a notebook to open when you type pyspark on the command line; but then you won't be able to use the interactive shell.
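Once the versions line up, the snippet from the question should run as-is. A minimal end-to-end sketch (the /tmp/delta-table path is just an example) that also writes and reads back a small Delta table to confirm the setup:

import pyspark
from delta import configure_spark_with_delta_pip

builder = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a tiny Delta table and read it back to confirm Delta Lake is wired up correctly
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-table")
spark.read.format("delta").load("/tmp/delta-table").show()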
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow