Loading Vertica data into PySpark throws "Failed to find data source"
I have Spark 3.2 and Vertica 9.2.
spark = SparkSession.builder.appName("Ukraine").master("local[*]")\
    .config("spark.jars",  # both jars must go in one comma-separated spark.jars value
            "/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-jdbc-9.2.1-0.jar,"
            "/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-spark-3.2.1.jar")\
    .getOrCreate()
table = "test"
db = "myDB"
user = "myUser"
password = "myPassword"
host = "myVerticaHost"
part = "12";
opt = {"host" : host, "table" : table, "db" : db, "numPartitions" : part, "user" : user, "password" : password}
df = spark.read.format("com.vertica.spark.datasource.DefaultSource").options().load()
This gives:
Py4JJavaError: An error occurred while calling o77.load.
: java.lang.ClassNotFoundException:
Failed to find data source: com.vertica.spark.datasource.DefaultSource. Please find packages at
http://spark.apache.org/third-party-projects.html
~/shivamenv/venv/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
~/shivamenv/venv/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Before this step, I downloaded (with wget) the two jars into the Spark jars folder (the ones referenced in the SparkSession config), which I obtained from
https://libraries.io/maven/com.vertica.spark:vertica-spark and https://www.vertica.com/download/vertica/client-drivers/
I'm not sure what I'm doing wrong here. Is there an alternative to the spark.jars option?
In the below link, they mention:
Both of these libraries are installed with the Vertica server and are available on all nodes in the Vertica cluster in the following locations:
The Spark Connector files are located in /opt/vertica/packages/SparkConnector/lib. The JDBC client library is /opt/vertica/java/vertica-jdbc.jar
Should one try to replace local folder jars with these?
Solution 1:
There is no need to replace the local folder jars. Once you copy them to the Spark cluster, you can run the spark-shell command with the following options; please find a sample example below. On a side note, Vertica officially supports only Spark 2.x with Vertica 9.2, which may explain the error with Spark 3.2. I hope this helps.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SupportedPlatforms/SparkIntegration.htm
spark-shell --jars vertica-spark2.1_scala2.11.jar,vertica-jdbc-9.2.1-11.jar
date 18:26:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://dholmes14:4040
Spark context available as 'sc' (master = local[*], app id = local-1597170403068).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
Type in expressions to have them evaluated.
Type :help for more information.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage._

val df1 = spark.read.format("com.vertica.spark.datasource.DefaultSource")
  .option("host", "")            // Vertica host
  .option("port", 5433)
  .option("db", "")              // database name
  .option("user", "dbadmin")
  .option("dbschema", "")        // schema name
  .option("table", "")           // table name
  .option("numPartitions", 3)
  .option("LogLevel", "DEBUG")
  .load()

// groupBy needs an aggregation (e.g. count) before the result can be shown
val df2 = df1.filter("column_name between 800055 and 8000126")
  .groupBy("column1", "column2")
  .count()

spark.time(df2.show())
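If you want to do the same thing from PySpark, a minimal sketch of the equivalent session is below. This is only a sketch based on the answer above: it assumes a Spark 2.4.x installation with the Vertica 9.2 Spark connector and JDBC jars already downloaded, and the jar paths, host, and credentials are placeholders rather than values from the original post.

# Hedged PySpark sketch of the same approach; jar paths, host, and
# credentials are placeholders to be replaced with your own values.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("VerticaRead")
    .master("local[*]")
    # Both jars go in ONE comma-separated spark.jars value; a second
    # .config("spark.jars", ...) call would silently replace the first.
    .config("spark.jars",
            "/path/to/vertica-spark2.1_scala2.11.jar,"
            "/path/to/vertica-jdbc-9.2.1-11.jar")
    .getOrCreate()
)

opt = {
    "host": "myVerticaHost",
    "db": "myDB",
    "user": "myUser",
    "password": "myPassword",
    "table": "test",
    "numPartitions": "12",
}

df = (
    spark.read
    .format("com.vertica.spark.datasource.DefaultSource")
    .options(**opt)   # the options dict has to be passed explicitly
    .load()
)
df.show()

Equivalently, the jars can be supplied on the command line, e.g. pyspark --jars vertica-spark2.1_scala2.11.jar,vertica-jdbc-9.2.1-11.jar, instead of through the spark.jars config.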
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow