Vertica data into pySpark throws "Failed to find data source"

I have Spark 3.2 and Vertica 9.2.

spark = SparkSession.builder.appName("Ukraine").master("local[*]")\
.config("spark.jars", '/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-jdbc-9.2.1-0.jar')\
.config("spark.jars", '/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-spark-3.2.1.jar')\
.getOrCreate()

table = "test"
db = "myDB"
user = "myUser"
password = "myPassword"
host = "myVerticaHost"
part = "12";

opt = {"host" : host, "table" : table, "db" : db, "numPartitions" : part, "user" : user, "password" : password}

df = spark.read.format("com.vertica.spark.datasource.DefaultSource").options(**opt).load()

This gives:

Py4JJavaError: An error occurred while calling o77.load.
: java.lang.ClassNotFoundException: 
Failed to find data source: com.vertica.spark.datasource.DefaultSource. Please find packages at
http://spark.apache.org/third-party-projects.html

~/shivamenv/venv/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

~/shivamenv/venv/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Before this step, I used wget to download the two jars (the ones referenced in the SparkSession config) into the Spark jars folder.

I obtained them from:

https://libraries.io/maven/com.vertica.spark:vertica-spark
https://www.vertica.com/download/vertica/client-drivers/

I'm not sure what I'm doing wrong here. Is there an alternative to the spark.jars option?
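
One thing worth checking: spark.jars is a single comma-separated setting, and calling .config("spark.jars", ...) twice keeps only the last value, so both jars have to go into one entry. A minimal, untested sketch using the same paths as above:

from pyspark.sql import SparkSession

# spark.jars is one comma-separated setting; a second .config("spark.jars", ...)
# call replaces the first value instead of adding to it.
jars = ",".join([
    "/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-jdbc-9.2.1-0.jar",
    "/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-spark-3.2.1.jar",
])

spark = (
    SparkSession.builder.appName("Ukraine")
    .master("local[*]")
    .config("spark.jars", jars)
    .getOrCreate()
)

Whether this alone resolves the ClassNotFoundException depends on whether the downloaded connector jar actually matches Spark 3.2, but it at least ensures both jars reach the driver classpath.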

In the link below:

https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SparkConnector/GettingTheSparkConnector.htm?tocpath=Integrating%20with%20Apache%20Spark%7C_____1

they mention:

Both of these libraries are installed with the Vertica server and are available on all nodes in the Vertica cluster in the following locations:

The Spark Connector files are located in /opt/vertica/packages/SparkConnector/lib. The JDBC client library is /opt/vertica/java/vertica-jdbc.jar

Should one try to replace the local folder jars with these?



Solution 1:

There is no need to replace the local folder jars. Once you copy them to the Spark cluster, you can run the spark-shell command with the following options; please find a sample example below. On a side note, Vertica officially supports only Spark 2.x with Vertica 9.2. I hope this helps.

https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SupportedPlatforms/SparkIntegration.htm

spark-shell --jars vertica-spark2.1_scala2.11.jar,vertica-jdbc-9.2.1-11.jar

date 18:26:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://dholmes14:4040
Spark context available as 'sc' (master = local[*], app id = local-1597170403068).
Spark session available as 'spark'.
Welcome to Spark version 2.4.6

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
Type in expressions to have them evaluated.
Type :help for more information.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage._

// The empty "" option values are placeholders for your host, database, schema and table.
val df1 = spark.read.format("com.vertica.spark.datasource.DefaultSource")
  .option("host", "").option("port", 5433).option("db", "").option("user", "dbadmin")
  .option("dbschema", "").option("table", "").option("numPartitions", 3)
  .option("LogLevel", "DEBUG").load()

// groupBy needs an aggregation (here count) before show() can be called.
val df2 = df1.filter("column_name between 800055 and 8000126").groupBy("column1", "column2").count()

spark.time(df2.show())
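
For comparison with the question's PySpark code, a rough Python equivalent of the read above might look like the following (untested sketch; host, credentials and table names are placeholders, and it assumes the connector and JDBC jars are already on spark.jars):

from pyspark.sql import SparkSession

# Placeholders throughout; assumes the Vertica Spark connector and JDBC jars
# were registered via spark.jars when the session was built.
spark = SparkSession.builder.appName("Ukraine").master("local[*]").getOrCreate()

df1 = (
    spark.read.format("com.vertica.spark.datasource.DefaultSource")
    .option("host", "myVerticaHost")
    .option("port", 5433)
    .option("db", "myDB")
    .option("user", "myUser")
    .option("password", "myPassword")
    .option("table", "test")
    .option("numPartitions", 12)
    .load()
)

df1.show()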

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
