I am reading a MariaDB table from Spark which has date and datetime fields. Spark is throwing an error while reading. Below is the schema of the MariaDB table: Spar
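A minimal sketch of reading such a table over JDBC, assuming the MariaDB JDBC driver jar is on the classpath; the URL, credentials, table name, and the two column names in customSchema are placeholders, with customSchema used only to pull the problematic date/datetime columns in as strings so they can be cast afterwards:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# placeholder connection details; requires the MariaDB JDBC driver on the classpath
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mariadb://dbhost:3306/mydb")
    .option("dbtable", "my_table")
    .option("user", "app_user")
    .option("password", "secret")
    .option("driver", "org.mariadb.jdbc.Driver")
    # read the troublesome date/datetime columns as strings and cast them later;
    # "created_at" and "updated_on" are hypothetical column names
    .option("customSchema", "created_at STRING, updated_on STRING")
    .load()
)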
Say you have something like this: big_table1 = spark.table('db.big_table1').cache() big_table2 = spark.table('db.big_table2').cache() big_table2 = spark.table('
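The snippet is cut off, but based on the visible calls, here is a minimal sketch of caching named tables and releasing them afterwards, assuming an active SparkSession bound to spark:

big_table1 = spark.table("db.big_table1").cache()
big_table2 = spark.table("db.big_table2").cache()

# ... run the joins/aggregations that need the cached data ...

# release the cached blocks explicitly when finished
big_table1.unpersist()
big_table2.unpersist()
# or drop every cached table/DataFrame in this session at once
spark.catalog.clearCache()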
I am trying to build a cube in Kylin. It successfully completes the first step and then just keeps running, always at 50%. Log from step 2: 2022-05-13 13:54:40,640
I have an HDFS directory as below: /user/staging/app_name/2022_05_06 Under this directory I have around 1000 part files. I want to loop over each of the part file
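A sketch of looping over the part files from PySpark via the JVM Hadoop FileSystem API; it relies on the internal _jvm/_jsc accessors and assumes an active SparkSession bound to spark:

jvm = spark.sparkContext._jvm
jsc = spark.sparkContext._jsc
fs = jvm.org.apache.hadoop.fs.FileSystem.get(jsc.hadoopConfiguration())
dir_path = jvm.org.apache.hadoop.fs.Path("/user/staging/app_name/2022_05_06")

for status in fs.listStatus(dir_path):
    part_path = status.getPath().toString()
    if "part-" in part_path:      # keep only the part files
        print(part_path)          # process each part file here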
I was trying to read data using the MongoDB Spark connector, and want to partition the dataset on a key, reading from a MongoDB standalone instance. I was looking at t
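A hedged sketch of passing a partitioner to the read, using the option names from the mongo-spark-connector 3.x documentation (the 10.x connector renames them); the URI, database, collection, and partition key are placeholders:

# assumes an active SparkSession bound to `spark` and the 3.x MongoDB Spark connector
df = (
    spark.read.format("mongo")
    .option("uri", "mongodb://mongo-host:27017")
    .option("database", "mydb")
    .option("collection", "mycoll")
    .option("partitioner", "MongoSamplePartitioner")
    .option("partitionerOptions.partitionKey", "my_key")
    .option("partitionerOptions.partitionSizeMB", "64")
    .load()
)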
I am trying to create a Scala UDF for Spark that can be used in Spark SQL. The objective of the function is to accept any column type as input and put it in a
I can't believe I have to ask this here, but there seems to be no documentation on what the HWC actually does. All I can find is that it returns a boolean: publi
Spark: 3.0.0 Scala: 2.12.8 My data frame has a column with a JSON string and I want to create a new column from it with a StructType. |temp_json_string
My data is stored in S3 (Parquet format) under different paths and I'm using spark.read.parquet(pathes:_*) to read all the paths into one dataframe. Un
Is there a canonical way to compute the weighted average in pyspark ignoring missing values in the denominator sum? Take the following example: # create data da
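A minimal sketch of one common approach, using hypothetical columns grp, x (value, possibly null), and w (weight): count a weight in the denominator only when its value is not null.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical example data: (group, value, weight) with a missing value
df = spark.createDataFrame(
    [("a", 1.0, 2.0), ("a", None, 3.0), ("a", 3.0, 1.0), ("b", 4.0, 2.0)],
    ["grp", "x", "w"],
)

# F.when(...) without otherwise() yields null, and sum() ignores nulls, so rows
# with a missing value contribute to neither the numerator nor the denominator
wavg = df.groupBy("grp").agg(
    (
        F.sum(F.when(F.col("x").isNotNull(), F.col("x") * F.col("w")))
        / F.sum(F.when(F.col("x").isNotNull(), F.col("w")))
    ).alias("weighted_avg")
)
wavg.show()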
I am running a delete query with < (less than) and > (greater than) conditions on a timestamp field, but we are not getting the desired results. Fir
I have been trying to set up Spark so I can use it with the pyspark library. I installed the JDK, Hadoop, and Spark, and also set the environment variables correctly.
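A quick smoke test for the installation, assuming pyspark is importable and JAVA_HOME/SPARK_HOME are set; the app name is arbitrary:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("env-smoke-test")
    .getOrCreate()
)
print(spark.range(10).count())   # should print 10 if the setup is working
spark.stop()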
There are some solutions here: Windows Spark Error java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils. The mentioned
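One commonly reported cause of this particular NoClassDefFoundError is running Spark 3.2.x on JDK 16/17, where sun.nio.ch is no longer open to unnamed modules; a hedged workaround sketch (switching to JDK 8/11 is the simpler fix if that is an option):

from pyspark.sql import SparkSession

# open sun.nio.ch to the driver JVM; only relevant if the error is the JDK 16/17 issue
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.driver.extraJavaOptions",
            "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED")
    .getOrCreate()
)
# if the JVM is launched elsewhere (spark-shell, spark-submit), pass the same flag
# via --driver-java-options or spark-defaults.conf instead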
When joining two tables, I would like to select all columns except two of them from a large table with many columns, using PySpark SQL on Databricks. My PySpark SQL: %
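A minimal sketch of the usual workaround: join first, then drop the two unwanted columns instead of listing every column to keep. Table names, the join key, and the column names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

large_df = spark.table("db.large_table")
small_df = spark.table("db.small_table")

result = (
    large_df.join(small_df, on="id", how="inner")   # placeholder join key
    .drop("unwanted_col_1", "unwanted_col_2")       # the two columns to exclude
)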
I've installed Apache Spark on my Mac with 16 GB of RAM to test my PySpark code locally with small data sets before I test it on a real cluster. I've installed
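A minimal local-mode session for that kind of testing; the memory figure is an assumption, not something stated in the question:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                     # use all local cores
    .appName("local-pyspark-test")
    .config("spark.driver.memory", "8g")    # in local mode the driver also does the executor work
    .getOrCreate()
)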
I wrote a Parquet file containing 1,000,000 rows. When I read the Parquet file back, the result is 1,000,000 rows. df = spark.read.parquet(parquet_path) df.
I want to stop my Spark instance once I complete my job running in a Jupyter notebook. I did execute spark.stop() at the end, but when I open my terminal, I'
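For reference, a short sketch of what spark.stop() does and does not shut down in this kind of setup:

# stops the SparkSession/SparkContext owned by this notebook kernel
spark.stop()

# note: the JVM gateway started for the kernel stays alive until the kernel itself
# is shut down, and any standalone master/worker daemons started with the sbin
# scripts keep running and must be stopped separately (e.g. $SPARK_HOME/sbin/stop-all.sh)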
I have code similar to this in Spark (Scala). I would like to know the number of records this code updated/inserted once execute() is complete. Is there a way?
This is what the console of the Kafka consumer looks like: ["2017-12-31 16:06:01", 12472391, 1] ["2017-12-31 16:06:01", 12472097, 1] ["2017-12-31 16:05:59", 12471979,
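Assuming the goal is to consume and parse these JSON-array values in Spark Structured Streaming (the question is cut off), a hedged sketch; the broker address, topic, and output column names are placeholders, and the spark-sql-kafka package must be on the classpath:

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder broker
    .option("subscribe", "my_topic")                        # placeholder topic
    .load()
)

# each value is a JSON array like ["2017-12-31 16:06:01", 12472391, 1]; parse it as
# an array of strings and cast the elements afterwards
parsed = (
    raw.select(F.from_json(F.col("value").cast("string"),
                           T.ArrayType(T.StringType())).alias("arr"))
    .select(
        F.col("arr")[0].cast("timestamp").alias("event_time"),
        F.col("arr")[1].cast("long").alias("record_id"),    # hypothetical name
        F.col("arr")[2].cast("int").alias("flag"),          # hypothetical name
    )
)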
I'm trying to write a dataframe to AWS Keyspaces, but I'm getting the messages below. Stack: dfExploded.write.cassandraFormat(table = "table", keyspa