Category "apache-spark-sql"

HBase | HBase column qualifier hidden in HBase shell commands but visible via hbaserdd Spark code

I am stuck in a very odd situation related to HBase design, I would say. HBase version: 2.1.0-cdh6.2.1. So, the problem statement is, in HBase, w

Flatten Nested Json String Column Table into tabular format

I am currently trying to flatten data in a Databricks table. Since some of the columns are deeply nested and are of 'String' type, I couldn't use explode f

Why is the Spark bucket number not equal to the number of files in the partition?

val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.master", "local").getOrCreate() import spark.implicits._ case class Someth

regexp_extract_all not working with Spark SQL

I'm using a Databricks notebook to extract all field occurrences from a text column using the regexp_extract_all function. Here is the input: field_map#'IFDSIMP.7

Sharing an Oracle table among Spark Nodes using Python

I have a huge Oracle table to process, so I define a list of where clauses to be read by each Spark node. In the middle of the processing I need to join the data

PySpark SQL forbid certain functions/operators

Given a PySpark SQL query such as spark.sql('''select 10%4 as hello ''') what is the best way to throw an exception any time the % operator is used?

Is there a difference between PySpark and SparkSQL? If so, what's the difference?

Long story short, I'm tasked with converting files from SparkSQL to PySpark as my first task at my new job. However, I'm unable to see many differences outside

How do I select the columns of a table in Databricks SQL?

I can use: show columns in table_name but this does not allow me to use the output in a query. This throws an error: SELECT * FROM show columns in table_name

Error while reading date and datetime columns from MariaDB via Spark

I am reading a MariaDB table from Spark which has date and datetime fields. Spark is throwing an error while reading. Below is the schema of the MariaDB table: Spar

Scala spark UDF function that takes input and puts it in an Array

I am trying to create a Scala UDF for Spark that can be used in Spark SQL. The objective of the function is to accept any column type as input, and put it in a

Spark: how to convert a JSON string to a struct column without a schema

Spark: 3.0.0 Scala: 2.12.8 My data frame has a column with a JSON string and I want to create a new column from it with the StructType. |temp_json_string

(py)spark weighted average taking account of missing values

Is there a canonical way to compute the weighted average in pyspark ignoring missing values in the denominator sum? Take the following example: # create data da

How do I Insert Overwrite with parquet format?

I have two Parquet files in Azure Data Lake Gen2 and I want to Insert Overwrite one with the other. I was trying this in Azure Databricks by doing the below. Reading

Processing data from a Kafka stream using PySpark

What the console of the Kafka consumer looks like: ["2017-12-31 16:06:01", 12472391, 1] ["2017-12-31 16:06:01", 12472097, 1] ["2017-12-31 16:05:59", 12471979,

How to apply a pandas geocode function to a PySpark column

Table is like this id ADDRESS 0 6101 SUMMITVIEW AVE STE 200 YAKIMA 1 527 CEDAR WAY SUITE 105 OAKMONT 2 1700 N ROSE AVE SUITE 460 OXNARD 3 1275 YORK AVE NEW YOR

Databricks local test fail with java.lang.NoSuchMethodError: org.apache.hadoop.security.HadoopKerberosName.setRuleMechanism

I have a unit test for Databricks code, and I want to run it locally on Windows. Unfortunately, when I run pytest with PyCharm, it throws the following exception: Exc

Insert Overwrite in Databricks overwriting complete data in table?

I have two tables: one with 50K records and the other with 2.5K records, and I want to update these 2.5K records into table one. Currently I was doing this by us

Spark/Scala approximate group by

Is there a way of counting approximately after a group by on an sql dataset in Spark? Or more generally, what is the fastest way of group by counting in Spark?

How to use Apache Spark to query a Hive table with Kerberos?

I am attempting to use Scala with Apache Spark locally to query a Hive table that is secured with Kerberos. I have no issues connecting and querying the data pro

Spark SQL: Parse date string from dd/mm/yyyy to yyyy/mm/dd

I want to use Spark SQL or PySpark to reformat a date field from 'dd/mm/yyyy' to 'yyyy/mm/dd'. The field type is string: from pyspark.sql import SparkSession fr