Is there any way to unnest BigQuery columns in Databricks in a single PySpark script?

I am trying to connect to BigQuery from the latest Databricks runtime (7.1+, Spark 3.0), using PySpark as the base language.

We ran the PySpark script below to fetch data from a BigQuery table into Databricks:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('bq')
    .master('local[4]')
    .config('parentProject', 'google-project-ID')
    .config('spark.jars', 'jarlocation.jar')   # spark-bigquery connector jar
    .getOrCreate()
)

df = (
    spark.read.format("bigquery")
    .option("credentialsFile", "file path")    # service-account key file
    .option("parentProject", "google-project-ID")
    .option("project", "Dataset-Name")
    .option("table", "dataset.schema.tablename")
    .load()
)

After running the script, when we tried to view the data, it came back in a nested format:

{"visitId":"1607519947"},
{"visitStartTime":"1607519947"},
{"date":"20201209"},
{"totals":{"visits": 1, "hits": 1, "pageviews": 1, "timeOnSite": null, "bounces": 1, "transactions": null, "transactionRevenue": null, "newVisits": 1, "screenviews": null, "uniqueScreenviews": null, "timeOnScreen": null, "totalTransactionRevenue": null, "sessionQualityDim": 0}},
{"hits": [{"hitNumber": 1, "time": 0, "hour": 14, "minute": 19, "isExit": true, "referer": null, 
"page": {"pagePath": "/nieuws/Post-hoc-analyse-naar-KRd-bij-18-maanden-na-randomisatie", "hostname": "www.amgenhematologie.nl", "pagePathLevel4": ""}, 
"transaction": {"transactionId": null, "transactionRevenue": null, "transactionTax": null, "transactionShipping": null, "affiliation": null},
"item": {"transactionId": null, "productName": null, "productCategory": null, "productSku": null, "itemQuantity": null, "itemRevenue": null, "currencyCode": "(not set)", "localItemRevenue": null}, 
"eventInfo": null, 
"product": [], 
"promotion": [], 
"promotionActionInfo": null, "refund": null, 
"eCommerceAction": {"action_type": "0", "step": 1, "option": null}, 
"experiment": [], 
"publisher": null, 
"customVariables": [], 
"customDimensions": [], 
"customMetrics": [], 
"type": "PAGE", 
"social": {"socialInteractionNetwork": null, "socialInteractionAction": null, "socialInteractions": null, "socialInteractionTarget": null, "socialNetwork": "(not set)", "uniqueSocialInteractions": null, "hasSocialSourceReferral": "No", "socialInteractionNetworkAction": " : "}, 
"dataSource": "web", 
"publisher_infos": []}]}

Sample Nested Dataframe

Above is sample data in the nested format.

Here, the first 3 columns, visitId, visitStartTime and date, are flat (non-nested) columns.

The 4th column, totals, is nested and needs to be unnested into fields like totals.visits, totals.hits etc., each as a separate column header with its value, like the first 3 columns.

The same goes for the 5th column, hits, which holds multiple nested dictionaries; each field inside those dictionaries should also become a separate column header, as described for the 4th column above.

Is there any way to unnest the data in PySpark while reading it directly from BigQuery?

Help will be appreciated. Thanks in advance!



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
