Is there any way to unnest BigQuery columns in Databricks in a single PySpark script?
I am trying to connect to BigQuery from Databricks (runtime 7.1+, Spark 3.0) using PySpark as the scripting language.
We ran the PySpark script below to fetch data from a BigQuery table into Databricks:
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName('bq')
.master('local[4]')
.config('parentProject', 'google-project-ID')
.config('spark.jars', 'jarlocation.jar')
.getOrCreate()
)
df = spark.read.format("bigquery").option("credentialsFile", "file path") \
.option("parentProject", "google-project-ID") \
.option("project", "Dataset-Name") \
.option("table","dataset.schema.tablename") \
.load()
After running the script, when we tried to view the data, it came back in a nested format:
{"visitId":"1607519947"},
{"visitStartTime":"1607519947"},
{"date":"20201209"},
{"totals":{"visits": 1, "hits": 1, "pageviews": 1, "timeOnSite": null, "bounces": 1, "transactions": null, "transactionRevenue": null, "newVisits": 1, "screenviews": null, "uniqueScreenviews": null, "timeOnScreen": null, "totalTransactionRevenue": null, "sessionQualityDim": 0}},
{"hits": [{"hitNumber": 1, "time": 0, "hour": 14, "minute": 19, "isExit": true, "referer": null,
"page": {"pagePath": "/nieuws/Post-hoc-analyse-naar-KRd-bij-18-maanden-na-randomisatie", "hostname": "www.amgenhematologie.nl", "pagePathLevel4": ""},
"transaction": {"transactionId": null, "transactionRevenue": null, "transactionTax": null, "transactionShipping": null, "affiliation": null},
"item": {"transactionId": null, "productName": null, "productCategory": null, "productSku": null, "itemQuantity": null, "itemRevenue": null, "currencyCode": "(not set)", "localItemRevenue": null},
"eventInfo": null,
"product": [],
"promotion": [],
"promotionActionInfo": null, "refund": null,
"eCommerceAction": {"action_type": "0", "step": 1, "option": null},
"experiment": [],
"publisher": null,
"customVariables": [],
"customDimensions": [],
"customMetrics": [],
"type": "PAGE",
"social": {"socialInteractionNetwork": null, "socialInteractionAction": null, "socialInteractions": null, "socialInteractionTarget": null, "socialNetwork": "(not set)", "uniqueSocialInteractions": null, "hasSocialSourceReferral": "No", "socialInteractionNetworkAction": " : "},
"dataSource": "web",
"publisher_infos": []}]}
Above is a sample of the data in its nested format.
In this sample, the first three columns (visitId, visitStartTime and date) are plain columns.
The fourth column, totals, is a nested struct; it needs to be unnested so that fields such as totals.visits, totals.hits, etc. each become a separate column header with its own value, just like the first three columns.
The same goes for the fifth column, hits, which contains multiple nested dictionaries; each field inside those dictionaries should likewise be unnested into its own column, as described for the fourth column above.
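To make the desired result concrete, here is a minimal plain-Python sketch of the "unnesting": nested keys are promoted to dotted top-level names (totals becomes totals.visits, totals.hits, ...). The flatten helper is illustrative only and is not part of the Spark/BigQuery connector:

```python
def flatten(record, prefix=""):
    """Recursively flatten nested dicts into dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Descend into the nested struct, extending the dotted prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

row = {"visitId": "1607519947", "totals": {"visits": 1, "hits": 1}}
flatten(row)
# → {"visitId": "1607519947", "totals.visits": 1, "totals.hits": 1}
```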
Is there any way to unnest the data in PySpark while reading it from BigQuery directly?
Any help is appreciated. Thanks in advance!
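One assumed approach (a sketch, not verified against the BigQuery connector): after the load, expand struct columns with "parent.child" paths via selectExpr, and explode array columns like hits first. build_select_exprs below is a hypothetical helper that only builds the expression strings, so it runs without a live Spark session:

```python
def build_select_exprs(plain_cols, struct_cols):
    """Build selectExpr strings that flatten struct fields into
    top-level columns named parent_child."""
    exprs = list(plain_cols)
    for parent, children in struct_cols.items():
        for child in children:
            exprs.append(f"{parent}.{child} AS {parent}_{child}")
    return exprs

exprs = build_select_exprs(
    ["visitId", "visitStartTime", "date"],
    {"totals": ["visits", "hits", "pageviews"]},
)
# exprs now starts with the plain columns, followed by entries such as
# "totals.visits AS totals_visits".

# With the loaded DataFrame, the flattening would look like (untested sketch):
#   flat_df = df.selectExpr(*exprs)
# For the array column "hits", explode it first so each hit becomes a row:
#   from pyspark.sql.functions import explode
#   hits_df = df.withColumn("hit", explode("hits")).select("visitId", "hit.*")
```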
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow