What is the difference between writing logic in Python or in plain SQL in Foundry's SQL Transforms?
Foundry's transforms offer at least two ways to write logic: Python and plain SQL.
I have already noticed some differences:
- SQL does not allow incremental computation,
- SQL does not allow the use of variables, constants, or factored-out functions,
- SQL does not allow adding metadata to the output dataset, such as column descriptions.
Am I wrong on any of these points, and are there other differences (e.g. execution time, consumed resources)?
Solution 1:[1]
This is a bit of a subjective question, but let me give it a shot. With both SQL and Python, the purpose of a transform is to build a Spark query, which executes and returns a result; that result is saved in your output dataset.
SQL, as the name says, is a "structured query language" that generates a query directly, while Python is a general-purpose programming language that relies on a library called PySpark to generate queries.
While SQL generates your query plan and goes straight to the executors, Python lets you run code on the driver, which in turn lets you use the language's tooling to help you.
So the main difference is that you can write tooling when using Python, and you can't when using SQL. The things you list above (incremental computation, column descriptions, etc.) are all possible because Python is a regular programming language: Palantir Foundry ships libraries for them, and you can also write your own if you want. SQL, by contrast, is just the query language, so it has no notion of libraries or of Foundry itself. Python also makes the whole code base much easier to maintain, test, and extend.
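For illustration, here is a minimal sketch of a Python transform using the standard `transforms.api` decorators. The dataset paths, column names, and the `VAT_RATE` constant are hypothetical; the point is that constants and reusable helper functions are plain Python, so they can be shared and unit tested, which has no equivalent in a plain SQL transform.

```python
# Minimal sketch of a Foundry Python transform; dataset paths and
# column names are placeholders, not a real project layout.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output

# A constant and a reusable helper: ordinary Python, so they can be
# imported from a shared library and covered by unit tests.
VAT_RATE = 0.2


def add_gross_amount(df: DataFrame, rate: float) -> DataFrame:
    """Append a gross_amount column derived from net_amount."""
    return df.withColumn("gross_amount", F.col("net_amount") * (1 + rate))


@transform_df(
    Output("/Project/datasets/orders_enriched"),
    orders=Input("/Project/datasets/orders"),
)
def compute(orders: DataFrame) -> DataFrame:
    # Driver-side Python builds the Spark query plan; executors run it.
    return add_gross_amount(orders, VAT_RATE)
```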
There are too many other differences to list here, so I invite you to experiment with Python transforms and read through the docs at https://www.palantir.com/docs/foundry/transforms-python/transforms-python-api/. Here are some of my favourites: Data Expectations, unit testing, capturing reusable PySpark logic as Python libraries, multi-output transforms, consuming publicly available open-source libraries for data manipulation, and accessing the dataset file system for manual parsing when needed.
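To give a flavour of two of those features, here is a hedged sketch combining an incremental transform with a data expectation. It assumes the `@incremental` decorator and the `transforms.expectations` module work as described in the Foundry docs linked above; the dataset paths and the `event_id` column are placeholders.

```python
# Sketch only: assumes the documented @incremental and expectations APIs;
# paths and column names are hypothetical.
from transforms import expectations as E
from transforms.api import Check, Input, Output, incremental, transform


@incremental()  # on later builds, only unprocessed input rows are read
@transform(
    out=Output(
        "/Project/datasets/events_clean",
        checks=Check(E.primary_key("event_id"), "unique event id", on_error="FAIL"),
    ),
    events=Input("/Project/datasets/events_raw"),
)
def compute(events, out):
    new_rows = events.dataframe()  # incremental read of new rows
    out.write_dataframe(new_rows.dropDuplicates(["event_id"]))
```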
P.S.: You can also write transforms in other languages, such as Java.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow