'SQL order of execution
Solution 1:[1]
Spark SQL engine is obviously different than the normal SQL engine because it is a distributed SQL engine. The normal SQL order of execution does not applied here because when you execute a query via Spark SQL, the engine converts it into optimized DAG before it is distributed across your worker nodes. The worker nodes then do map, shuffle, and reduce tasks before the result is aggregated and returned to the driver node. Read more about Spark DAG here.
Therefore, there are more than just one selecting, filtering, aggregation happening before it returns any result. You can see it yourself by clicking on Spark job view on the Databricks query result panel and then select Associated SQL Query.
So, when it comes to Spark SQL, I recommend we refer to Spark document which clearly indicates that Having clause can refer to aggregation function by its alias.
Solution 2:[2]
As we know 'having' clause execute before the select one
This affirmation is wrong. The HAVING
clause is used to apply filters in aggregation functions (such as SUM
, AVG
, COUNT
, MIN
and MAX
). Since they need to be calculated BEFORE applying any filter, in fact, the SELECT
statement is done when the HAVING
clause start to be processed.
Even if the previous paragraph was not true, it is important to consider that SQL statements are interpreted as a whole before any processing. Due to this, it doesn't really matter the order of the instructions: the interpreter can link all references so they make sense in runtime.
So it would be perfectly feasible to put the HAVING
clause before the SELECT
or in any part of the instruction, because this is just a syntax decision. Currently, HAVING
clause is after GROUP BY
clause because someone decided that this syntax makes more sense in SQL.
Finally, you should consider that allowing you to reference something by an alias is much more a language feature than a rational on how the instruction is processed.
Solution 3:[3]
the order of exution is
- Getting Data (From, Join)
- Row Filter (Where)
- Grouping (Group by)
- Group Filter (Having)
- Return Expressions (Select)
- Order & Paging (Order by & Limit / Offset)
I still don't get, why you are asking about, syntactially your seelect qiery is correct, but if it the correct result we can not know
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | nbk |