How to select all columns except 2 of them from a large table on pyspark sql?
When joining two tables in PySpark SQL on Databricks, I would like to select all columns from a large table with many columns, except two of them.
My pyspark sql:
%sql
set hive.support.quoted.identifiers=none;
select a.*, `(?!(b.year|b.month)$).+`
from MY_TABLE_A as a
left join
MY_TABLE_B as b
on a.year = b.year and a.month = b.month
I followed the answers to "hive: select all column exclude two" and "Hive: How to select all but one column?",
but it does not work for me; all columns still appear in the results. I would like to remove the duplicated columns (year and month) from the result.
thanks
Solution 1:[1]
In pyspark, you can do something like this:
df.select([c for c in df.columns if c not in {'col1', 'col2', 'col3'}])
where df is the resulting DataFrame after the join operation is performed.
Solution 2:[2]
set hive.support.quoted.identifiers=none
is not supported in Spark. Instead, set
spark.sql.parser.quotedRegexColumnNames=true
to get the same behavior as Hive.
Example:
df=spark.createDataFrame([(1,2,3,4)],['id','a','b','c'])
df.createOrReplaceTempView("tmp")
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
#select all columns except a,b
spark.sql("select `(a|b)?+.+` from tmp").show()
#+---+---+
#| id| c|
#+---+---+
#| 1| 4|
#+---+---+
Solution 3:[3]
As of Databricks Runtime 9.0, you can use the * except()
syntax like this:
df = spark.sql("select a.* except(col1, col2, col3) from my_table_a...")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | frosty |
| Solution 2 | |
| Solution 3 | David Maddox |