PySpark error: AnalysisException: Cannot resolve column name
I am trying to transform an entire df to a single vector column, using
df_vec = vectorAssembler.transform(df.drop('col200'))
I am being thrown this error:
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Cannot resolve column name "col200" among (col1, col2..
I looked around the internet and found that this error can be caused by whitespace in the column headers. The problem is that there are around 1600 columns, and it's quite a task to check each one of them, especially for whitespace. How do I approach this? It's a df with around 800000 rows, FYI.
Looking at df.printSchema(), I don't see any whitespace, at least not leading. I'm fairly sure none of the column names contain internal spaces either.
At this point, I am totally blocked! Any help would be greatly appreciated.
Solution 1:[1]
That happened to me a couple of times, try this:
tempList = []  # collects the cleaned column names
for col in df.columns:
    new_name = col.strip()                 # drop leading/trailing whitespace
    new_name = "".join(new_name.split())   # drop any internal whitespace
    new_name = new_name.replace('.', '')   # drop dots, which Spark treats specially
    tempList.append(new_name)
print(tempList)  # inspect the cleaned names
df = df.toDF(*tempList)  # rename all columns at once
The code trims and removes all whitespace from every column name in your DataFrame.
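The cleaning logic can be sanity-checked on plain strings before touching the DataFrame. The sample names below are made up for illustration; in practice you would feed in df.columns:

```python
# Sanity-check the name-cleaning logic on plain strings.
# Sample names are hypothetical, standing in for df.columns.
def clean_name(name):
    cleaned = name.strip()                # drop leading/trailing whitespace
    cleaned = "".join(cleaned.split())    # drop any internal whitespace
    return cleaned.replace('.', '')       # drop dots, which Spark treats specially

samples = ["col200 ", " col1", "col 2", "price.usd"]
print([clean_name(s) for s in samples])  # ['col200', 'col1', 'col2', 'priceusd']
```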
Solution 2:[2]
The following should work:
import re
from pyspark.sql.functions import col
# remove spaces from column names
newcols = [col(column).alias(re.sub(r'\s+', '', column))
           for column in df.columns]
# rename columns
df = df.select(newcols)
df.show()
EDIT: as a first step, if you just wanted to check which columns have whitespace, you could use something like the following:
space_cols = [column for column in df.columns if re.search(r'\s', column)]  # r'\s*' matches the empty string everywhere, so use re.search with r'\s'
Also, check whether there are any characters that are non-alphanumeric (or space):
non_alnum = [column for column in df.columns if re.search(r'[^a-zA-Z0-9\s]', column)]
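These checks can be tried on a few hypothetical column names first. Note the use of re.search rather than re.findall: a pattern like r'\s*' matches the empty string, so findall returns a non-empty list for every name and the filter would flag everything:

```python
import re

# Illustrative check on made-up column names; in practice, iterate df.columns.
# re.search returns None when the pattern never occurs, so the filter
# keeps only the offending names.
names = ["col1", "col 2", "col200 ", "col#3"]

space_cols = [n for n in names if re.search(r'\s', n)]
non_alnum = [n for n in names if re.search(r'[^a-zA-Z0-9\s]', n)]

print(space_cols)  # ['col 2', 'col200 ']
print(non_alnum)   # ['col#3']
```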
Solution 3:[3]
Backticks (``) or double quotes ("") help Spark identify the entire string as one column name:
df_vec = vectorAssembler.transform(df.drop('`col200`'))
or
df_vec = vectorAssembler.transform(df.drop("col200"))
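Whichever fix is used, printing the names with repr() makes invisible whitespace obvious, since printSchema() hides it. The names below are hypothetical; in practice, loop over df.columns:

```python
# repr() exposes leading/trailing whitespace that plain printing hides.
names = ["col1", "col200 "]  # hypothetical stand-in for df.columns
for n in names:
    print(repr(n))  # the trailing space in 'col200 ' becomes visible
```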
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Anonymous Person
Solution 2 |
Solution 3 | MichiganMadeLearner