PySpark error: AnalysisException: 'Cannot resolve column name'

I am trying to transform an entire df to a single vector column, using

df_vec = vectorAssembler.transform(df.drop('col200'))

I am being thrown this error:

File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Cannot resolve column name "col200" among (col1, col2..

I looked around the internet and found that this error can be caused by whitespace in the column headers. The problem is that there are around 1600 columns, and it's quite a task to check each one of them for whitespace. How do I approach this? It's a df with around 800,000 rows, FYI.

Doing df.printSchema(), I don't see any whitespace, at least not leading. I'm also pretty positive that none of the column names have spaces in between.
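One way to spot characters that printSchema() hides is to print each name with repr(), which makes tabs, non-breaking spaces, and trailing blanks visible. A minimal sketch, where the sample list stands in for df.columns on a real DataFrame (the names here are invented for illustration):

```python
# Hypothetical sample standing in for df.columns on a real DataFrame.
columns = ['col1', 'col2 ', 'col\t3', 'col\xa04']

# repr() exposes whitespace that a plain print would hide.
for name in columns:
    print(repr(name))

# Flag any name that differs from its fully-stripped form.
suspect = [name for name in columns if name != "".join(name.split())]
print(suspect)  # names containing any whitespace
```

With 1600 columns this prints a lot, but scanning the `suspect` list alone narrows the search to the problem names.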

At this point, I am totally blocked! Any help would be greatly appreciated.



Solution 1:[1]

That happened to me a couple of times, try this:

tempList = []
for col in df.columns:
    new_name = col.strip()                 # trim leading/trailing whitespace
    new_name = "".join(new_name.split())   # remove any internal whitespace
    new_name = new_name.replace('.', '')   # remove dots, which Spark reads as nested fields
    tempList.append(new_name)
print(tempList)  # just to inspect the cleaned names

df = df.toDF(*tempList)  # rename all columns at once

The code trims and removes all whitespace (and dots) from every single column name in your DataFrame.
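The renaming logic can be checked in plain Python before running it on the cluster; a minimal sketch with made-up names standing in for df.columns:

```python
columns = ['col 1', ' col2', 'col.3']  # hypothetical messy names

tempList = []
for col in columns:
    new_name = col.strip()                 # trim leading/trailing whitespace
    new_name = "".join(new_name.split())   # remove internal whitespace
    new_name = new_name.replace('.', '')   # remove dots
    tempList.append(new_name)

print(tempList)  # → ['col1', 'col2', 'col3']
```

Once the list looks right, `df.toDF(*tempList)` applies it to the real DataFrame in one call.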

Solution 2:[2]

The following should work:

import re
from pyspark.sql.functions import col

# remove spaces from column names
newcols = [col(column).alias(re.sub(r'\s+', '', column))
           for column in df.columns]

# rename columns
df = df.select(newcols)
df.show()

EDIT: as a first step, if you just wanted to check which columns have whitespace, you could use something like the following:

space_cols = [column for column in df.columns if re.search(r'\s', column)]

Also, check whether there are any characters that are non-alphanumeric (or space):

non_alnum = [column for column in df.columns if re.search(r'[^a-zA-Z0-9\s]', column)]
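Both checks can be exercised on plain strings without a Spark session. A small sketch with invented names, using re.search so that empty matches cannot flag clean names (re.findall with a `\s*` pattern would match the empty string in every name):

```python
import re

columns = ['clean', 'has space', 'has-dash', 'tab\tname']  # hypothetical names

# names containing any whitespace character
space_cols = [c for c in columns if re.search(r'\s', c)]

# names containing characters that are neither alphanumeric nor whitespace
non_alnum = [c for c in columns if re.search(r'[^a-zA-Z0-9\s]', c)]

print(space_cols)  # → ['has space', 'tab\tname']
print(non_alnum)   # → ['has-dash']
```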

Solution 3:[3]

Wrapping the name in backticks (or quotes) helps Spark identify the entire column name, which matters when the name contains dots or other special characters:

df_vec = vectorAssembler.transform(df.drop('`col200`'))

or

df_vec = vectorAssembler.transform(df.drop("col200"))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Anonymous Person
Solution 2
Solution 3 MichiganMadeLearner