TypeError while tokenizing a column in Spark dataframe
I'm trying to tokenize a string column of a Spark DataFrame.
The DataFrame's schema is as follows:
df:
index ---> Integer
question ---> String
This is how I'm using the Spark Tokenizer:
Quest = df.withColumn("question", col("Question").cast(StringType()))
tokenizer = Tokenizer(inputCol=Quest, outputCol="question_parts")
But I get the following error:
Invalid param value given for param "inputCol". Could not convert <class 'pyspark.sql.dataframe.DataFrame'> to string type
I also replaced the first line of my code with each of the following, but neither resolved the error:
Quest = df.select(concat_ws(" ",col("question")))
and
Quest= df.withColumn("question", concat_ws(" ",col("question")))
What's my mistake here?
Solution 1:
The mistake is in the second line: inputCol expects a column name (a string), but you passed it the DataFrame Quest. df.withColumn() returns a DataFrame with the column you just created appended, so inputCol="question" is what you need. You then need to transform your DataFrame using the tokenizer.
Try:
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
from pyspark.ml.feature import Tokenizer
df = df.withColumn("Question", col("Question").cast(StringType()))
tokenizer = Tokenizer(inputCol="Question", outputCol="question_parts")
tokenizer.transform(df)
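As a self-contained illustration, here is a minimal sketch of the same flow (assuming an active SparkSession named spark; the sample rows are made up for demonstration):
from pyspark.ml.feature import Tokenizer
df = spark.createDataFrame(
    [(1, "What is Spark"), (2, "How do I tokenize a column")],
    ["index", "question"],
)
tokenizer = Tokenizer(inputCol="question", outputCol="question_parts")
# transform() returns a new DataFrame with the token array appended
tokenizer.transform(df).show(truncate=False)
Note that Tokenizer lowercases its input and splits on whitespace; if you need a custom splitting pattern, RegexTokenizer is the usual alternative.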
Edit:
I'm not sure you intended to create a new column in the first line, so I've changed the column name in the withColumn method from "question" to "Question" to replace the existing column. It also looks from your data like the column is already in string format; if so, the cast step is not necessary.
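If you want to confirm the column's type before deciding whether to cast, printSchema() shows it directly; given the schema in the question, you would expect something like this (nullability may differ):
df.printSchema()
# root
#  |-- index: integer (nullable = true)
#  |-- question: string (nullable = true)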
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow