Remove special character from a column in dataframe
I am trying to remove a special character (å) from a column in a dataframe.
My data looks like:
ClientID,PatientID
AR0001å,DH_HL704221157198295_91
AR00022,DH_HL704221157198295_92
My original data is approximately 8 TB in size, and I need to get rid of this special character in it.
Code to load data:
reader.option("header", true)
.option("sep", ",")
.option("inferSchema", false)
.option("charset", "ISO-8859-1")
.schema(schema)
.csv(path)
After loading the data into a dataframe, df.show() prints:
+--------+--------------------+
|ClientID| PatientID|
+--------+--------------------+
|AR0001Ã¥|DH_HL704221157198...|
|AR00022 |DH_HL704221157198...|
+--------+--------------------+
Code I used to try to replace this character:
df.withColumn("ClientID", functions.regexp_replace(df.col("ClientID"), "\å", ""));
But this didn't work. If I change the charset to "UTF-8" while loading the data into the dataframe, it works.
I have not been able to find a solution that works with the current charset (ISO-8859-1).
Solution 1:[1]
Some things to note:
- Make sure to assign the result to a new variable and use that afterwards, because withColumn returns a new dataframe and leaves df unchanged.
- You do not need to escape "å" with \.
- colName in the command should be ClientID or PatientID.
If you have done all of this, I would suggest that instead of matching on "å" you match on the characters you want to keep and remove everything else. For example, for the ClientID column:
df.withColumn("ClientID", functions.regexp_replace(df.col("ClientID"), "[^A-Z0-9_]", ""));
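Since regexp_replace takes a Java regular expression, the pattern can be checked without a cluster. The following is only a minimal sketch using plain java.util.regex; the class and method names (WhitelistDemo, clean) are made up for illustration, and the non-ASCII characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class WhitelistDemo {
    // Same pattern as above: anything that is NOT A-Z, 0-9 or underscore.
    private static final Pattern NOT_ALLOWED = Pattern.compile("[^A-Z0-9_]");

    // Hypothetical helper: removes every character outside the whitelist.
    static String clean(String value) {
        return NOT_ALLOWED.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        // "\u00C3\u00A5" is the two-character sequence that df.show() displays.
        System.out.println(clean("AR0001\u00C3\u00A5"));       // AR0001
        System.out.println(clean("AR00022"));                  // AR00022
        System.out.println(clean("DH_HL704221157198295_91"));  // DH_HL704221157198295_91
    }
}
```

Because the pattern lists what to keep, it removes the unwanted characters regardless of which charset was used to read the file. Note that it would also remove lowercase letters, so adjust the character class if your IDs can contain them.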
Another approach would be to take the UTF-8 bytes of "å", decode them as ISO-8859-1, and replace the resulting string (the "Ã¥" that appears in the df.show() output):
String escapeChar = new String("å".getBytes("UTF-8"), "ISO-8859-1");
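To see what that conversion produces, here is a small standalone sketch (no Spark needed). It assumes the file on disk is UTF-8 encoded, which is what the "Ã¥" in the output suggests; the names MojibakeDemo and asReadWithLatin1 are invented for this example. Using StandardCharsets instead of charset names also avoids the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Hypothetical helper: what a UTF-8 encoded string looks like
    // after its bytes are decoded as ISO-8859-1.
    static String asReadWithLatin1(String s) {
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u00E5" is the character å; in UTF-8 it is the two bytes 0xC3 0xA5.
        String escapeChar = asReadWithLatin1("\u00E5");

        // Each byte becomes one ISO-8859-1 character, so the result has length 2.
        System.out.println(escapeChar.length()); // 2

        // Simulate the loaded value and remove the two-character sequence.
        String loaded = "AR0001" + escapeChar;
        System.out.println(loaded.replace(escapeChar, "")); // AR0001
    }
}
```

In the Spark job, escapeChar would then be passed as the pattern argument of functions.regexp_replace in place of "å". The two resulting characters are not regex metacharacters, so they should not need escaping.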
Solution 2:[2]
The command below removes all special characters from the string and keeps only lowercase and uppercase letters and digits (note that it also removes underscores):
df.withColumn("ClientID", functions.regexp_replace(df.col("ClientID"), "[^a-zA-Z0-9]", ""));
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Aditya Gupta |