Spark read CSV option to escape delimiter

I have an input CSV like the one below and need to escape the delimiter within one of the columns (the second column):

f1|f2|f3
v1|v2\|2|v3
x1|x2\|2|x3

I intended to obtain the result below with the following command:

spark.read.option("header", "true").option("delimiter", "|").csv("file.csv.gz")

+---+-----+---+
|f1 |f2   |f3 |
+---+-----+---+
|v1 |v2\|2|v3 |
|x1 |x2\|2|x3 |
+---+-----+---+

but obtained this instead:

+---+---+---+
|f1 |f2 |f3 |
+---+---+---+
|v1 |v2\|2  |
|x1 |x2\|2  |
+---+---+---+

I tried playing around with the options mentioned in the documentation (https://spark.apache.org/docs/2.4.5/api/scala/#org.apache.spark.sql.DataFrameReader), but was not able to figure out an easy way to escape the value delimiter.

Also, the data is huge, and I did not find an efficient way to read it as text, replace the delimiter, and then apply a schema.



Solution 1:[1]

I have a work-around to process the escaped delimiter "\|" in the second column. It may not be ideal, but it may serve as a solution. The code is as follows:

import re
import os
import findspark
findspark.init() 
from pyspark.sql import SparkSession

current_folder = os.getcwd()
input_file = os.path.join(current_folder, "input.csv")
""" the input.csv file are as following
f1|f2|f3
v1|v2\|2|v3
x1|x2\|2|x3
y1|y2\|2|y3
z1|z2\|2|z3
"""

def delimit_line(line):
    # split on "|" only when it is not preceded by a backslash
    line_entry_list = re.split(r"(?<!\\)\|", line)
    return line_entry_list
    
spark = SparkSession.builder.appName("read_data").config("spark.master", "local").getOrCreate()
raw_lines_RDD = spark.sparkContext.textFile(input_file)  
raw_lines_RDD = raw_lines_RDD.map(lambda line: delimit_line(line))

print ("\n after delimiter : ", raw_lines_RDD.collect())
    
column_line = raw_lines_RDD.filter(lambda line_entry_list: line_entry_list[0].startswith("f1"))    
content_lines =  raw_lines_RDD.filter(lambda line_entry_list: not line_entry_list[0].startswith("f1"))  
print("column line: ", column_line.collect()) 
print("content line: ", content_lines.collect())  

The output will be as follows:

  after delimiter :  [['f1', 'f2', 'f3'], ['v1', 'v2\\|2', 'v3'], ['x1', 'x2\\|2', 'x3'], ['y1', 'y2\\|2', 'y3'], ['z1', 'z2\\|2', 'z3']]
column line:  [['f1', 'f2', 'f3']]
content line:  [['v1', 'v2\\|2', 'v3'], ['x1', 'x2\\|2', 'x3'], ['y1', 'y2\\|2', 'y3'], ['z1', 'z2\\|2', 'z3']]

Now that the column titles and the content are separated, we can convert the lists in the RDD into a DataFrame.

In the code, we use a regular expression with a negative lookbehind to exclude pipes preceded by "\" when splitting each line. Hope this solution helps.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
