How can I read big string data in PySpark?

I want to read long, multi-line string paragraphs from a CSV file into a PySpark DataFrame, like this:

data.to_csv('file.csv', index=False,encoding='utf-8',na_rep='Unkown')
df= spark.read.format('csv') \
            .option('header',True) \
            .option("multiline",True).load('file.csv',inferSchema=True)

df.printSchema()

It gives:

root
 |-- Id: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Body: string (nullable = true)
 |-- Tags: string (nullable = true)

and

df.show()

The result is:

+--------------------+--------------------+--------------------+--------------------+
|                  Id|               Title|                Body|                Tags|
+--------------------+--------------------+--------------------+--------------------+
|                   1|How to check if a...|<p>I'd like to ch...|php image-process...|
|                   2|How can I prevent...|<p>In my favorite...|             firefox|
|                   3|R Error Invalid t...|"<p>I am import m...|                null|
|      expert_trai...|                null|                null|                null|
|      expert_data...|                null|                null|                null|
|      rf_model = ...| data=expert_data...|     importance=TRUE|      do.trace=100);|
|                   }|                null|                null|                null|
|       </code></pre>|                null|                null|                null|

When I check the third entry of the Body column in the original pandas DataFrame:

print(data['Body'][2])

it gives:

<p>I am import matlab file and construct a data frame, matlab file contains two columns with and each row maintain a cell that has a matrix, I construct a dataframe to run random forest. But I am getting following error. </p>
<pre><code>Error in model.frame.default(formula = expert_data_frame$t_labels ~ .,  : 
  invalid type (list) for variable 'expert_data_frame$t_labels'
</code></pre>
<p>Here is the code how I import the matlab file and construct the dataframe:</p>
<pre><code>all_exp_traintest &lt;- readMat(all_exp_filepath);
len = length(all_exp_traintest$exp.traintest)/2;
    for (i in 1:len) {
      expert_train_df &lt;- data.frame(all_exp_traintest$exp.traintest[i]);
      labels = data.frame(all_exp_traintest$exp.traintest[i+302]);
      names(labels)[1] &lt;- "t_labels";
      expert_train_df$t_labels &lt;- labels;
      expert_data_frame &lt;- data.frame(expert_train_df);
      rf_model = randomForest(expert_data_frame$t_labels ~., data=expert_data_frame, importance=TRUE, do.trace=100);
    }
</code></pre>
<p>Structure of the Matlab input file</p>
<pre><code>[56x12 double]    [56x1 double]
[62x12 double]    [62x1 double]
[62x12 double]    [62x1 double]
[62x12 double]    [62x1 double].......

I tried

option("multiline", true)

but it doesn't work: the multi-line Body value is still split across several rows. Any suggestions? Thanks.
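
A possible explanation, sketched below under stated assumptions: Python's CSV writers (pandas `to_csv` included) escape a double quote inside a quoted field by doubling it (`""`), while Spark's CSV reader uses a backslash as its default escape character. A Body value that contains both newlines and quotes, such as `"t_labels"` in the third row, may therefore be mis-parsed even with `multiLine` enabled. The sample row and the helper name `read_quoted_multiline_csv` are hypothetical; the first part uses only the standard library to show what the file looks like on disk, and the helper shows reader options that may be worth trying.

```python
import csv
import io

# Hypothetical sample: a Body value with newlines, a comma and double quotes,
# loosely modelled on the third row in the question.
body = (
    "<p>intro</p>\n"
    '<pre><code>names(labels)[1] &lt;- "t_labels";\n'
    "randomForest(y ~ ., data=df)\n"
    "</code></pre>"
)
rows = [["Id", "Title", "Body", "Tags"], ["3", "R Error", body, "r"]]

# Write the rows the way a standard CSV writer does: fields containing
# newlines, commas or quotes are wrapped in quotes, and inner quotes are doubled.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
raw = buf.getvalue()

# Splitting on line breaks sees 5 physical lines, not 2 records.
physical_lines = raw.splitlines()

# A quote-aware parser recovers the 2 logical records and the original Body.
records = list(csv.reader(io.StringIO(raw)))


def read_quoted_multiline_csv(spark, path):
    """Sketch: read a CSV whose quoted fields span lines and contain "" escapes.

    `spark` is an existing SparkSession; this helper is illustrative only.
    """
    return (
        spark.read
        .option("header", True)
        .option("multiLine", True)  # let a quoted field span several lines
        .option("quote", '"')
        .option("escape", '"')      # treat "" inside a quoted field as a literal quote
        .csv(path)
    )
```

With a running session this would be called as `df = read_quoted_multiline_csv(spark, 'file.csv')`. Whether it resolves the exact file in the question is untested here; if rows are still split, inspecting the raw text around the third record for unbalanced quotes is a reasonable next step.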



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
