How can I get big string data in PySpark?
I want to read a large, multi-line string paragraph from a CSV file into a PySpark DataFrame, like this:
data.to_csv('file.csv', index=False,encoding='utf-8',na_rep='Unkown')
df= spark.read.format('csv') \
.option('header',True) \
.option("multiline",True).load('file.csv',inferSchema=True)
df.printSchema()
It gives:
root
|-- Id: string (nullable = true)
|-- Title: string (nullable = true)
|-- Body: string (nullable = true)
|-- Tags: string (nullable = true)
and
df.show()
The result is:
+--------------------+--------------------+--------------------+--------------------+
|                  Id|               Title|                Body|                Tags|
+--------------------+--------------------+--------------------+--------------------+
|                   1|How to check if a...|<p>I'd like to ch...|php image-process...|
|                   2|How can I prevent...|<p>In my favorite...|             firefox|
|                   3|R Error Invalid t...|"<p>I am import m...|                null|
|      expert_trai...|                null|                null|                null|
|      expert_data...|                null|                null|                null|
|      rf_model = ...| data=expert_data...|     importance=TRUE|      do.trace=100);|
|                   }|                null|                null|                null|
|       </code></pre>|                null|                null|                null|
When I check the third entry of the Body column in the original pandas DataFrame:
print(data['Body'][2])
it gives:
<p>I am import matlab file and construct a data frame, matlab file contains two columns with and each row maintain a cell that has a matrix, I construct a dataframe to run random forest. But I am getting following error. </p>
<pre><code>Error in model.frame.default(formula = expert_data_frame$t_labels ~ ., :
invalid type (list) for variable 'expert_data_frame$t_labels'
</code></pre>
<p>Here is the code how I import the matlab file and construct the dataframe:</p>
<pre><code>all_exp_traintest <- readMat(all_exp_filepath);
len = length(all_exp_traintest$exp.traintest)/2;
for (i in 1:len) {
expert_train_df <- data.frame(all_exp_traintest$exp.traintest[i]);
labels = data.frame(all_exp_traintest$exp.traintest[i+302]);
names(labels)[1] <- "t_labels";
expert_train_df$t_labels <- labels;
expert_data_frame <- data.frame(expert_train_df);
rf_model = randomForest(expert_data_frame$t_labels ~., data=expert_data_frame, importance=TRUE, do.trace=100);
}
</code></pre>
<p>Structure of the Matlab input file</p>
<pre><code>[56x12 double] [56x1 double]
[62x12 double] [62x1 double]
[62x12 double] [62x1 double]
[62x12 double] [62x1 double].......
I tried
option("multiline", True)
but it doesn't work: the third Body value is still split across several rows. Any suggestions? Thanks.
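One thing that may be worth trying, sketched here under assumptions rather than as a confirmed fix: by default pandas' to_csv writes a double quote inside a quoted field as two double quotes (""), while Spark's CSV reader uses a backslash as its default escape character. If that mismatch is the cause, the Body field would appear to end at the first embedded quote, and the commas and newlines after it would be read as new columns and rows. Setting the reader's escape option to '"' alongside multiLine addresses that case. The helper names write_sample_csv and read_multiline_csv and the sample row are hypothetical, made up for illustration; the sample file is written with Python's csv module, whose default quoting matches the pandas behaviour described above.

```python
import csv
import os
import tempfile

# Hypothetical sample value: like the third row in the question, the Body
# contains newlines, commas and double quotes.
SAMPLE_BODY = (
    "<p>I am getting the following error.</p>\n"
    '<pre><code>names(labels)[1] <- "t_labels";\n'
    "rf_model = randomForest(y ~., data=d, importance=TRUE);\n"
    "</code></pre>"
)


def write_sample_csv(path):
    """Write a small CSV with minimal quoting, where embedded double quotes
    are doubled ("") instead of being backslash-escaped."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # defaults: QUOTE_MINIMAL, doublequote=True
        writer.writerow(["Id", "Title", "Body", "Tags"])
        writer.writerow(["3", "R Error Invalid type", SAMPLE_BODY, "r"])


def read_multiline_csv(spark, path):
    """Read a CSV whose quoted fields span several lines and contain "" pairs.

    `spark` is assumed to be an existing SparkSession.
    """
    return (
        spark.read.format("csv")
        .option("header", True)
        .option("multiLine", True)  # allow newlines inside quoted fields
        .option("quote", '"')       # Spark's default quote char, stated explicitly
        .option("escape", '"')      # read "" inside a quoted field as a literal quote
        .load(path)
    )
```

With an active SparkSession named spark, calling read_multiline_csv(spark, 'file.csv') and then df.show() should show whether the third row now stays in one piece. If the file was written with a different quoting or escape setting, the quote and escape options would need to match that instead.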
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow