PySpark dataframe returns different results each time I run it
Every time I run a simple groupby, PySpark returns different values, even though I haven't modified the dataframe.
Here is the code I am using:
from pyspark.sql.functions import count, desc

df = spark.sql('select * from data ORDER BY document_id')
df_check = df.groupby("vacina_descricao_dose").agg(count('paciente_id').alias('paciente_id_count')).orderBy(desc('paciente_id_count')).select("*")
df_check.show(df_check.count(), False)
I ran df_check.show() three times and the paciente_id_count column showed different values each time (I trimmed the tables to make them easier to compare).
How do I prevent this?
Solution 1:[1]
.show() does not necessarily force computation of the full result. If the final number of rows fits in your driver's memory, you could try the following:
df = spark.sql('select * from data ORDER BY document_id')
df_check = df.groupby("vacina_descricao_dose").agg(count('paciente_id').alias('paciente_id_count')).orderBy(desc('paciente_id_count')).select("*")
df_check.toPandas()
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Jesus Sono