PySpark dataframe returns different results each time I run it

Every time I run a simple groupBy, PySpark returns different values, even though I haven't modified the dataframe.

Here is the code I am using:

from pyspark.sql.functions import count, desc

df = spark.sql('select * from data ORDER BY document_id')
df_check = (df.groupby('vacina_descricao_dose')
              .agg(count('paciente_id').alias('paciente_id_count'))
              .orderBy(desc('paciente_id_count')))
df_check.show(df_check.count(), False)

I ran df_check.show() three times and the paciente_id_count column gave different values every time (I cut the tables in the screenshots so it would be easier to compare).

How do I prevent this?



Solution 1:[1]

Calling .show() does not force the whole chain of operations to be computed.

Maybe you could try the following (if the final number of rows fits in your driver memory):

from pyspark.sql.functions import count, desc

df = spark.sql('select * from data ORDER BY document_id')
df_check = (df.groupby('vacina_descricao_dose')
              .agg(count('paciente_id').alias('paciente_id_count'))
              .orderBy(desc('paciente_id_count')))
df_check.toPandas()
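
If the result is too large to collect to the driver, a minimal alternative sketch (not from the original answer, and assuming the varying counts come from the plan being re-evaluated on each action) is to persist the aggregated dataframe so that every later action reads the same materialized data:

from pyspark.sql.functions import count, desc

df = spark.sql('select * from data ORDER BY document_id')
df_check = (df.groupby('vacina_descricao_dose')
              .agg(count('paciente_id').alias('paciente_id_count'))
              .orderBy(desc('paciente_id_count')))

# Persist and force one full materialization; subsequent actions such as
# .show() then reuse the cached partitions instead of recomputing the plan.
df_check.cache()
df_check.count()

df_check.show(truncate=False)

Note the trade-off: toPandas() brings the entire result into driver memory as a pandas DataFrame, while cache() keeps it distributed across the executors, which scales better for large results.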

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Jesus Sono