'Why spark bucket number not equal to the number of files in the partition?

val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.master", "local").getOrCreate()


import spark.implicits._
case class Something(id: Int, batchId: Option[String], div: String)
val sth1 = Something(1, Some("1000"), "10")
val sth2 = Something(2, Some("1000"), "10")
val sth3 = Something(3, Some("1000"), "10")
val sth4 = Something(4, Some("1000"), "10")
val ds = Seq(sth1, sth2, sth3, sth4).toDS()
ds.write.mode("overwrite").option("path", "loacl_path").bucketBy(3, "id").saveAsTable("Tmp")

I go to the local_path where it stores the data but I only find two parquet files. I wonder why it doesn't create 3 parquet files which is the number of bucket.

I have also tried bucket number equals to 1 or 2, it does impact the number of parquet files stored in local path. When bucket numer is 1, then there is only 1 parquet file, similarly for the case when it equals to 2.



Solution 1:[1]

You should use Dataset.repartition operator to control the number of output files.

You can still have the bucketBy with combination with repartition, but bucketBy has different use - avoiding shuffles in joins when they use the join keys matching the bucketing keys.

ds.repartition(3)
  .write
  .mode("overwrite")
  .option("path", "loacl_path")
  .bucketBy(3, "id")
  .saveAsTable("Tmp")

Solution 2:[2]

bucketBy is not probably what you're looking for (if you're expecting your data to be written inside 3 parquet files). when you use bucketBy, you define the column names, and a hash function is responsible to divide your data into number of buckets you specified, it doesn't necessarily mean that they should be saved in n files. This is used to boost your querying performance (something similar to indexing, not equal). Now I haven't tried this yet, but what you're looking for probably is repartition method.

df.repartition(3)
  .write.mode(SaveMode.Overwrite)
  .option("path", "local_path")
  .saveAsTable("Tmp")

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jacek Laskowski
Solution 2 AminMal