Why is the Spark bucket number not equal to the number of files in the partition?
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.master", "local").getOrCreate()
import spark.implicits._
case class Something(id: Int, batchId: Option[String], div: String)
val sth1 = Something(1, Some("1000"), "10")
val sth2 = Something(2, Some("1000"), "10")
val sth3 = Something(3, Some("1000"), "10")
val sth4 = Something(4, Some("1000"), "10")
val ds = Seq(sth1, sth2, sth3, sth4).toDS()
ds.write.mode("overwrite").option("path", "local_path").bucketBy(3, "id").saveAsTable("Tmp")
I went to local_path, where the data is stored, but I found only two Parquet files. Why doesn't Spark create 3 Parquet files, matching the number of buckets?
I also tried bucket counts of 1 and 2, and the bucket count does affect the number of Parquet files stored in the local path: with 1 bucket there is only 1 Parquet file, and with 2 buckets there are 2.
Solution 1:[1]
You should use the Dataset.repartition operator to control the number of output files.
You can still use bucketBy in combination with repartition, but bucketBy has a different purpose: avoiding shuffles in joins whose join keys match the bucketing keys.
ds.repartition(3)
  .write
  .mode("overwrite")
  .option("path", "local_path")
  .bucketBy(3, "id")
  .saveAsTable("Tmp")
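The join use case mentioned above can be sketched as follows. This is a minimal illustration, not part of the original answer: the table names `bucketed_left` and `bucketed_right`, the bucket count of 4, and the row count are hypothetical, and broadcast joins are disabled only so that the planner picks a sort-merge join. It also assumes a fresh warehouse directory. When both tables are bucketed on the join key with the same number of buckets, the physical plan printed by `explain()` would be expected to contain no `Exchange` (shuffle) step.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object BucketedJoinSketch {
  // Writes two hypothetical bucketed tables and returns their join on "id".
  def bucketedJoin(spark: SparkSession): DataFrame = {
    // Assumption: disable broadcast joins so a sort-merge join is planned.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // Both tables use the same bucketing column and the same bucket count.
    Seq("bucketed_left", "bucketed_right").foreach { name =>
      spark.range(1000)
        .write
        .mode("overwrite")
        .bucketBy(4, "id")
        .sortBy("id")
        .saveAsTable(name)
    }

    spark.table("bucketed_left").join(spark.table("bucketed_right"), "id")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketed-join-sketch")
      .master("local[*]")
      .getOrCreate()

    // Inspect the physical plan for the presence or absence of Exchange.
    bucketedJoin(spark).explain()

    spark.stop()
  }
}
```

Comparing this plan with the plan for the same join on unbucketed tables is a quick way to see what bucketing buys you.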
Solution 2:[2]
bucketBy is probably not what you're looking for (if you expect your data to be written to 3 Parquet files). When you use bucketBy, you specify the column names, and a hash function distributes your data across the number of buckets you specified; that doesn't necessarily mean the data will be saved in n files. Bucketing is used to boost query performance (similar to indexing, though not the same). I haven't tried this yet, but what you're looking for is probably the repartition method.
import org.apache.spark.sql.SaveMode
// ds is the Dataset defined in the question
ds.repartition(3)
  .write.mode(SaveMode.Overwrite)
  .option("path", "local_path")
  .saveAsTable("Tmp")
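The hash-based assignment described above can be inspected directly. The sketch below assumes that Spark's native bucketing places a row in bucket `pmod(hash(bucketing columns), numBuckets)`, where `hash` is the built-in Murmur3-based function from `org.apache.spark.sql.functions`; that is my understanding of the implementation rather than something stated in the answers. The object and method names are hypothetical. If some bucket receives no rows, no file is written for it, which would explain seeing fewer files than buckets for a four-row dataset.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

object BucketInspection {
  // Returns a map from bucket id to the number of ids that land in that bucket,
  // assuming bucket = pmod(hash(id), numBuckets).
  def rowsPerBucket(spark: SparkSession, ids: Seq[Int], numBuckets: Int): Map[Int, Long] = {
    import spark.implicits._

    ids.toDF("id")
      .withColumn("bucket", pmod(hash(col("id")), lit(numBuckets)))
      .groupBy("bucket")
      .count()
      .collect()
      .map(row => row.getInt(0) -> row.getLong(1))
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucket-inspection")
      .master("local[*]")
      .getOrCreate()

    // Same ids and bucket count as in the question.
    val counts = rowsPerBucket(spark, Seq(1, 2, 3, 4), 3)
    counts.toSeq.sortBy(_._1).foreach { case (bucket, n) =>
      println(s"bucket $bucket: $n row(s)")
    }
    println(s"non-empty buckets: ${counts.size} of 3")

    spark.stop()
  }
}
```

Each write task produces at most one file per non-empty bucket it holds, so the file count depends on both the data and the number of partitions, not on the bucket count alone.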
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Jacek Laskowski |
Solution 2 | AminMal |