'Spark/Scala approximate group by
Is there a way of counting approximately after a group by on an sql dataset in Spark? Or more generally, what is the fastest way of group by counting in Spark?
Solution 1:[1]
I am not sure you are looking for these...
approx_count_distinct
and countDistinct
are the things available wtih spark api
there is no approx_count_groupby
Examples :
package examples
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
object CountAgg extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
val spark = SparkSession.builder.appName(getClass.getName)
.master("local[*]").getOrCreate
import spark.implicits._
import org.apache.spark.sql.functions._
val df =
Seq(("PAGE1","VISITOR1"),
("PAGE1","VISITOR1"),
("PAGE2","VISITOR1"),
("PAGE2","VISITOR2"),
("PAGE2","VISITOR1"),
("PAGE1","VISITOR1"),
("PAGE1","VISITOR2"),
("PAGE1","VISITOR1"),
("PAGE1","VISITOR2"),
("PAGE1","VISITOR1"),
("PAGE2","VISITOR2"),
("PAGE1","VISITOR3")
).toDF("Page", "Visitor")
println("groupby abd count example ")
df.groupBy($"page").agg(count($"visitor").as("count")).show
println("group by and countDistinct")
df.select("page","visitor")
.groupBy('page)
.agg( countDistinct('visitor)).show
println("group by and approx_count_distinct")
df.select("page","visitor")
.groupBy('page)
.agg( approx_count_distinct('visitor)).show
}
Result
+-----+-----+
| page|count|
+-----+-----+
|PAGE2| 4|
|PAGE1| 8|
+-----+-----+
group by and countDistinct
+-----+-----------------------+
| page|count(DISTINCT visitor)|
+-----+-----------------------+
|PAGE2| 2|
|PAGE1| 3|
+-----+-----------------------+
group by and approx_count_distinct
[2020-04-06 01:04:24,488] WARN Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf. (org.apache.spark.util.Utils:66)
+-----+------------------------------+
| page|approx_count_distinct(visitor)|
+-----+------------------------------+
|PAGE2| 2|
|PAGE1| 3|
+-----+------------------------------+
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | thebluephantom |