Spark/Scala approximate group by

Is there a way of counting approximately after a group by on an sql dataset in Spark? Or more generally, what is the fastest way of group by counting in Spark?



Solution 1:[1]

I am not sure if this is what you are looking for...

approx_count_distinct and countDistinct

are what the Spark API offers; there is no approx_count_groupby function.

Example:

package examples

import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession

object CountAgg extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)


  val spark = SparkSession.builder.appName(getClass.getName)
    .master("local[*]").getOrCreate

  import spark.implicits._
  import org.apache.spark.sql.functions._
  val df =
    Seq(("PAGE1","VISITOR1"),
      ("PAGE1","VISITOR1"),
      ("PAGE2","VISITOR1"),
      ("PAGE2","VISITOR2"),
      ("PAGE2","VISITOR1"),
      ("PAGE1","VISITOR1"),
      ("PAGE1","VISITOR2"),
      ("PAGE1","VISITOR1"),
      ("PAGE1","VISITOR2"),
      ("PAGE1","VISITOR1"),
      ("PAGE2","VISITOR2"),
      ("PAGE1","VISITOR3")
    ).toDF("Page", "Visitor")
  println("group by and count example")
  df.groupBy($"page").agg(count($"visitor").as("count")).show
  println("group by and countDistinct")
  df.select("page","visitor")
    .groupBy('page)
    .agg(countDistinct('visitor)).show
  println("group by and approx_count_distinct")
  df.select("page","visitor")
    .groupBy('page)
    .agg( approx_count_distinct('visitor)).show

}

Result
group by and count example

+-----+-----+
| page|count|
+-----+-----+
|PAGE2|    4|
|PAGE1|    8|
+-----+-----+

group by and countDistinct
+-----+-----------------------+
| page|count(DISTINCT visitor)|
+-----+-----------------------+
|PAGE2|                      2|
|PAGE1|                      3|
+-----+-----------------------+

group by and approx_count_distinct
+-----+------------------------------+
| page|approx_count_distinct(visitor)|
+-----+------------------------------+
|PAGE2|                             2|
|PAGE1|                             3|
+-----+------------------------------+
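Since the question mentions SQL, the same functions can also be called through spark.sql. A minimal, self-contained sketch (the view name visits and the object name are illustrative, not from the original answer); note that approx_count_distinct also accepts an optional maximum relative standard deviation, which lets you trade accuracy for speed:

```scala
import org.apache.spark.sql.SparkSession

object ApproxGroupBySql extends App {
  val spark = SparkSession.builder.appName("ApproxGroupBySql")
    .master("local[*]").getOrCreate()
  import spark.implicits._

  // Register a small sample table for the SQL query below
  Seq(("PAGE1", "VISITOR1"), ("PAGE1", "VISITOR2"), ("PAGE2", "VISITOR1"))
    .toDF("Page", "Visitor")
    .createOrReplaceTempView("visits")

  // approx_count_distinct(col) uses a default rsd of 0.05; passing a
  // second argument tightens (smaller) or loosens (larger) the bound.
  // A larger rsd is faster and uses less memory, but is less accurate.
  spark.sql(
    """SELECT Page,
      |       count(Visitor)                       AS cnt,
      |       approx_count_distinct(Visitor)       AS approx_distinct,
      |       approx_count_distinct(Visitor, 0.01) AS approx_distinct_tight
      |FROM visits
      |GROUP BY Page""".stripMargin).show()

  spark.stop()
}
```

For a plain group-by count (no distinct), count is already a cheap aggregation; the approximation only pays off for distinct counts, where the exact countDistinct needs to track every value per group.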

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 thebluephantom