What are broadcast variables? What problems do they solve?

I am going through the Spark Programming Guide, which says:

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

Considering the above, what are the use cases of broadcast variables? What problems do broadcast variables solve?

When we create a broadcast variable as shown below, is the variable reference (here, broadcastVar) available on all the nodes in the cluster?

val broadcastVar = sc.broadcast(Array(1, 2, 3))

How long do these variables remain available in the memory of the nodes?



Solution 1:[1]

If you have a huge array that is accessed from Spark closures (for example, some reference data), the array is shipped to each Spark node together with the closure. For example, on a 10-node cluster with 100 partitions (10 partitions per node), the array will be distributed at least 100 times (10 times to each node).

If you use a broadcast variable instead, the array is distributed only once per node, using an efficient P2P protocol.
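To put rough numbers on that comparison, here is a back-of-the-envelope sketch that takes the model above at face value (one copy per task without broadcast, one copy per node with it). The 40 MiB array size and the name ShippingCost are assumptions made up for illustration, and the real transfer volume depends on the Spark version and scheduler.

```scala
// Illustrative arithmetic only; arrayBytes is an assumed size, not a measurement.
object ShippingCost {
  val nodes: Int = 10                          // cluster size from the example above
  val partitions: Int = 100                    // one task per partition
  val arrayBytes: Long = 40L * 1024 * 1024     // assumption: a 40 MiB reference array

  // Without broadcast: one serialized copy travels with every task's closure.
  val closureBytes: Long = partitions * arrayBytes

  // With broadcast: roughly one copy per node.
  val broadcastBytes: Long = nodes * arrayBytes

  def main(args: Array[String]): Unit = {
    println(s"shipped with closures: ${closureBytes / (1024 * 1024)} MiB")
    println(s"shipped via broadcast: ${broadcastBytes / (1024 * 1024)} MiB")
  }
}
```

Under these assumptions the closure approach moves ten times as much data, and the gap grows with the number of partitions per node.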

val array: Array[Int] = ??? // some huge array
val broadcasted = sc.broadcast(array)

And given some RDD:

val rdd: RDD[Int] = ???

In this case, the array is shipped with the closure each time:

rdd.map(i => array.contains(i))

whereas with the broadcast variable, you can get a significant performance benefit:

rdd.map(i => broadcasted.value.contains(i))
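The fragments above can be assembled into a small runnable sketch. It assumes a local master (local[2]) and a tiny reference set in place of the huge array. The names BroadcastLookup and run are hypothetical, and a Set is swapped in for the Array purely to make lookups cheap. The calls to unpersist() and destroy() show one way to control how long the broadcast data stays in executor memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookup {
  // Returns the elements of 1..10 that appear in the broadcast reference set.
  def run(): Array[Int] = {
    // Assumption: local mode with 2 threads, for illustration only.
    val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val reference: Set[Int] = Set(2, 3, 5, 7)   // stand-in for the huge array
      val broadcasted = sc.broadcast(reference)   // created once, on the driver

      val rdd = sc.parallelize(1 to 10, numSlices = 4)

      // Tasks read the cached value through .value instead of capturing `reference`.
      val hits = rdd.filter(i => broadcasted.value.contains(i)).collect()

      broadcasted.unpersist() // drops cached copies on executors; re-sent if used again
      broadcasted.destroy()   // releases all resources; the variable is unusable afterwards
      hits
    } finally {
      sc.stop()
    }
  }
}
```

Calling BroadcastLookup.run() should return the values 2, 3, 5 and 7. Only the closure that references broadcasted.value is serialized with each task, not the reference data itself.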

Solution 2:[2]

Broadcast variables are used to send shared data (for example application configuration) across all nodes/executors.

The broadcast value is cached on every executor.

Sample Scala code creating the broadcast variable on the driver:

val broadcastedConfig: Broadcast[Option[Config]] = sparkSession.sparkContext.broadcast(objectToBroadcast)

Sample Scala code reading the broadcast variable on the executor side:

val config = broadcastedConfig.value
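A minimal end-to-end sketch of the two snippets above might look like the following. AppConfig, its fields, and BroadcastConfigDemo are hypothetical stand-ins for the answer's Config and objectToBroadcast, and the local master is an assumption for illustration.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.SparkSession

// Hypothetical configuration type; case classes are serializable by default.
final case class AppConfig(prefix: String, minLength: Int)

object BroadcastConfigDemo {
  // Keeps words at least minLength long and prepends the configured prefix.
  def run(): List[String] = {
    val sparkSession = SparkSession.builder()
      .appName("broadcast-config-sketch")
      .master("local[2]") // assumption: local mode for illustration
      .getOrCreate()
    try {
      // Driver side: wrap the shared configuration in a broadcast variable.
      val broadcastedConfig: Broadcast[Option[AppConfig]] =
        sparkSession.sparkContext.broadcast(Option(AppConfig("id-", 3)))

      val words = sparkSession.sparkContext.parallelize(Seq("a", "spark", "rdd", "io"))

      words.flatMap { w =>
        // Executor side: .value returns the locally cached copy.
        val config = broadcastedConfig.value
        config.filter(c => w.length >= c.minLength).map(c => c.prefix + w).toList
      }.collect().toList
    } finally {
      sparkSession.stop()
    }
  }
}
```

With the assumed configuration, BroadcastConfigDemo.run() should yield "id-spark" and "id-rdd". The Option wrapper mirrors the answer's type and lets executors handle a missing configuration without a null check.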

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ankur Chavda
Solution 2