[Spark RDD] Find the single row that has the highest count and, for that row, report the month, count and hashtag name. Print the result to the terminal using println.
Here is a small example of the Twitter data used to illustrate the task (the sample rows were provided as an image in the original post and are not reproduced here). For that small example data set the result would be:
Expected Output: month: 200907, count: 1000, hashtagName: abc
My code so far:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext

object Main {
  def solution(sc: SparkContext) {
    // Load each line of the input data
    val twitterLines = sc.textFile("Assignment_Data/twitter-small.tsv")
    // Split each line of the input data into an array of strings
    val twitterdata = twitterLines.map(_.split("\t"))

    // TODO: *** Put your solution here ***
    // Keep whichever row has the larger value in the count column (index 2)
    val find_max = twitterdata.reduce { (max: Array[String], current: Array[String]) =>
      val curCount = current(2).toInt
      val maxCount = max(2).toInt
      if (curCount > maxCount) current else max
    }
  }
}
Please help: what should I do next to print the result?
Thanks.
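One possible way to finish this (a minimal sketch, not necessarily the intended solution): the reduce already returns the winning row as an Array[String], so the remaining step is to pull out its three fields and print them with println. In the sketch below, the assumed column positions are month at index 0, hashtag name at index 1, and count at index 2; only the count index is confirmed by the snippet above, so adjust the other two to match the actual layout of twitter-small.tsv.

import org.apache.spark.{SparkConf, SparkContext}

object MaxHashtag {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally for testing; in the assignment the SparkContext may be provided for you
    val sc = new SparkContext(new SparkConf().setAppName("MaxHashtag").setMaster("local[*]"))

    // Load and split the TSV, exactly as in the question
    val rows = sc.textFile("Assignment_Data/twitter-small.tsv").map(_.split("\t"))

    // Keep whichever row has the larger count (assumed to be column index 2)
    val maxRow = rows.reduce((a, b) => if (b(2).toInt > a(2).toInt) b else a)

    // Assumed column order: month = 0, hashtagName = 1, count = 2
    println(s"month: ${maxRow(0)}, count: ${maxRow(2)}, hashtagName: ${maxRow(1)}")

    sc.stop()
  }
}

An alternative to the hand-written reduce is RDD.max with a custom ordering, for example rows.max()(Ordering.by((r: Array[String]) => r(2).toInt)), which expresses the same "largest count wins" logic more compactly.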
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow