
Resilient Distributed Datasets (RDD) - Transformations and Actions

/*-----------------------------------------------------------------
    spark-shell -i 1-reduce-group-count.scala
-------------------------------------------------------------------*/
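// Each row of names.csv has the form: name,city,salary
// (see the Input section at the bottom of the page)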

val data = sc.textFile("names.csv")
val rdd = data.map(line => line.split(","))
//val skipFirstRow = data.filter(line => !line.contains("Count")).map(line => line.split(","))  // optional: drop a header row


println("\n\n----------- 1. REDUCE BY KEY (string) -------------------")
val keyValue = rdd.map(col => (col(0), col(2)))
val reduced = keyValue.reduceByKey((x,y) => x+","+y ).collect
reduced.foreach(rec => println(rec._1 + " => " + rec._2))

println("\n\n----------- 2. REDUCE BY KEY (integer) ------------------")
val keyValue = rdd.map(col => (col(0), col(2).toInt))
val reduced = keyValue.reduceByKey((x,y) => x+y ).collect
reduced.foreach(rec => println(rec._1 + " => " + rec._2))
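// reduceByKey is a transformation and is evaluated lazily; collect is the
// action that actually triggers the job. The reduce function above can also
// be written with Scala placeholder syntax: keyValue.reduceByKey(_ + _)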


println("\n\n----------- 3. GROUP BY KEY (slow) ----------------------")
val keyValue2 = rdd.map(col => (col(0), col(1)))
val grouped = keyValue2.groupByKey.collect
grouped.foreach(rec => println(rec._1 + " => " + rec._2.mkString(",")))
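// groupByKey is marked "slow" because every value is shuffled across the
// network before grouping. When the end result is an aggregate, prefer
// reduceByKey (or aggregateByKey), which combines values map-side before
// the shuffle, e.g. the equivalent of the grouping above:
//   rdd.map(col => (col(0), col(1))).reduceByKey(_ + "," + _)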


println("\n\n----------- 4. COUNT BY KEY -----------------------------")
val keys = rdd.map(col => (col(0), "something"))
val counted = keys.countByKey
counted.foreach(rec => println(rec._1 + " => " + rec._2))
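// countByKey is an action: it returns a scala.collection.Map to the driver,
// so it is only appropriate when the number of distinct keys is small.
// A distributed alternative that keeps the counts on the cluster:
//   rdd.map(col => (col(0), 1)).reduceByKey(_ + _)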


println("\n\n----------- 5. COUNT BY KEY (parallelize) ---------------")
val data = sc.parallelize(List("Kevin", "Thomas", "Peter", "Kevin", "Samuel")) 
val mapped = data.map(key => (key,1))
val counted = mapped.countByKey
counted.foreach(rec => println(rec._1 + " => " + rec._2))
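// countByValue produces the same counts without the manual (key, 1) mapping:
//   data.countByValue.foreach(rec => println(rec._1 + " => " + rec._2))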

sc.stop
sys.exit()

Output

The reduceByKey operation returns a new RDD in which the values for each key are merged using the supplied reduce function (string concatenation in example 1, integer addition in example 2).

----------- 1. REDUCE BY KEY (string) -------------------
Joshua => 23000
Alison => 165000
Tomas => 85000
James => 34000
Peter => 65000
Ryan => 55000,78000
Isabella => 74000
Samuel => 76000
Justin => 45000
Kevin => 190000,95000

----------- 2. REDUCE BY KEY (integer) ------------------
Joshua => 23000
Alison => 165000
Tomas => 85000
James => 34000
Peter => 65000
Ryan => 133000
Isabella => 74000
Samuel => 76000
Justin => 45000
Kevin => 285000

----------- 3. GROUP BY KEY (slow) ----------------------
Joshua => Calabasas
Alison => Santa Monica
Tomas => Simi Valley
James => Calabasas
Peter => Los Angeles
Ryan => Calabasas,Valencia
Isabella => Woodland Hills
Samuel => Woodland Hills
Justin => Santa Monica
Kevin => Agoura Hills,Calabasas

----------- 4. COUNT BY KEY -----------------------------
Alison => 1
Ryan => 2
Isabella => 1
James => 1
Tomas => 1
Joshua => 1
Samuel => 1
Kevin => 2
Peter => 1
Justin => 1

----------- 5. COUNT BY KEY (parallelize) ---------------
Kevin => 2
Thomas => 1
Peter => 1
Samuel => 1

Input - names.csv

Kevin,Agoura Hills,190000
Tomas,Simi Valley,85000
Peter,Los Angeles,65000
Alison,Santa Monica,165000
Ryan,Calabasas,55000
Justin,Santa Monica,45000
Joshua,Calabasas,23000
Isabella,Woodland Hills,74000
Kevin,Calabasas,95000
Samuel,Woodland Hills,76000
James,Calabasas,34000
Ryan,Valencia,78000
