1. Apache Spark
2. 01 Overview
- map() - Apply a function to each element in the DStream and return a DStream of the results.
- filter() - Return a DStream containing only the elements that satisfy the condition passed to filter().
- groupByKey() - Group values with the same key in each batch.
- reduceByKey() - Combine values with the same key in each batch using a reduce function.
- repartition() - Change the number of partitions of the DStream.
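The per-batch behaviour of these operations can be sketched in plain Python, without a Spark installation. This is only a model under stated assumptions: a DStream is represented as a list of batches, and the helper names (`map_batches`, `filter_batches`, `group_by_key`, `reduce_by_key`) are hypothetical stand-ins for illustration, not Spark APIs. In PySpark the corresponding methods are called on a DStream object.

```python
from collections import defaultdict
from functools import reduce

# Assumption: a DStream is modelled as a list of batches,
# and each batch is a list of (key, value) records.
batches = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("b", 5)],
]

def map_batches(stream, fn):
    """Model of map(): apply fn to every element of every batch."""
    return [[fn(x) for x in batch] for batch in stream]

def filter_batches(stream, pred):
    """Model of filter(): keep only elements for which pred is true."""
    return [[x for x in batch if pred(x)] for batch in stream]

def group_by_key(stream):
    """Model of groupByKey(): group values by key, separately per batch."""
    out = []
    for batch in stream:
        groups = defaultdict(list)
        for key, value in batch:
            groups[key].append(value)
        out.append(dict(groups))
    return out

def reduce_by_key(stream, fn):
    """Model of reduceByKey(): combine values per key, separately per batch."""
    return [
        {key: reduce(fn, values) for key, values in groups.items()}
        for groups in group_by_key(stream)
    ]

doubled = map_batches(batches, lambda kv: (kv[0], kv[1] * 2))
large = filter_batches(batches, lambda kv: kv[1] >= 3)
grouped = group_by_key(batches)
summed = reduce_by_key(batches, lambda a, b: a + b)
```

Note that key "b" appears in both batches but is never combined across them: `summed` holds one result per batch, which is what "in each batch" means in the list above.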
RDD - Resilient Distributed Dataset
- Abstraction for working with data (lazily evaluated, held in memory)
- An RDD is an immutable, distributed collection of elements divided into partitions
- RDDs come in multiple types
A Spark program produces an RDD, a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
It executes various parallel operations on the cluster and produces a result.
- It declares transformations and actions on RDDs
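The split between transformations (declared lazily) and actions (which trigger execution) can be illustrated with a toy class in plain Python. This is a minimal sketch, not Spark itself: `LazyDataset` is a hypothetical stand-in for an RDD, and it runs in a single process rather than on a cluster. The method names mirror the RDD operations `map`, `filter`, `collect`, and `count`.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, partitioned, lazily evaluated."""

    def __init__(self, partitions, ops=()):
        # Tuples keep the data and the recorded operations immutable.
        self._partitions = tuple(tuple(p) for p in partitions)
        self._ops = tuple(ops)

    # Transformations: record the operation and return a NEW dataset.
    def map(self, fn):
        return LazyDataset(self._partitions, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._partitions, self._ops + (("filter", pred),))

    def _run(self, partition):
        # Replay the recorded operations over one partition.
        items = iter(partition)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    # Actions: actually execute the recorded operations.
    def collect(self):
        return [x for p in self._partitions for x in self._run(p)]

    def count(self):
        return len(self.collect())


calls = []

def square(x):
    calls.append(x)  # record every real invocation
    return x * x

numbers = LazyDataset([[1, 2, 3], [4, 5, 6]])  # two partitions
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)  # nothing has run yet
result = evens_squared.collect()  # the action triggers the work
```

Declaring `map` and `filter` does no work; `square` is only invoked once `collect()` runs. The original `numbers` dataset is left unchanged, which mirrors the immutability noted above.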
Apache Spark popular libraries: