Apache Spark
Spark Streaming (DStream) Transformations

  • map() - Apply a function to each element in the DStream and return a DStream of the result.
  • filter() - Return a DStream containing only the elements that satisfy the predicate passed to filter.
  • groupByKey() - Group values with the same key in each batch.
  • reduceByKey() - Combine values with the same key in each batch.
  • repartition() - Change the number of partitions of the DStream.
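Conceptually, each of these transformations is applied independently to the RDD in every micro-batch of the DStream. A plain-Python sketch of the per-batch semantics of map(), filter(), and reduceByKey() (helper names here are illustrative only; the real Spark APIs operate lazily on distributed RDDs):

```python
def map_batch(batch, fn):
    # map(): apply fn to every element of one batch
    return [fn(x) for x in batch]

def filter_batch(batch, predicate):
    # filter(): keep only the elements that satisfy the predicate
    return [x for x in batch if predicate(x)]

def reduce_by_key_batch(batch, combine):
    # reduceByKey(): combine values sharing the same key within one batch
    acc = {}
    for key, value in batch:
        acc[key] = combine(acc[key], value) if key in acc else value
    return sorted(acc.items())

batch = [("a", 1), ("b", 2), ("a", 3)]
print(reduce_by_key_batch(batch, lambda x, y: x + y))  # [('a', 4), ('b', 2)]
```

Note that reduceByKey() combines values eagerly as it goes, whereas groupByKey() would first gather every value for a key; that is why reduceByKey() is generally preferred when a combining function exists.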

RDD - Resilient Distributed Dataset

  • Abstraction for interacting with data (lazily evaluated, held in memory)
  • An RDD is an immutable, distributed collection of elements split into partitions
  • RDDs can hold elements of any type

Spark Transformations

Produce a new RDD: a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. Transformations are lazy and are not computed until an action requires a result.
  • map
  • flatMap
  • filter

Spark Actions

Execute parallel operations on the cluster and return a result to the driver program. Actions trigger evaluation of the lazy transformations that precede them.
  • reduce
  • collect
  • count
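A plain-Python sketch of what these three actions return (illustrative helpers, not Spark's API; in Spark each would force a distributed computation and ship the result to the driver):

```python
from functools import reduce as _reduce

def rdd_reduce(elements, fn):
    # reduce: repeatedly combine elements with an associative function
    return _reduce(fn, elements)

def rdd_collect(elements):
    # collect: bring all elements back to the driver as a list
    return list(elements)

def rdd_count(elements):
    # count: the number of elements in the dataset
    return len(list(elements))

nums = [1, 2, 3, 4]
print(rdd_reduce(nums, lambda a, b: a + b))  # 10
print(rdd_count(nums))                       # 4
```

collect() should only be used when the result is small enough to fit in the driver's memory; for large results, reduce or write out to storage instead.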

Spark Driver

  • Runs the application's main() function and declares the transformations and actions on the data

Apache Spark popular libraries:

  • Spark SQL:
    • Spark SQL exposes Spark datasets over the JDBC API and allows running SQL-like queries on Spark data using traditional BI and visualization tools. It lets users ETL their data from whatever format it is currently in (such as JSON, Parquet, or a database), transform it, and expose it for ad-hoc querying.
  • Spark MLlib:

    • MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
  • Spark GraphX:

    • GraphX is the new (alpha) Spark API for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing the Resilient Distributed Property Graph: a directed multi-graph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
  • Spark Streaming:

    • Spark Streaming processes real-time streaming data using a micro-batch style of computation. It represents the incoming stream as a DStream, which is essentially a series of RDDs, and processes each batch in turn.
  • Spark Cassandra Connector:

    • There are also integration adapters for other products, such as Cassandra (Spark Cassandra Connector) and R (SparkR). With the Cassandra Connector, you can use Spark to access data stored in a Cassandra database and perform data analytics on that data.
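The micro-batch model behind Spark Streaming can be sketched in plain Python: an unbounded event stream is chopped into small batches, and each batch is processed as an ordinary collection, analogous to one RDD in a DStream. All names here are hypothetical; real DStreams are distributed, fault-tolerant, and driven by a batch interval rather than a fixed count.

```python
def micro_batches(stream, batch_size):
    # Chop an unbounded event stream into fixed-size micro-batches.
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # emit any trailing partial batch

def process_batch(batch):
    # Per-batch computation, analogous to a reduceByKey on one RDD:
    # count word occurrences within the batch.
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

events = ["a", "b", "a", "c", "a", "b"]
results = [process_batch(b) for b in micro_batches(events, 3)]
print(results)  # [{'a': 2, 'b': 1}, {'c': 1, 'a': 1, 'b': 1}]
```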