Apache Spark is a distributed data processing engine for both batch and streaming workloads.
Mapper: maps a row into another by transforming it. It is not very costly.
Shuffle: matches multiple rows (reduce, join...). It is a very costly operation because rows have to be moved between executors.
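To make the cost difference concrete, here is a minimal sketch in plain Python (not the Spark API). The helper names `map_rows`, `shuffle_by_key` and `stable_hash` are made up for illustration, and a list of lists stands in for the partitions. The map step touches each row where it already lives, while the shuffle step has to route rows to the partition that owns their key.

```python
import zlib
from collections import defaultdict

def stable_hash(key):
    # Deterministic hash (Python's built-in hash() of str varies per process).
    return zlib.crc32(repr(key).encode("utf-8"))

def map_rows(partition, fn):
    # Map: each row is transformed independently, nothing leaves the partition.
    return [fn(row) for row in partition]

def shuffle_by_key(partitions):
    # Shuffle: every (key, value) row is sent to the partition owning its key.
    n = len(partitions)
    shuffled = [defaultdict(list) for _ in range(n)]
    moved = 0
    for src, partition in enumerate(partitions):
        for key, value in partition:
            dst = stable_hash(key) % n
            if dst != src:
                moved += 1  # this row would cross the network
            shuffled[dst][key].append(value)
    return shuffled, moved

partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]
# Map step: double every value, partition by partition.
doubled = [map_rows(p, lambda kv: (kv[0], kv[1] * 2)) for p in partitions]
# Shuffle + reduce step: bring equal keys together, then sum them.
shuffled, moved = shuffle_by_key(doubled)
totals = {k: sum(vs) for part in shuffled for k, vs in part.items()}
print(totals, "rows moved:", moved)
```

The `moved` counter is the part that costs: in a real cluster each of those rows is serialized and sent to another machine.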
FlumeJava: run the same code on my local machine for debugging as the distributed code on the cluster. It is also possible to run on my local machine while pulling data from the remote cluster.
DAG: Directed Acyclic Graph
Spark 2.0: Dataset: One important aspect of a Dataset or an RDD is that it is immutable. This helps with caching and simplifies error handling during processing.
RDD:
- Logical model across distributed storage
- Resilient and immutable
- Compile-time type safe
- Structured or unstructured
- Lazy (not materialized until an action is run on them)
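The "immutable" and "lazy" points can be sketched in plain Python. This is not Spark code: `LazyDataset` and `traced_square` are hypothetical names used only to illustrate the idea. Each transformation returns a new object that records what to do, and nothing is computed until the `collect` action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, with recorded (lazy) transformations."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # the lineage of transformations

    def map(self, fn):
        # Returns a new dataset; the current one is left untouched.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now are the recorded steps actually executed.
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

calls = []

def traced_square(x):
    calls.append(x)  # lets us see when the function really runs
    return x * x

base = LazyDataset([1, 2, 3, 4])
squares = base.map(traced_square)
evens = squares.filter(lambda x: x % 2 == 0)
print("before action:", calls)   # still empty, nothing has run
result = evens.collect()
print("after action:", result)
```

Because the steps are kept as a lineage, a lost result can be recomputed from the source, which is the intuition behind "resilient".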
RDDs are a low-level API and do not help Spark optimize the job, because the functions passed to them are opaque to the optimizer.
DataFrame and Dataset are higher-level APIs.
Spark supports dynamic allocation: when it is enabled, Spark requests more executors if the app needs more power.
Skew: happens when the work is not split evenly, so that most of a job ends up on one core. An example using a join: if many rows have a null key,
then all those rows will be processed by the same thread, which becomes the bottleneck. There is always a way to fix skew.
Skew can be fixed by adding a salt to the key and then splitting the job on the hash of the salted key modulo the number of partitions.
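A small simulation in plain Python (no Spark) of how a salt spreads a hot key. The names `partition_sizes`, `stable_hash` and `NUM_PARTITIONS` are made up for this sketch, and two simplifications are assumptions: the salt is assigned round-robin and added to the hash so that the result is deterministic, whereas real jobs usually append a random salt to the key itself.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4

def stable_hash(key):
    # Deterministic hash so the simulation gives the same result on every run.
    return zlib.crc32(repr(key).encode("utf-8"))

def partition_sizes(rows, num_salts=1):
    # Count how many rows land in each partition.
    # num_salts=1 means salt is always 0, i.e. plain hash partitioning.
    sizes = Counter()
    for i, (key, _value) in enumerate(rows):
        salt = i % num_salts  # round-robin salt (assumption, often random)
        sizes[(stable_hash(key) + salt) % NUM_PARTITIONS] += 1
    return [sizes[p] for p in range(NUM_PARTITIONS)]

# 1000 rows share the null key, 10 rows have distinct keys.
rows = [(None, i) for i in range(1000)] + [(f"k{i}", i) for i in range(10)]

unsalted = partition_sizes(rows)
salted = partition_sizes(rows, num_salts=NUM_PARTITIONS)
print("without salt:", unsalted)
print("with salt:   ", salted)
```

Without the salt, all null-key rows fall in a single partition. With it, they are divided across the four partitions. In a join, the other side has to be duplicated once per salt value so that every salted key still finds its match.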
Nested Structures: used to make joins less costly.
The MapReduce paradigm can be used in Spark using the