Wednesday, December 28, 2016

Spark - RDD

First - RDD stands for Resilient Distributed Dataset.

Spark RDD is a distributed collection of data. An RDD is usually created in one of two ways: from external data (a file, data from HDFS) or by distributing a collection of objects (e.g. a List or Set) in the driver program.

Scala code to create RDD:

  1. External data RDD: val lines = sc.textFile("input.txt")
  2. Distributed collection RDD: val nums = sc.parallelize(List(1, 2, 3, 4))
*sc is the SparkContext object
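
The snippets above assume an existing SparkContext. A minimal sketch of creating one locally (assuming spark-core is on the classpath; the app name "rdd-demo" is arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// "local[*]" runs Spark in-process on all cores of this machine,
// which is convenient for trying out the examples in this post.
val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

val lines = sc.textFile("input.txt")          // RDD from external data
val nums  = sc.parallelize(List(1, 2, 3, 4))  // RDD from a driver collection
```

In a spark-shell session, sc is already created for you and this setup step can be skipped.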

Now we have RDDs created in our driver program. Once an RDD is created, we can perform computations on it.
Two kinds of operations can be performed on an RDD:

  • Transformations: Transformations result in new RDDs. Commonly used transformations:
    • flatMap(): applies a function to each element in the RDD and returns the contents of the returned iterators as a new RDD.
    • filter(): returns an RDD containing only the elements that pass the filter condition.
    • map(): returns a new RDD by applying a function to each element in the RDD.
    • distinct(): removes duplicate elements.
    • union(): produces an RDD containing the elements of both RDDs.
  • Actions: Actions are operations that return a value to the driver or write data out. Commonly used actions:
    • collect(): returns all elements in the RDD.
    • count(): returns the number of elements in the RDD.
    • foreach(): iterates over the elements in the RDD.
    • top(num): returns the top num elements from the RDD.
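
Putting the two together, a small sketch (assuming the sc from earlier) that chains a few transformations and then triggers them with actions:

```scala
// Transformations are lazy: nothing runs until an action is called.
val nums    = sc.parallelize(List(1, 2, 2, 3, 4))
val evens   = nums.filter(_ % 2 == 0)   // transformation: keep even numbers
val doubled = evens.map(_ * 2)          // transformation: double each element
val unique  = doubled.distinct()        // transformation: drop duplicates

// Actions trigger the computation and return values to the driver.
println(unique.collect().toList)  // the distinct doubled evens, e.g. List(4, 8)
println(nums.count())             // 5
```

Note that collect() pulls the whole RDD back to the driver, so it is fine for small results like this but should be avoided on large datasets.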
